BugenZhao opened this issue 1 year ago
It can be the streaming version of `explain analyze XXX_NAME_MV_OR_SINK`. This would be really helpful, both for users and for us.
After #10315 and #10417, this feature is generally available for developers in local development. 🎉 Updated guides:

The span of epoch `x` starts when the barrier (with epoch `x`) is about to inject, and ends when the next barrier (with the previous epoch `x`) is fully collected and committed. This includes the whole lifetime of the epoch `x` in the system. There is a symbol (in the span) indicating that the next barrier is injected; therefore, the time from this symbol to the end of the span will be the barrier latency of the next barrier.
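A minimal sketch of these span semantics using the `tracing` crate directly; `inject_barrier` and `collect_and_commit` are hypothetical placeholders, not RisingWave's actual APIs:

```rust
use tracing::{event, info_span, Instrument, Level};

// Hypothetical placeholders, for illustration only.
async fn inject_barrier(_epoch: u64) {}
async fn collect_and_commit(_prev_epoch: u64) {}

async fn trace_epoch(epoch: u64) {
    async {
        // The span opens right before the barrier carrying `epoch` is injected...
        inject_barrier(epoch).await;

        // ... the actual work of this epoch happens here ...

        // The "symbol" in the span: an event marking that the next barrier is
        // injected. The time from this event to the end of the span is the
        // barrier latency of that next barrier.
        event!(Level::INFO, "next barrier injected");

        // ... and the span closes once the next barrier (whose previous epoch
        // is `epoch`) is fully collected and committed.
        collect_and_commit(epoch).await;
    }
    .instrument(info_span!("epoch", epoch))
    .await;
}
```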
To try it, enable Tracing with `risedev configure` and add `use: grafana` and `use: tempo` to the RiseDev profile. After launching, navigate to the `risingwave_traces` dashboard in Grafana and click on the latest trace ID. To use a different tracing backend, point the `RW_TRACING_ENDPOINT` env to its OTLP gRPC server.
A `TracingContext` is introduced in this PR and attached to the `Barrier` proto and other related request bodies.
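For context, such a tracing context can be as small as the W3C trace context serialized into a string map. A minimal sketch of that propagation pattern with `opentelemetry` and `tracing-opentelemetry` (not RisingWave's actual implementation; the carrier field on the proto is assumed):

```rust
use std::collections::HashMap;

use opentelemetry::global;
use tracing::Span;
use tracing_opentelemetry::OpenTelemetrySpanExt;

/// Sender side: capture the current span's context into a string map that can be
/// carried in a proto field (e.g. a hypothetical `map<string, string> tracing_context`
/// on the barrier request).
fn capture_tracing_context() -> HashMap<String, String> {
    let mut carrier = HashMap::new();
    // Requires a global propagator to be registered beforehand,
    // e.g. the W3C `TraceContextPropagator`.
    global::get_text_map_propagator(|propagator| {
        propagator.inject_context(&Span::current().context(), &mut carrier)
    });
    carrier
}

/// Receiver side: restore the remote context and parent the local span to it,
/// so spans on this node join the same trace as the sender's.
fn apply_tracing_context(carrier: &HashMap<String, String>) {
    let remote_cx = global::get_text_map_propagator(|propagator| propagator.extract(carrier));
    Span::current().set_parent(remote_cx);
}
```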
Add `target: "rw_tracing"` to an `event!` to only show it in traces without outputting it into the log (a minimal sketch follows below).

Grafana supports "trace to metrics" and "trace to logs", which enables us to navigate between data in different forms and establish associations for them. We can adopt them to provide better observability in the future.
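A minimal sketch of the `target: "rw_tracing"` trick above: the log (fmt) layer filters that target out, while a tracing-backend layer without this filter would still receive the events. The subscriber setup here is illustrative, not RisingWave's actual one:

```rust
use tracing::{event, Level};
use tracing_subscriber::filter::{LevelFilter, Targets};
use tracing_subscriber::prelude::*;

fn main() {
    // The log layer drops everything under the `rw_tracing` target, so
    // trace-only events never show up on stdout.
    let log_filter = Targets::new()
        .with_default(LevelFilter::INFO)
        .with_target("rw_tracing", LevelFilter::OFF);

    tracing_subscriber::registry()
        .with(tracing_subscriber::fmt::layer().with_filter(log_filter))
        // An OpenTelemetry layer (without the filter above) would be added here
        // so that `rw_tracing` events still reach the tracing backend.
        .init();

    // Shows up in the log.
    event!(Level::INFO, "regular log line");
    // Trace-only: filtered out of the log layer, visible only in traces.
    event!(target: "rw_tracing", Level::INFO, epoch = 233, "barrier injected");
}
```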
See https://github.com/risingwavelabs/risingwave/issues/9905#issuecomment-1554059772 for background.
Recently we're putting effort into improving the stability of RisingWave under high workloads. A common observation is that the barrier latency increases abnormally after some time, possibly due to performance regression of the storage or executor cache as data grows. In this case, we have to spend time investigating the cause of the latency increase and locating the problematic executor.

There's a common technique called "distributed tracing" that tracks an event as it flows through different components of a distributed system, which allows developers to troubleshoot possible issues along the way. Typically, this is designed for ad-hoc requests like batch queries or serving point-gets. However, since we're able to cut the infinite streaming job into the granularity of epochs, we can also treat each epoch as a separate finite event and apply the same technique.

By tracing how the barrier flows through each executor, we can easily check which executor spends a lot of time handling the data in this epoch.
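As an illustration of the idea (not the actual RisingWave code), each executor's per-epoch processing could be wrapped in its own child span under the epoch-level span, so the per-executor time shows up directly in the trace:

```rust
use tracing::{info_span, Instrument};

// Hypothetical stand-in for an executor processing the data of one epoch,
// i.e. everything between two consecutive barriers.
async fn process_until_next_barrier(_executor: &str, _epoch: u64) {}

/// Wrap each executor's work for an epoch in a dedicated span. When these spans
/// are parented to the epoch-level span created at barrier injection, the trace
/// immediately shows which executor spends the most time in this epoch.
/// (In the real dataflow executors run concurrently; a sequential loop keeps the
/// sketch short.)
async fn run_epoch(epoch: u64, executors: &[&str]) {
    for &executor in executors {
        process_until_next_barrier(executor, epoch)
            .instrument(info_span!("executor_epoch", executor, epoch))
            .await;
    }
}
```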