streaming: epoch-level distributed tracing

BugenZhao commented 1 year ago

See https://github.com/risingwavelabs/risingwave/issues/9905#issuecomment-1554059772 for background.

Recently we're putting efforts into improving the stability of RisingWave under a high workload. An observation is that it's common to have the barrier latency increase abnormally after some time, possibly due to performance regression of storage or executor cache as data grows. In this case, we have to spend time investigating the cause of the latency increase and locate the problematic executor.

There's a common technique of "distributed tracing" that tracks an event as it flows through different components of a distributed system, which allows developers to troubleshoot possible issues during that. Typically, this is designed for ah-hoc requests like batch queries or serving point-gets. However, since we're able to cut the infinite streaming job into the granularity of epochs, we can also treat each epoch as a separate finite event to apply it.

By tracing the barrier flows through each executor, we can easily check which executor spends a lot of time handling the data in this epoch.

lmatz commented 1 year ago

It can be the streaming version of explain analyze XXX_NAME_MV_OR_SINK.

fuyufjh commented 1 year ago

This would be really helpful, both for users and for us

BugenZhao commented 1 year ago

After #10315 and #10417, this feature is generally available for developers in local development. 🎉 Updated guides:

Preview

How to read this timeline

Different colors represent different nodes in the cluster.
Click on the span to get more structural information.
The meta span starts when a barrier (with current epoch x) is about to inject, ends when the next barrier (with previous epoch x) is fully collected and committed. This includes the whole lifetime of the epoch x in the system.
There'll be an "info event" (marked as a black | in the span) indicating that the next barrier is injected. Therefore, the time from this symbol to the end of the span will be the barrier latency of the next barrier.
For each actor or executor, the span starts when it is first polled after the barrier is yielded, ends just before it yields the next barrier.

How to enable distributed tracing

For local development, enable Tracing with risedev configure and add use: grafana and use: tempo to the RiseDev profile. After launching, navigate to the risingwave_traces dashboard in Grafana and click on the latest trace ID.

For manual deployments, start any OTLP-compatible tracing service and set RW_TRACING_ENDPOINT env to its OTLP gRPC server.
For cloud deployment, the managed service is a WIP.

How does it work

The core idea is to serialize the tracing span to a trace context and propagate it through the wire. A utility named TracingContext is introduced in this PR.
For standard RPC calls, a middleware can be introduced and do this automatically in the request headers. (not in these PRs)
There's no intuitive client-server hierarchy for barrier propagation. So we put the trace context manually in the Barrier proto and other related request bodies.

How to add more spans or events here

All existing logs (events) will be recorded in traces, following the same filtering configuration as stdout.
Add target: "rw_tracing" to an event! to only show it in traces without outputting it into the log.
To add spans in the scope of the execution, follow the documentation here.
In the future, we may add tracing to meta RPCs and batch queries as well.

Integration with Grafana

Grafana supports "trace to metrics" and "trace to logs", which enables us to navigate between data in different forms and establish associations for them. We can adopt them to provide better observability in the future.

github-actions[bot] commented 2 months ago

This issue has been open for 60 days with no activity.

If you think it is still relevant today, and needs to be done in the near future, you can comment to update the status, or just manually remove the no-issue-activity label.

You can also confidently close this issue as not planned to keep our backlog clean. Don't worry if you think the issue is still valuable to continue in the future. It's searchable and can be reopened when it's time. 😄

risingwavelabs / risingwave