risingwavelabs / risingwave

Best-in-class stream processing, analytics, and management. Perform continuous analytics, or build event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming and batch. PostgreSQL compatible.
https://go.risingwave.com/slack
Apache License 2.0

streaming: epoch-level distributed tracing #10000

Open BugenZhao opened 1 year ago

BugenZhao commented 1 year ago

See https://github.com/risingwavelabs/risingwave/issues/9905#issuecomment-1554059772 for background.

Recently we've been working to improve the stability of RisingWave under high workloads. A common observation is that barrier latency increases abnormally after some time, possibly due to a performance regression in storage or the executor cache as data grows. When this happens, we have to spend time investigating the cause of the latency increase and locating the problematic executor.

"Distributed tracing" is a common technique that tracks an event as it flows through the different components of a distributed system, which lets developers troubleshoot issues along the way. Typically, it is designed for ad-hoc requests like batch queries or serving point-gets. However, since we can cut an infinite streaming job into epochs, we can also treat each epoch as a separate finite event and apply the same technique.

By tracing the barrier as it flows through each executor, we can easily see which executor spends a lot of time handling the data in a given epoch.
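To make the idea concrete, here is a minimal, std-only sketch of treating each epoch as a finite event and asking which executor was slowest within it. It is not RisingWave's actual tracing code: `EpochSpan`, `EpochTrace`, `record`, and `slowest_executor` are hypothetical names invented for illustration, and the sketch assumes each executor reports how long it was busy between receiving an epoch's data and forwarding its barrier.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// One span: time a (hypothetical) executor spent handling a given epoch.
#[derive(Debug, Clone, PartialEq)]
pub struct EpochSpan {
    pub epoch: u64,
    pub executor: String,
    pub busy: Duration,
}

/// Collects spans from all executors. In a real system these would be
/// exported to a tracing backend rather than kept in memory.
#[derive(Default)]
pub struct EpochTrace {
    spans: Vec<EpochSpan>,
}

impl EpochTrace {
    /// Records that `executor` was busy for `busy` while processing `epoch`.
    pub fn record(&mut self, epoch: u64, executor: &str, busy: Duration) {
        self.spans.push(EpochSpan {
            epoch,
            executor: executor.to_string(),
            busy,
        });
    }

    /// Sums the busy time per executor within one epoch and returns the
    /// executor with the largest total, or `None` if the epoch has no spans.
    pub fn slowest_executor(&self, epoch: u64) -> Option<(String, Duration)> {
        let mut totals: HashMap<&str, Duration> = HashMap::new();
        for span in self.spans.iter().filter(|s| s.epoch == epoch) {
            *totals.entry(span.executor.as_str()).or_default() += span.busy;
        }
        totals
            .into_iter()
            .max_by_key(|(_, total)| *total)
            .map(|(name, total)| (name.to_string(), total))
    }
}
```

For example, after recording spans for illustrative executors such as `"hash_join"` and `"materialize"` in epoch 1, calling `slowest_executor(1)` returns the one with the largest accumulated time, which is the kind of question the epoch-level timeline is meant to answer at a glance.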

lmatz commented 1 year ago

It can be the streaming version of explain analyze XXX_NAME_MV_OR_SINK.

fuyufjh commented 1 year ago

This would be really helpful, both for users and for us.

BugenZhao commented 1 year ago

After #10315 and #10417, this feature is generally available for developers in local development. 🎉 Updated guides:

Preview

[Image: preview of the tracing timeline]

How to read this timeline

How to enable distributed tracing

How does it work

How to add more spans or events here

Integration with Grafana

Grafana supports "trace to metrics" and "trace to logs", which let us navigate between data in different forms and correlate them. We can adopt these features to provide better observability in the future.
