risingwavelabs / risingwave

Best-in-class stream processing, analytics, and management. Perform continuous analytics, or build event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming and batch. PostgreSQL compatible.
https://go.risingwave.com/slack
Apache License 2.0

Tracking: Telemetry #16332

Open st1page opened 5 months ago

st1page commented 5 months ago

per-cluster

I believe that we do not need to concern ourselves with the specific locations of the expressions and aggregators used by the users within the plan; rather, we want to know their usage ratio in the product. What we care about is the exact expressions and aggregators that users write in their SQL, not the optimized and rewritten expressions.
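As a rough illustration of the per-cluster idea, the sketch below counts expression and aggregator names as the user wrote them and turns the counts into usage ratios. The function name `usage_ratio` and the sample names are hypothetical and are not taken from the RisingWave codebase.

```python
from collections import Counter


def usage_ratio(function_names):
    """Return each name's share of all recorded uses (empty dict if none)."""
    counts = Counter(function_names)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: n / total for name, n in counts.items()}


# Hypothetical names collected from user SQL, before any optimizer rewrite.
recorded = ["sum", "count", "sum", "lower"]
ratios = usage_ratio(recorded)
```

Because only names and counts are kept, nothing about where an expression sits in the plan needs to be reported at this level.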

per-streaming job

Regarding the analysis of the workload, we need more detailed information rather than simple grouped counts. The impact on the workload is significant depending on whether an aggregation (agg) is placed before or after a join. Therefore, we need to maintain a simple plan tree for each streaming job, with each node containing some telemetry information about itself.
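A minimal sketch of what such a per-job plan tree could look like, assuming a simple node type with a free-form telemetry payload. `PlanNode`, `agg_below_join`, and the operator labels are illustrative names, not the actual RisingWave types. The helper shows the kind of question that grouped counts cannot answer: whether an aggregation feeds into a join.

```python
from dataclasses import dataclass, field


@dataclass
class PlanNode:
    kind: str                                       # operator label, e.g. "HashJoin"
    telemetry: dict = field(default_factory=dict)   # per-node telemetry payload
    children: list = field(default_factory=list)    # input operators


def agg_below_join(node, under_join=False):
    """True if some aggregation node is an input (descendant) of a join."""
    if node.kind == "HashAgg" and under_join:
        return True
    under = under_join or node.kind == "HashJoin"
    return any(agg_below_join(child, under) for child in node.children)


# Aggregation runs first and feeds the join.
agg_first = PlanNode("HashJoin", children=[
    PlanNode("HashAgg", children=[PlanNode("Source")]),
    PlanNode("Source"),
])

# Join runs first and feeds the aggregation.
join_first = PlanNode("HashAgg", children=[
    PlanNode("HashJoin", children=[PlanNode("Source"), PlanNode("Source")]),
])
```

Both plans contain one join and one aggregation, so flat operator counts would report them as identical; only the tree shape tells them apart.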

st1page commented 5 months ago

Request for comments. cc @fuyufjh @tabVersion @chenzl25

fuyufjh commented 5 months ago

Keeping the plan tree, rather than simple numeric metrics such as operator counts, will introduce more complexity in the telemetry backend: it now has to understand the plan tree, and it might need to traverse the tree to get detailed information. I am not sure how much complexity that is or whether it's worth it.

Let me ask a question. Suppose we have to write queries to answer "how many joins per query for a specific user", either on the telemetry backend or in a downstream analysis tool such as Grafana or Metabase. Which would you prefer: storing the plan tree, or flattened numbers, e.g. the number of HashJoins in a query?

st1page commented 5 months ago

Keeping the plan tree, rather than simple numeric metrics such as operator counts, will introduce more complexity in the telemetry backend: it now has to understand the plan tree, and it might need to traverse the tree to get detailed information. I am not sure how much complexity that is or whether it's worth it.

Perhaps a better approach would be to flatten the storage of this tree, storing it within an array and using indices for mutual referencing. This way, we can
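The flattened layout described above could be sketched as follows, assuming the plan arrives as a nested dictionary with `kind` and `children` keys. That input shape and the `flatten` helper are assumptions for illustration only. Each row in the resulting array refers to its children by index, so the tree structure is preserved while simple questions can be answered with a flat scan.

```python
def flatten(tree):
    """Flatten a nested {"kind", "children"} dict into a list of rows.

    Rows are stored in pre-order; each row lists its children as
    indices into the same list.
    """
    rows = []

    def visit(node):
        idx = len(rows)
        rows.append({"kind": node["kind"], "children": []})
        for child in node.get("children", []):
            rows[idx]["children"].append(visit(child))
        return idx

    visit(tree)
    return rows


# Hypothetical plan for one streaming job.
tree = {"kind": "Materialize", "children": [
    {"kind": "HashJoin", "children": [
        {"kind": "HashAgg", "children": [{"kind": "Source"}]},
        {"kind": "Source"},
    ]},
]}

rows = flatten(tree)

# "How many joins in this query" becomes a scan with no tree traversal.
join_count = sum(1 for row in rows if row["kind"] == "HashJoin")
```

With this layout, a count such as the number of HashJoins needs no knowledge of the tree, while the child indices remain available for analyses that do care about operator order.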

tabVersion commented 1 month ago

A summary of the current status

The following items will be carried out in this order