Tracking: Telemetry

st1page commented 5 months ago

per-cluster

I believe that we do not need to concern ourselves with the specific locations of the expressions and aggregators used by the users within the plan; rather, we want to know their usage ratio in the product. What we care about is the exact expressions and aggregators that users write in their SQL, not the optimized and rewritten expressions.

[ ] Count the usage frequency of each type of aggregator used in streaming/batch queries, with statistics aggregated. Please note:
- The same aggregator with and without a DISTINCT or FILTER clause counts as different types of aggregators.
- In RisingWave, aggregators such as AVG, VAR_POP, etc., may be rewritten into other aggregators; we need to count the aggregators before the rewrite.
[ ] Count the usage frequency of each type of function used in streaming/batch queries, with statistics aggregated.
- Count the usage before optimizations (such as constant folding) are applied.
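As a minimal sketch of the per-cluster counting described above, the counter below keys each aggregate call by its name as written in the SQL plus its DISTINCT and FILTER flags, so that `avg`, `avg(DISTINCT ...)` and `avg(...) FILTER (...)` are counted separately. The names `AggUsageKey` and `AggUsageCounter` are hypothetical and do not come from the RisingWave codebase; the sketch assumes `record` is called on the bound expression before any rewrite or optimization.

```rust
use std::collections::HashMap;

/// Hypothetical key: one aggregate call as the user wrote it.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct AggUsageKey {
    /// Aggregate name before any rewrite, e.g. "avg" (not "sum"/"count").
    pub name: String,
    /// Whether the call used DISTINCT.
    pub distinct: bool,
    /// Whether the call had a FILTER clause.
    pub has_filter: bool,
}

/// Hypothetical cluster-wide usage counter.
#[derive(Debug, Default)]
pub struct AggUsageCounter {
    counts: HashMap<AggUsageKey, u64>,
}

impl AggUsageCounter {
    fn key(name: &str, distinct: bool, has_filter: bool) -> AggUsageKey {
        AggUsageKey {
            // Normalize case so "AVG" and "avg" share one bucket.
            name: name.to_ascii_lowercase(),
            distinct,
            has_filter,
        }
    }

    /// Record one occurrence; assumed to be called before the rewrite.
    pub fn record(&mut self, name: &str, distinct: bool, has_filter: bool) {
        *self
            .counts
            .entry(Self::key(name, distinct, has_filter))
            .or_insert(0) += 1;
    }

    /// Number of recorded occurrences for this exact combination.
    pub fn count(&self, name: &str, distinct: bool, has_filter: bool) -> u64 {
        self.counts
            .get(&Self::key(name, distinct, has_filter))
            .copied()
            .unwrap_or(0)
    }
}
```

The same shape would work for scalar functions by dropping the two flags from the key.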

per-streaming job

To analyze the workload, we need more detailed information than simple grouped counts. For example, whether an aggregation (agg) is placed before or after a join has a significant impact on the workload. Therefore, we need to maintain a simple plan tree for each streaming job, with each node containing some telemetry information about itself.

For each streaming job, capture the plan tree without detailed information such as expressions (expr).
- Store attributes for each plan output, including:
  - Whether state is cleaned with a watermark on the join key.
  - Whether state is cleaned with a watermark on an interval condition.
  - Whether it is append-only.
  - (Optional) Whether it is a stream that has been aggregated.
  - (Optional) Whether it has been constrained by a temporal filter.
- Specifically for joins, include the following:
  - Join type.
  - Whether it involves watermark cleanup.
  - Whether it uses interval join state cleaning.
- Specifically for aggregations (agg), include the following:
  - The number of materialized input and value states.
  - The number of distinct keys.
  - Whether it involves watermark cleanup.
  - Whether emit-on-window-close (EOWC) is enabled.

The rules applied in HeuristicOptimizer and the number of times each was applied, which are already maintained in https://github.com/risingwavelabs/risingwave/blob/main/src/frontend/src/optimizer/heuristic_optimizer.rs
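
The per-job plan tree described above could look roughly like the following sketch. All type, variant and field names are hypothetical and chosen only to mirror the attribute list; splitting "materialized input and value states" into two counters is also an assumption. The helper `has_agg_below_join` illustrates the kind of question a tree can answer and a flat count cannot: whether an aggregation sits before a join.

```rust
/// Operator-specific telemetry (hypothetical names, no expressions stored).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum NodeKind {
    Source,
    Join {
        join_type: String,
        watermark_cleanup: bool,
        interval_state_cleaning: bool,
    },
    Agg {
        materialized_input_states: u32,
        value_states: u32,
        distinct_keys: u32,
        watermark_cleanup: bool,
        eowc: bool,
    },
}

/// Attributes of a node's output; `None` means "not collected".
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct OutputAttrs {
    pub watermark_on_join_key: bool,
    pub watermark_on_interval_condition: bool,
    pub append_only: bool,
    pub aggregated: Option<bool>,
    pub temporal_filtered: Option<bool>,
}

/// One node of the simplified plan tree for a streaming job.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PlanNode {
    pub kind: NodeKind,
    pub output: OutputAttrs,
    pub inputs: Vec<PlanNode>,
}

impl PlanNode {
    /// Build a node with default output attributes.
    pub fn new(kind: NodeKind, inputs: Vec<PlanNode>) -> Self {
        PlanNode { kind, output: OutputAttrs::default(), inputs }
    }

    /// True if this node or any descendant is an aggregation.
    fn contains_agg(&self) -> bool {
        matches!(self.kind, NodeKind::Agg { .. })
            || self.inputs.iter().any(|c| c.contains_agg())
    }

    /// True if some join in the tree has an aggregation among its inputs,
    /// i.e. the aggregation runs before the join.
    pub fn has_agg_below_join(&self) -> bool {
        let here = matches!(self.kind, NodeKind::Join { .. })
            && self.inputs.iter().any(|c| c.contains_agg());
        here || self.inputs.iter().any(|c| c.has_agg_below_join())
    }
}
```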

st1page commented 5 months ago

request for comments c.c. @fuyufjh @tabVersion @chenzl25

fuyufjh commented 5 months ago

Keeping the plan tree, rather than simple number metrics such as count of operators, will introduce more complexity in the telemetry backend: it now has to understand the plan tree, and it might need to traverse the tree to get some detailed information. I am not sure how much complexity that is and whether it's worth it.

Let me ask a question. Suppose we have to write some queries to answer "how many joins per query for a specific user", either on the telemetry backend or in a downstream analysis tool such as Grafana or Metabase. Which would you prefer: storing the plan tree, or flat numbers, e.g. the number of HashJoins in a query?

st1page commented 5 months ago

> Keeping the plan tree, rather than simple number metrics such as count of operators, will introduce more complexity in the telemetry backend: it now has to understand the plan tree, and it might need to traverse the tree to get some detailed information. I am not sure how much complexity that is and whether it's worth it.

Perhaps a better approach would be to flatten the tree for storage: store the nodes in an array and let them reference each other by index. This way, we can

- preserve the original data structure for when we really need it,
- store certain statistical data for each operator, and
- quickly obtain simpler data, when that is all we need, by running aggregations on the telemetry backend.
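
A minimal sketch of this flattened layout, assuming hypothetical names (`OpKind`, `FlatNode`, `FlatPlan`) and an assumed convention that inputs are pushed before the operators that consume them: nodes live in one array, edges are indices into it, and a flat metric such as "number of HashJoins" is a plain scan that never has to walk the tree.

```rust
/// Hypothetical operator kinds; a real plan would have many more.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OpKind {
    Source,
    HashJoin,
    HashAgg,
}

/// One plan node; `inputs` holds indices into `FlatPlan::nodes`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FlatNode {
    pub kind: OpKind,
    pub inputs: Vec<usize>,
}

/// A plan tree flattened into an array (arena) of nodes.
#[derive(Debug, Default)]
pub struct FlatPlan {
    pub nodes: Vec<FlatNode>,
}

impl FlatPlan {
    /// Append a node and return its index. Inputs must already exist,
    /// so children are always stored before their parent.
    pub fn push(&mut self, kind: OpKind, inputs: Vec<usize>) -> usize {
        assert!(inputs.iter().all(|&i| i < self.nodes.len()));
        self.nodes.push(FlatNode { kind, inputs });
        self.nodes.len() - 1
    }

    /// Flat metric: count operators of one kind with a linear scan.
    pub fn count_kind(&self, kind: OpKind) -> usize {
        self.nodes.iter().filter(|n| n.kind == kind).count()
    }

    /// Tree structure is still recoverable through the indices.
    pub fn inputs_of(&self, idx: usize) -> &[usize] {
        &self.nodes[idx].inputs
    }
}
```

With this layout, the "how many joins per query" question becomes `plan.count_kind(OpKind::HashJoin)`, while `inputs_of` keeps the original shape available for deeper analysis.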

github-actions[bot] commented 1 month ago

This issue has been open for 60 days with no activity.

If you think it is still relevant today, and needs to be done in the near future, you can comment to update the status, or just manually remove the no-issue-activity label.

You can also confidently close this issue as not planned to keep our backlog clean. Don't worry if you think the issue is still valuable to continue in the future. It's searchable and can be reopened when it's time. 😄

tabVersion commented 1 month ago

A summary of the current status:

After https://github.com/risingwavelabs/risingwave/pull/17486, we support tracking feature usage for source and sink connectors and for recovery.
- Usage is collected when the creation logic is reached on each compute node; the format/encode info is collected as well.

The remaining items will be done in the following order:

1. Applied rules and applied times in HeuristicOptimizer
   - Conclusion reached: we can report this info once the streaming job is up and has obtained its own catalog_id.
2. Capture the plan tree
   - This item does not have a clear design yet. Per the description above, the attributes involved come from different steps of the planning stage, so more preparatory work may be needed.

risingwavelabs / risingwave

Tracking: Telemetry #16332
