starburstdata / dbt-trino

The Trino (https://trino.io/) adapter plugin for dbt (https://getdbt.com)
Apache License 2.0

Solving for large stage depths #370

Open wjhrdy opened 8 months ago

wjhrdy commented 8 months ago

Describe the feature

Certain dbt models that are combinations of views that have not themselves been materialized put significant strain on the query planner and result in many stages being created. The stage depth limit in Starburst is currently set to 150, which means a developer or user can quickly hit this limit when joining or selecting from a composite data product.

Some mechanism to identify nodes that cause large explosions in the query plan would give developers a heuristic for when to consider materializing a model.

How this could be implemented:

The reason I am filing this here is that the issue is specific to using dbt's features with Starburst. I don't know whether it can be solved in this package, however.

Describe alternatives you've considered

There is a simple way to estimate model stage complexity, but it is a very rough estimate because it assumes all models are equally dense.

Here is an excerpt from my justfile:

# list the number of model dependencies for each model in a folder
estimate-model-complexity directory:
    #!/usr/bin/env bash
    declare -A lines
    for file in "{{directory}}"/*.sql; do
        filename=$(basename "$file" .sql)
        # select the model plus everything upstream of it
        output=$({{dbt}} list -m "+$filename")
        line_count=$(echo "$output" | wc -l)
        # subtract dbt's non-model log lines from the count
        line_count=$((line_count - 4))
        lines["$filename"]=$line_count
    done
    for key in "${!lines[@]}"; do
        echo "$key: ${lines[$key]}"
    done | sort -rn -t: -k2

Who will benefit?

Anyone trying to leverage the modular nature of dbt with Starburst views and figure out which views are best to materialize.

Are you willing to submit a PR?

hovaesco commented 8 months ago

The stage depth in starburst is currently set to 150

This could be changed using https://trino.io/docs/current/admin/properties-query-management.html#query-max-stage-count. What is a sample stage count you are getting?
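Per the linked documentation, that limit is a coordinator configuration property; a hedged example of raising it (the value 300 is illustrative, not a recommendation):

```
# etc/config.properties
# Raise the maximum number of stages per query (default: 150)
query.max-stage-count=300
```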

Use this to build a graph of stages, find inflection points of stage accumulation or when near limit.

How would you build something like that, and how would you use it?
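As a minimal sketch of what "find inflection points of stage accumulation" could mean, assuming a per-model stage count is available (mocked here; in practice it could come from plan inspection), one could flag models whose stage count jumps sharply relative to their direct upstream dependencies. All names and numbers below are illustrative:

```python
# Hedged sketch: flag dbt models whose estimated stage count jumps sharply
# relative to their direct upstream dependencies. `deps` and `stages` are
# mock data, not real dbt or Trino output.

def find_inflection_points(deps, stages, ratio=2.0):
    """Return models whose stage count exceeds `ratio` times the largest
    stage count among their direct upstream dependencies."""
    flagged = []
    for model, upstream in deps.items():
        if not upstream:
            continue  # source/staging models have nothing to compare against
        parent_max = max(stages[u] for u in upstream)
        if stages[model] > ratio * parent_max:
            flagged.append(model)
    return flagged

# Mock DAG: two staging models feed orders_enriched, which feeds report.
deps = {
    "stg_orders": [],
    "stg_customers": [],
    "orders_enriched": ["stg_orders", "stg_customers"],
    "report": ["orders_enriched"],
}
stages = {"stg_orders": 3, "stg_customers": 3, "orders_enriched": 40, "report": 45}

print(find_inflection_points(deps, stages))  # prints ['orders_enriched']
```

Here `orders_enriched` is flagged because it jumps from 3 upstream stages to 40, while `report` adds only a few stages on top of it, which matches the intuition that the flagged node is the best materialization candidate.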

wjhrdy commented 8 months ago

We are seeing stage counts of around 250 prior to materialization.

wseaton commented 8 months ago

@hovaesco here's an example of a dev tool we've made internally: https://gist.github.com/wseaton/0e0cfecb7421bbba6792222e8fec7cf6.

There are some assumptions that it makes about the plan output that might not hold true in practice (the bit about "virtual" stages, for one), but it's been helpful for us in directionally identifying nodes in the dbt graph that contribute to sudden explosions in complexity.

It would be helpful if we could get the stage count directly from the query planner, and/or more insight into how the plan is generated.
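Absent direct planner support, one rough approximation is to count fragments in the text output of Trino's `EXPLAIN (TYPE DISTRIBUTED)`, since each plan fragment corresponds to a stage. The sample plan text below is abbreviated and illustrative:

```python
# Hedged sketch: approximate the stage count of a query by counting
# "Fragment N" headers in EXPLAIN (TYPE DISTRIBUTED) output.
import re

def count_fragments(explain_output: str) -> int:
    """Count lines beginning with 'Fragment <n>' in a distributed plan."""
    return len(re.findall(r"^Fragment \d+", explain_output, flags=re.MULTILINE))

# Abbreviated mock of a distributed plan, not real Trino output.
sample = """Fragment 0 [SINGLE]
    Output layout: [...]
Fragment 1 [HASH]
    ...
Fragment 2 [SOURCE]
    ...
"""

print(count_fragments(sample))  # prints 3
```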

I'm also trying to think of ways to surface this to users more easily; I'm not sure what makes the most sense. For now, a rendered graph seems to work. We are thinking about embedding this in CI tooling, or potentially running it periodically as a diagnostic tool.