EXPLAIN & EXPLAIN ANALYZE PromQL features

bwplotka commented 1 year ago

Hi,

Given the higher complexity of the new PromQL engine we introduce and the fact we are building a framework of optimizers, we would like to have better visibility on how query was optimized and performed. For this, I propose two additional commands/parameters to the query and query_range APIs:

1. Explain

Instead of execution, dry run the query planning, and print the BEFORE and AFTER optimization execution plans.

On PromCon and KubeCon, we demonstrated this as the PromQL keyword:

Dev branch: https://github.com/thanos-io/thanos/pull/5822

I think we can iterate over the text representation and what it shows, but I would love to have an agreement on the API side of things.

2. Analyze

In this mode, I would propose to execute the query and print a similar explanation as explained, just with profile-like information about the operators involved towards important numbers like accumulated latency of each operator's step, processed samples and series. Similar to MySQL's https://dev.mysql.com/blog-archive/mysql-explain-analyze/.

Also similar to PromLens a little bit, just more execution plan-centric, not PromQL statement centric (this is because logical and physical optimizers might optimize PRomQL transparently for efficiency).

Implementation

Input

In my demo implementation, the code looks for the prefix explain before parsing PromQL. I think it's easy to type and play with, but it has many issues. It conflicts with a metric called "explain" (probably not a big issue), and generally, none for tools like formatters and linters supports it (etc.), so we would need to change in PromQL language itself - this would be hard as new PromQL is not yet heavily adopted, and it might be too much to change everyone's PromQL parsers.

Instead I would propose an extra API parameter as we have for other Thanos special things (https://thanos.io/tip/components/query.md/#query-api-overview). e.g.

mode=explain for explain mode so info about optimizations and physical plan
mode=analyze for analyze (profiling) for physical plan
mode=explain&mode=analyze for explain analyze so profiling and information about optimizations together.

Later on we could add explain_formatting parameters for changing formatting etc..

Output

In my demo I used Warnings API in both UI and backend to send formatted string. It works great with current UIs, and it might even work with Grafana at some point (there is some Errors/Warning API). It is hacky however, so ideally we could have dedicated response type in JSON.

~However, I believe using Warnings at the start it's not too bad - it's easy to develop new response API in the future. I would propose sticking to that for now.~

After good comments, let's propose adding a new field explain in response JSON, to not conflict with warnings.

Thoughts/ideas? 🤗

GiedriusS commented 1 year ago

:+1: I'm all up for it just 2 thoughts:

I wonder how much work it would be to use a different field other than warnings for this output :thinking: maybe it would be easy to add support for more fields to the UI? With a separate field for this maybe we could also add extra UI elements to make it more readable/accessible - for example, we could add the ability to hide some subtrees (https://dowjones.github.io/react-dropdown-tree-select/#/story/with-bootstrap-styles)? Of course this is for the v2 of this functionality but still something to think about nonetheless
Perhaps it would be time to unhide the --query.promql-engine parameter? I've been using the new engine for a few weeks now and it works very well :smile:

saswatamcode commented 1 year ago

The URL parameter makes sense. I think explain in PromQL might be difficult at this stage, but maybe at some point we can upstream certain things. 🙂

For output, I think having a thanos_info field might make sense for Thanos APIs? That way we comply with Prometheus HTTP response format, but we can still experiment with extra data and parsing in UI.

For analyze, would we output some .pg.gz file alongside HTTP output?

bwplotka commented 1 year ago

I wonder how much work it would be to use a different field other than warnings for this output thinking maybe it would be easy to add support for more fields to the UI? With a separate field for this maybe we could also add extra UI elements to make it more readable/accessible - for example, we could add the ability to hide some subtrees (https://dowjones.github.io/react-dropdown-tree-select/#/story/with-bootstrap-styles)? Of course this is for the v2 of this functionality but still something to think about nonetheless

Separate field is easy to add, we can do that, what's the argument against warnings? Wrong purpose?

Perhaps it would be time to unhide the --query.promql-engine parameter? I've been using the new engine for a few weeks now and it works very well smile

+1

For analyze, would we output some .pg.gz file alongside HTTP output?

I would love that, yes. Perhaps explain_formatter=pprof field when mode analyze is used?

saswatamcode commented 1 year ago

Separate field is easy to add, we can do that, what's the argument against warnings? Wrong purpose?

I think wrong purpose and to avoid cases, where explain/analyze output and actual warning from query, might clash? We can show both in the same way in UI somehow, but I'd really like to have separate boxes (for eg blue warn, green info). WDYT? 🙂

I would love that, yes. Perhaps explain_formatter=pprof field when mode analyze is used?

+1

bwplotka commented 1 year ago

Updated proposal to have explain field

matej-g commented 1 year ago

Love this idea and the suggestions, I think a URL parameter is the way to go. As I assume this is mostly useful for distributed PromQLs, it would be overkill to change the parser for this (and probably would not be easily accepted upstream). Given that warnings pattern is well established in Thanos, I think this is very similar and fits perfectly for the use case. As @saswatamcode I can imagine in the future few more parameters on top, perhaps also allow choosing formatting (plain text vs JSON) etc.

vanugrah commented 1 year ago

Hey folks 👋 - A lot of exciting work going on here. Is there anywhere I can read/learn more about the new PromQL engine?

GiedriusS commented 1 year ago

@PradyumnaKrishna has recently added support for:

Switching between engines using the user interface;
Implemented Explain() method on the query type inside of the new engine.

Now we are working on adding support for exposing Explain() through the UI: https://github.com/thanos-io/thanos/pull/6346

It looks so good! And it's a solid foundation for future work.

Some next ideas:

Show how much memory has been allocated in each operator and the CPU time taken by each operator;
Expose series gateway stats for each operator and for each StoreAPI individually using data from https://github.com/thanos-io/thanos/pull/6352;
Show flame graphs of CPU/memory usage by each function? i.e. turn on profiling for the goroutine that is spawned by the HTTP request and then expose that;
Show the execution plan before and after optimization.

farodin91 commented 1 year ago

Integration of Promlens would be also awesome which is also doing some performance analysis.

thanos-io / thanos