pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Add physical plan to `explain` #10936

Open universalmind303 opened 10 months ago

universalmind303 commented 10 months ago

Problem description

Currently, `explain` only contains information about the LogicalPlan. But the logical plan leaves out a lot of information, such as whether the query uses the streaming engine or the default engine. It would be great to be able to see the physical plan when running `explain`.

This would help users make more informed decisions about how their queries affect performance and which ones can or can't be performed out of core (OOC).

For example, the query:

〉explain select * from read_csv('/examples/datasets/foods1.csv');
┌───────────────────────────────────┐
│ Logical Plan                      │
│ ---                               │
│ str                               │
╞═══════════════════════════════════╡
│ FAST_PROJECT: [category, calorie… │
│                                   │
│     CSV SCAN …                    │
│     PROJECT */4 COLUMNS           │
└───────────────────────────────────┘

provides no information about what the end physical plan will be.

For reference, DataFusion provides both the LogicalPlan and the PhysicalPlan when running explain:


┌───────────────┬─────────────────────────────────────────────────────────────────────────┐
│ plan_type     │ plan                                                                    │
│ ──            │ ──                                                                      │
│ Utf8          │ Utf8                                                                    │
╞═══════════════╪═════════════════════════════════════════════════════════════════════════╡
│ logical_plan  │ TableScan: csv_scan projection=[category, calories, fats_g, sugars_g]   │
│ physical_plan │ CsvExec: file_groups={1 group: [[]]}, projection=[category, calories,   │
│               │ fats_g, sugars_g], has_header=true                                      │
│               │                                                                         │
└───────────────┴─────────────────────────────────────────────────────────────────────────┘

ritchie46 commented 10 months ago

We also don't guarantee anything. We can switch engines ad hoc.

universalmind303 commented 10 months ago

But wouldn't most physical plans be deterministic? Especially for ones without the streaming engine enabled?

For streaming/hybrid plans, we'd at least be guaranteed to know the first n operations, up to the point where we reach a branch that could switch the engine.

Admittedly, I'm not super familiar with the newer streaming engine, but wouldn't some of the conditional branches that cause a fallback to the default engine be based on file statistics or other metadata we could examine up front? Or do they rely on information that can only be obtained at execution time?

Maybe it'd be simplest to start with the default engine, as it seems likely to be much more deterministic than the OOC/hybrid engine.

ritchie46 commented 10 months ago

The default engine samples on materialized dataframes and chooses different physical branches based on those samples. This is more of a JIT physical branching.
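To make the idea of sampling-based branching concrete, here is a minimal sketch. It is not taken from the Polars code base: the operation (a group-by), the `GroupByStrategy` variants, the sample size, and the threshold are all hypothetical. It only illustrates why such a choice cannot be read off the logical plan: the branch depends on the data that has already been materialized.

```rust
use std::collections::HashSet;

// Hypothetical strategies, invented for illustration; not Polars internals.
#[derive(Debug, PartialEq)]
enum GroupByStrategy {
    // Assumed to suit many distinct keys.
    HashPartitioned,
    // Assumed to suit few distinct keys.
    SingleTable,
}

// Look at a prefix sample of the materialized key column and pick a branch
// from the share of distinct values. The sample size (1000) and the 50%
// threshold are arbitrary assumptions.
fn choose_strategy(keys: &[i64]) -> GroupByStrategy {
    let sample = &keys[..keys.len().min(1000)];
    let distinct: HashSet<&i64> = sample.iter().collect();
    if distinct.len() * 2 <= sample.len() {
        GroupByStrategy::SingleTable
    } else {
        GroupByStrategy::HashPartitioned
    }
}

fn main() {
    // Same logical operation, different physical branch depending on the data.
    println!("{:?}", choose_strategy(&[1, 1, 1, 1]));
    println!("{:?}", choose_strategy(&[1, 2, 3, 4]));
}
```

Under these assumptions, two runs of the same query over different inputs can take different branches, which is why any plan printed ahead of execution would only be an estimate.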

We could of course mark those nodes, but that information isn't on the logical plan's nodes, so this would be a manual mapping that will get outdated, which I am not in favor of.

An estimated physical plan could be added though, by implementing Display for the physical nodes.
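As a rough sketch of what implementing `Display` for physical nodes might look like: the `PhysNode` enum and the node names below are hypothetical stand-ins, not Polars' actual executor types. Only the use of the standard `std::fmt::Display` trait is taken as given.

```rust
use std::fmt;

// Hypothetical physical nodes; these are NOT Polars' real executor types.
enum PhysNode {
    CsvScan { path: String, columns: Vec<String> },
    Filter { predicate: String, input: Box<PhysNode> },
    Projection { columns: Vec<String>, input: Box<PhysNode> },
}

impl PhysNode {
    // Write this node on one line, then its input indented one level deeper.
    fn fmt_indented(&self, f: &mut fmt::Formatter<'_>, depth: usize) -> fmt::Result {
        let pad = "  ".repeat(depth);
        match self {
            PhysNode::CsvScan { path, columns } => {
                writeln!(f, "{pad}CsvScanExec: path={path}, projection=[{}]", columns.join(", "))
            }
            PhysNode::Filter { predicate, input } => {
                writeln!(f, "{pad}FilterExec: {predicate}")?;
                input.fmt_indented(f, depth + 1)
            }
            PhysNode::Projection { columns, input } => {
                writeln!(f, "{pad}ProjectionExec: [{}]", columns.join(", "))?;
                input.fmt_indented(f, depth + 1)
            }
        }
    }
}

impl fmt::Display for PhysNode {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        self.fmt_indented(f, 0)
    }
}

fn main() {
    let plan = PhysNode::Projection {
        columns: vec!["category".into(), "calories".into()],
        input: Box::new(PhysNode::Filter {
            predicate: "calories > 100".into(),
            input: Box::new(PhysNode::CsvScan {
                path: "/examples/datasets/foods1.csv".into(),
                columns: vec!["category".into(), "calories".into()],
            }),
        }),
    };
    // Prints the estimated plan as an indented tree, root first.
    print!("{plan}");
}
```

Because the text is produced by the physical nodes themselves, there is no separate mapping from logical nodes to maintain. The output would still be an estimate, since the engine may pick a different branch at run time.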

universalmind303 commented 10 months ago

Could you point me to some examples of this JIT branching?

> An estimated physical plan could be added though, by implementing Display for the physical nodes.

I'll see if I can come up with something for this. I think that, as with the LogicalPlan, we should be able to do Graphviz output as well as the textual display.

itamarst commented 8 months ago

Some of this information, in particular whether some operations are streaming, seems like it could be displayed without much work? E.g. FunctionNode has an is_streamable() function, so you could include the output of that in the explanation. Or am I misunderstanding?
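A minimal sketch of that suggestion, assuming each plan node can answer `is_streamable()`. The `Node` struct, the `explain_with_streaming` helper, and the flag values below are hypothetical; only the method name comes from the comment above.

```rust
// Hypothetical stand-in for a plan node. Only the method name `is_streamable`
// comes from the discussion; the struct and its fields are invented.
struct Node {
    name: &'static str,
    streamable: bool,
}

impl Node {
    fn is_streamable(&self) -> bool {
        self.streamable
    }
}

// Render one line per node, prefixed with its (assumed) streaming status.
fn explain_with_streaming(nodes: &[Node]) -> String {
    nodes
        .iter()
        .map(|n| {
            let tag = if n.is_streamable() { "STREAMING" } else { "IN-MEMORY" };
            format!("[{tag}] {}", n.name)
        })
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    // The flag values here are made up for the example.
    let plan = [
        Node { name: "FAST_PROJECT", streamable: true },
        Node { name: "CSV SCAN", streamable: false },
    ];
    println!("{}", explain_with_streaming(&plan));
}
```

This only reports what a node declares about itself. It would not capture engine switches that are decided during execution.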