Open universalmind303 opened 10 months ago
We also don't guarantee anything. We can switch engine ad-hoc.
But wouldn't most physical plans be deterministic? Especially for ones without the streaming engine enabled?
For streaming/hybrid plans, we'd at least be guaranteed to know `n` operations until we reach a branch that could switch the engine.
Admittedly, I'm not super familiar with the newer streaming engine, but wouldn't some of those conditional branches causing fallback to default be based off of the file statistics or other metadata we could examine up front? Or do they rely on information that can only be obtained at time of execution?
Maybe it'd be simplest to just start with the default engine, as that seems likely to be much more deterministic than the OOC/hybrid engine.
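To make the "known prefix" idea concrete, here is a minimal sketch. It assumes a plan can be flattened into a list of steps and that each step can report whether it might switch engines. `Step`, `may_switch_engine`, and `known_prefix` are hypothetical names for illustration, not Polars types.

```rust
/// Hypothetical plan step: a name plus whether execution might switch
/// engines at this point. Not a Polars type.
struct Step {
    name: &'static str,
    may_switch_engine: bool,
}

/// Return the leading steps that are known ahead of execution, i.e.
/// everything before the first step that may switch engines.
fn known_prefix(steps: &[Step]) -> &[Step] {
    let n = steps
        .iter()
        .position(|s| s.may_switch_engine)
        .unwrap_or(steps.len());
    &steps[..n]
}
```

An explain output could print the steps in `known_prefix` as certain and mark everything after them as an estimate.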
The default engine samples on materialized dataframes and chooses different physical branches based on those samples. This is more of a JIT physical branching.
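As a purely illustrative sketch of what sample-based branching can look like (this is not Polars code; the strategy names, the sampling scheme, and the 0.5 threshold are all assumptions), a group-by could pick its physical strategy from the share of distinct keys in a sample of already materialized data:

```rust
use std::collections::HashSet;

/// Hypothetical physical strategies for a group-by; not Polars types.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum GroupByStrategy {
    /// Few distinct keys expected: one shared hash table is cheap.
    SingleTable,
    /// Many distinct keys expected: partition the keys first.
    Partitioned,
}

/// Pick a strategy by inspecting the first `sample_size` keys of data
/// that is already materialized. The 0.5 threshold is an arbitrary
/// assumption made for this sketch.
fn choose_strategy(keys: &[u64], sample_size: usize) -> GroupByStrategy {
    let sample = &keys[..keys.len().min(sample_size)];
    if sample.is_empty() {
        return GroupByStrategy::SingleTable;
    }
    let distinct: HashSet<u64> = sample.iter().copied().collect();
    let ratio = distinct.len() as f64 / sample.len() as f64;
    if ratio > 0.5 {
        GroupByStrategy::Partitioned
    } else {
        GroupByStrategy::SingleTable
    }
}
```

Because the choice depends on the data itself, the branch taken cannot be read off the logical plan before execution.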
We could of course mark those nodes, but that information isn't on the logical plan nodes, so it would get outdated and would have to be a manual mapping, which I am not in favor of.
An estimated physical plan could be added though, by implementing `Display` for the physical nodes.
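A minimal sketch of that idea, assuming a simplified tree of physical nodes. The `PhysicalNode` enum and its variants are hypothetical stand-ins, not the actual executor types; only `std::fmt::Display` is real.

```rust
use std::fmt;

/// Hypothetical physical plan node; the variants are illustrative only.
enum PhysicalNode {
    Scan { path: String },
    Filter { predicate: String, input: Box<PhysicalNode> },
    Sort { by: String, input: Box<PhysicalNode> },
}

impl PhysicalNode {
    /// Write this node and its inputs, indenting two spaces per level.
    fn fmt_indented(&self, f: &mut fmt::Formatter<'_>, depth: usize) -> fmt::Result {
        let pad = "  ".repeat(depth);
        match self {
            PhysicalNode::Scan { path } => writeln!(f, "{}SCAN {}", pad, path),
            PhysicalNode::Filter { predicate, input } => {
                writeln!(f, "{}FILTER {}", pad, predicate)?;
                input.fmt_indented(f, depth + 1)
            }
            PhysicalNode::Sort { by, input } => {
                writeln!(f, "{}SORT BY {}", pad, by)?;
                input.fmt_indented(f, depth + 1)
            }
        }
    }
}

impl fmt::Display for PhysicalNode {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        self.fmt_indented(f, 0)
    }
}
```

With this in place, `plan.to_string()` or `println!("{}", plan)` yields one line per node, with inputs indented under their consumers.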
Could you point me to some examples of this JIT branching?
> An estimated physical plan could be added though, by implementing `Display` for the physical nodes.
I'll see if I can come up with something for this. I think that, similarly to the `LogicalPlan`, we should be able to produce Graphviz output as well as the textual display.
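For the Graphviz side, a rough sketch of emitting DOT text from a plan tree. `PlanNode`, `to_dot`, and `write_node` are hypothetical names; a real version would walk the actual physical nodes.

```rust
/// Hypothetical plan node: a label plus its input nodes.
struct PlanNode {
    label: String,
    inputs: Vec<PlanNode>,
}

/// Render the plan as Graphviz DOT text, numbering nodes in pre-order.
fn to_dot(root: &PlanNode) -> String {
    let mut out = String::from("digraph plan {\n");
    let mut next_id = 0;
    write_node(root, &mut next_id, &mut out);
    out.push_str("}\n");
    out
}

/// Emit one node, then its inputs, with edges pointing from each input
/// to its consumer. Returns the id assigned to `node`.
fn write_node(node: &PlanNode, next_id: &mut usize, out: &mut String) -> usize {
    let id = *next_id;
    *next_id += 1;
    // Escape backslashes and double quotes so labels stay valid DOT strings.
    let label = node.label.replace('\\', "\\\\").replace('"', "\\\"");
    out.push_str(&format!("  n{} [label=\"{}\"];\n", id, label));
    for input in &node.inputs {
        let child = write_node(input, next_id, out);
        out.push_str(&format!("  n{} -> n{};\n", child, id));
    }
    id
}
```

The resulting string can be written to a file and rendered with the `dot` command-line tool.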
Some of this information, in particular whether some operations are streaming, seems like it could be displayed without much work? E.g. `FunctionNode` has an `is_streamable()` function, so you could include the output of that in the explanation. Or am I misunderstanding?
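Something along these lines, as a sketch only. `Op` is a hypothetical stand-in that just mirrors the `is_streamable()` name mentioned above, and the output format is invented for illustration.

```rust
/// Stand-in for a plan operation. Only the `is_streamable` name comes
/// from the discussion; the struct itself is hypothetical.
struct Op {
    name: &'static str,
    streamable: bool,
}

impl Op {
    fn is_streamable(&self) -> bool {
        self.streamable
    }
}

/// One line per operation, tagged with whether it could stream.
fn explain(ops: &[Op]) -> String {
    ops.iter()
        .map(|op| {
            let tag = if op.is_streamable() { "streaming" } else { "in-memory" };
            format!("{} [{}]", op.name, tag)
        })
        .collect::<Vec<_>>()
        .join("\n")
}
```

This only reports what each operation could do in isolation; it says nothing about which engine is actually picked at run time.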
Problem description
Currently, `explain` only contains information about the `LogicalPlan`. But the logical plan leaves out a lot of information, such as whether the query is using the streaming engine or the default engine. It would be great if you were able to see the physical plan when running `explain`. This would allow users to make more informed decisions about how their queries affect performance and which ones can or can't be performed OOC.
For example, a query of:
provides no information about what the final physical plan will be.
For reference, DataFusion provides both the `LogicalPlan` and the `PhysicalPlan` when running `explain`.