pola-rs / tpch

MIT License

Improve pandas queries #56

Closed: phofl closed this 4 months ago

phofl commented 1 year ago

This gives a nice improvement for the pandas queries. I couldn't test them, since testing is already broken on main.

See https://medium.com/gitconnected/benchmarking-pandas-against-polars-from-a-pandas-pov-554416a863db for some comparisons

ritchie46 commented 1 year ago

Hi @phofl. Thanks for the improvements. However, I do think we need to add this as a new class: changing the queries with insights on cardinality, like you do with the is_in filter before joining, is very much hand-optimized and a very unfair advantage over tools that follow the TPC-H rules, e.g. Spark, DuckDB (and Polars to a lesser extent, because we don't run SQL, but translate 1:1 to the Polars API).

TPC-H states that you must run the queries as is. So we cannot change join ordering, apply filters in other locations, etc. Now, I understand that pandas doesn't have a query optimizer, so this part is hard.

Maybe we should add an extra benchmark run for hand-optimized queries, but then Polars and Spark should also be allowed to do this for the comparison to make sense.

alexander-beedie commented 1 year ago

Interesting! I think @ritchie46's point about overly hand-tuning things (specifically in relation to how TPC-H is supposed to be run) is valid, as none of the other backends do this (even ours, which would also benefit from join sorting/reordering), but I also think a number of these optimisations are both very fair and should be integrated 😎

For instance, what looks to be the single most effective optimisation is pushing column selection into the initial parquet read, and I'd definitely support patching that into the benchmarks, which (if I'm reading the graphs in the article correctly) would hugely improve the current timings. Enabling CoW also sounds completely reasonable (together this looks like a 2x gain, or more, for most queries?).
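For reference, a minimal sketch of those two changes (the file name and column list here are illustrative, not the benchmark's actual query code):

```python
import pandas as pd

# Copy-on-Write avoids defensive copies in chained operations
# (opt-in on pandas 2.x, the default from pandas 3.0).
pd.set_option("mode.copy_on_write", True)

# Before: read everything, then select; the full table is materialised.
lineitem = pd.read_parquet("lineitem.parquet")
lineitem = lineitem[["l_orderkey", "l_extendedprice", "l_discount"]]

# After: push the column selection into the read itself, so only the
# needed columns are decoded from the parquet file.
lineitem = pd.read_parquet(
    "lineitem.parquet",
    columns=["l_orderkey", "l_extendedprice", "l_discount"],
)
```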

"Benchmarks are hard" - thanks for applying your expertise to improving them!

phofl commented 1 year ago

> TPC-H states that you must run the queries as is. So we cannot change join ordering, apply filters in other locations, etc. Now, I understand that pandas doesn't have a query optimizer, so this part is hard.

So, summarising: if I add it under the hood it's fine, but if I do it manually it's a problem? (This is not supposed to be sarcastic; I am curious, because such an optimization is relatively easy to do.)

I agree that the third optimisation step violates these rules, but the first two definitely don't; that is just moving two statements that were previously applied after read_parquet into the read_parquet step itself.
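Concretely, the change in question looks something like this (a sketch; the exact column list and date predicate are illustrative):

```python
import datetime
import pandas as pd

# Before: two separate statements after the read.
lineitem = pd.read_parquet("lineitem.parquet")
lineitem = lineitem[["l_shipdate", "l_extendedprice", "l_discount"]]
lineitem = lineitem[lineitem["l_shipdate"] <= datetime.date(1998, 9, 2)]

# After: the same projection and predicate, handed to read_parquet so
# the reader can skip columns and prune row groups via their statistics.
lineitem = pd.read_parquet(
    "lineitem.parquet",
    columns=["l_shipdate", "l_extendedprice", "l_discount"],
    filters=[("l_shipdate", "<=", datetime.date(1998, 9, 2))],
)
```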

ritchie46 commented 1 year ago

> So, summarising: if I add it under the hood it's fine, but if I do it manually it's a problem? (This is not supposed to be sarcastic; I am curious, because such an optimization is relatively easy to do.)

Projection/predicate pushdown

Well, for projection and predicate pushdown, I think that is a fair hand-optimization. It is not completely in line with the benchmark, but it does reflect what a user should do, and I think it is fair to use it for tools that don't support it themselves. We can add that to the description of the benchmark.

It is important, though, that we don't apply the filter during the read in the in-memory benchmark, which currently is the case; see my comment.
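To illustrate the concern, a simplified sketch of the timing setup (not the benchmark's actual harness; file name and predicate are assumptions):

```python
import time
import pandas as pd

# In the in-memory variant, IO is deliberately outside the timed
# region. Pushing the predicate into read_parquet would therefore
# move real query work out of the measurement.
lineitem = pd.read_parquet("lineitem.parquet")  # untimed setup

start = time.perf_counter()
result = lineitem[lineitem["l_shipdate"] <= "1998-09-02"]  # timed query work
elapsed = time.perf_counter() - start
print(f"query time: {elapsed:.3f}s")
```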

Join reordering / adding filters / sorts / other operations

Applying is_in filters or doing join reordering, though, is not based on heuristics but needs information about the table size at that point in the query. This is very hard for a query optimizer to figure out, and it also isn't always faster, so this is hand-optimizing query plans.
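As a concrete example of the kind of hand-optimization at issue (toy data; whether the isin pre-filter wins depends entirely on the selectivity of the date filter, which is exactly the information an optimizer lacks):

```python
import pandas as pd

# Toy stand-ins for the TPC-H tables.
orders = pd.DataFrame(
    {"o_orderkey": [1, 2, 3], "o_orderdate": ["1994-12-01", "1995-01-20", "1995-06-01"]}
)
lineitem = pd.DataFrame(
    {"l_orderkey": [1, 1, 2, 3], "l_extendedprice": [10.0, 20.0, 30.0, 40.0]}
)

# Hand-optimized plan: pre-filter the large table to the keys that can
# still match before the merge. A human knows the date filter is
# selective here; an optimizer would have to estimate that.
selective = orders[orders["o_orderdate"] < "1995-03-15"]
reduced = lineitem[lineitem["l_orderkey"].isin(selective["o_orderkey"])]
result = reduced.merge(selective, left_on="l_orderkey", right_on="o_orderkey")
```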

phofl commented 1 year ago

> Applying is_in filters or doing join reordering, though, is not based on heuristics but needs information about the table size at that point in the query. This is very hard for a query optimizer to figure out, and it also isn't always faster, so this is hand-optimizing query plans.

I agree with you. My question was meant a bit differently: this would be fairly easy to add in pandas based on some heuristics (just theoretically). So if I changed the pd.merge implementation, that would be fine?

ritchie46 commented 1 year ago

> > Applying is_in filters or doing join reordering, though, is not based on heuristics but needs information about the table size at that point in the query. This is very hard for a query optimizer to figure out, and it also isn't always faster, so this is hand-optimizing query plans.
>
> I agree with you. My question was meant a bit differently: this would be fairly easy to add in pandas based on some heuristics (just theoretically). So if I changed the pd.merge implementation, that would be fine?

Yes, that would be fine. But are you sure this is always beneficial in a merge, no matter what the cardinality/table size is? I am not convinced it is; otherwise I would also add it to the join implementation. :smile:

The optimal merge/join strategy depends on the cardinality and the size of the tables, and that is for the optimizer/implementation to figure out.
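If one did bake such a heuristic into the merge itself, it might look like the following (purely hypothetical: merge_with_semijoin_reduction and its ratio threshold are invented here for illustration, not anything in pandas):

```python
import pandas as pd

def merge_with_semijoin_reduction(
    left: pd.DataFrame,
    right: pd.DataFrame,
    *,
    left_on: str,
    right_on: str,
    ratio: int = 10,
) -> pd.DataFrame:
    """Hypothetical heuristic merge: if one input is much larger than
    the other, pre-filter it to the keys that can actually match before
    joining. The ratio threshold is a guess; as discussed above, the
    right cutoff depends on cardinality and may not exist at all.
    """
    if len(left) > ratio * len(right):
        left = left[left[left_on].isin(right[right_on])]
    elif len(right) > ratio * len(left):
        right = right[right[right_on].isin(left[left_on])]
    return left.merge(right, left_on=left_on, right_on=right_on)
```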

phofl commented 1 year ago

It's not as easy as simply adding it. But pandas does some operations under the hood that could (in theory) be used to make this faster. We pay some of the cost of figuring out whether this could help anyway. I am not saying that this would definitely work in most cases; that's something I'd have to investigate more.

ritchie46 commented 1 year ago

> We pay some of the cost of figuring out whether this could help anyway. I am not saying that this would definitely work in most cases; that's something I'd have to investigate more.

That's the gist of the challenge. ^^

ritchie46 commented 1 year ago

@phofl, I have been thinking about this a bit. I think the optimizations you do should live in another folder, maybe called `pandas_optimized`.

We should make a distinction between manual optimization and automatic optimization. The latter is what the current benchmarks cover.

Naive TPC-H SQL / 1:1 translation rules

Users should not do any optimization, but just run the code as it translates from SQL 1:1, in the most natural translation. That is, a single query that can take an input table/source.
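For illustration, a naive pandas rendering of TPC-H Q6 under this rule might look like the following sketch (assuming lineitem is already loaded with the standard column names):

```python
import pandas as pd

def q6(lineitem: pd.DataFrame) -> float:
    # Each SQL clause maps 1:1 to a pandas expression: no reordering,
    # no pushdown into the read, no data-dependent tricks.
    mask = (
        (lineitem["l_shipdate"] >= "1994-01-01")
        & (lineitem["l_shipdate"] < "1995-01-01")
        & (lineitem["l_discount"] >= 0.05)
        & (lineitem["l_discount"] <= 0.07)
        & (lineitem["l_quantity"] < 24)
    )
    selected = lineitem[mask]
    return (selected["l_extendedprice"] * selected["l_discount"]).sum()
```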

Manually optimized

The other tier allows optimizations based on the schema.

This means you are allowed to:

- push projections (column selection) into the read
- push predicates (filters) into the read

But you are not allowed to:

- reorder joins
- add filters (such as is_in) based on knowledge of cardinality
- add sorts or other operations that change the query plan

I think this is the most fair approach. It should allow other tools to also manually optimize and to turn off their optimizer, as those code cycles might not be needed anymore.

stinodego commented 4 months ago

As per Ritchie's comment, I am closing this PR.

A fresh PR with manually optimized pandas queries is welcome, but the 'naive' queries must be preserved.