ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
5.07k stars 586 forks source link

refactor(polars): use `Select` op within polars backend #10005

Open jcrist opened 1 week ago

jcrist commented 1 week ago

Our other SQL backends convert Project/Filter/Sort/Distinct into a single Select operation. This fusion both results in simpler SQL, and results in these operations being (with some exceptions) commutative. In #9923 I added a test for this commutativity which is currently failing for the polars backend since we don't rewrite these queries to Select nodes.

I think the easiest fix would be to use the same rewrites as the SQL backend to generate SQL nodes. With the deprecation (and future removal) of the dask/pandas backends, another option would be to drop Project/Filter/Sort/Distinct entirely internally and only make use of the more general Select op.

cpcloud commented 1 week ago

I think dropping those other operations makes it harder to understand sequences of operations conceptually, so I'd prefer not to drop them.

That's how Ibis used to represent all selections and it was very difficult to understand the process by which expressions moved into SQL (or whatever else).

Splitting things up into separate operations way isn't without trade-offs, but it helps a lot with isolating the various parts of the compilation pipeline and reasoning about what is and isn't allowed (structurally, typing-wise, and optimization-wise).

cpcloud commented 1 week ago

+1 to using Select in the polars backend if that leads to some simplification.

jcrist commented 1 week ago

Splitting things up into separate operations way isn't without trade-offs, but it helps a lot with isolating the various parts of the compilation pipeline and reasoning about what is and isn't allowed (structurally, typing-wise, and optimization-wise).

Doesn't the rewrite system help a bit with alleviating this? Besides the fusion code (project_to_select, merge_select_select, ...) I didn't really see much analysis/rewrite code that dispatched on Project/Filter/Sort/Distinct. If we're always fusing for SQL generation, using Select immediately would avoid a set of rewrites later. If you have a reference to some code that uses the split types for analysis that would be a helpful reference.

cpcloud commented 1 week ago

I believe all the DerefMap code depends on the various operations being split up.

jcrist commented 1 week ago

Ah, it does. Missed that file, thanks.