splitgraph / seafowl

Analytical database for data-driven Web applications 🪶
https://seafowl.io
Apache License 2.0

Remote tables: filter without sort #463

Status: Closed (backkem closed this issue 1 week ago)

backkem commented 8 months ago

I was wondering: does the datafusion_remote_tables filter push-down not support sorting? It seems that pushing down filters and limits without a sort order could lead to unexpected results.

I'd be happy to help address this if this is indeed the case.

gruuya commented 8 months ago

Hey @backkem, that's a good question.

We abide by the TableProvider API set out by DataFusion, whose scan method doesn't take the ORDER BY clause into account: https://github.com/splitgraph/seafowl/blob/fdd7c4996f9f06385e1d3c398f19c51929ea6c41/datafusion_remote_tables/src/provider.rs#L120-L126

Sorting itself is handled by DataFusion further down the data processing pipeline (i.e. once the data has been fetched), by a plan node that sits above the scan node in the query plan.
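
To illustrate the split of responsibilities, here is a minimal std-only sketch. `ToyRemoteTable`, its `scan` method and `sort_node` are hypothetical names made up for this example; they are not the DataFusion `TableProvider` trait or any seafowl code. The sketch only assumes what is described above: the scan receives a filter and an optional limit but no sort order, and a separate node orders the rows afterwards.

```rust
/// Toy stand-in for a remote table (hypothetical, not a DataFusion type).
struct ToyRemoteTable {
    rows: Vec<i64>,
}

impl ToyRemoteTable {
    /// Mimics the shape of a scan call: a filter and an optional limit,
    /// but no sort order. Rows come back in the order the source holds them.
    fn scan(&self, filter: impl Fn(&i64) -> bool, limit: Option<usize>) -> Vec<i64> {
        let filtered = self.rows.iter().copied().filter(|v| filter(v));
        match limit {
            // The limit is applied to the unsorted, filtered rows.
            Some(n) => filtered.take(n).collect(),
            None => filtered.collect(),
        }
    }
}

/// Toy "sort node" that sits above the scan and orders the fetched rows.
fn sort_node(mut rows: Vec<i64>) -> Vec<i64> {
    rows.sort();
    rows
}
```

In this sketch, something like `sort_node(table.scan(|v| *v > 1, None))` corresponds to a query with a WHERE and an ORDER BY: the filter travels into the scan, while the ordering is applied only after the rows have been fetched.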

While filtering and sorting commute in principle, a limit doesn't commute with sorting. DataFusion handles this by carefully deciding when to push the limit down into the scan (which is why the limit argument is an Option<usize>), though I forget where exactly that occurs.
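
A small self-contained sketch of the commutativity point, using plain vectors rather than DataFusion. All four function names are hypothetical helpers written for this example only.

```rust
/// Keep rows >= `min`, then sort ascending.
fn filter_then_sort(rows: &[i64], min: i64) -> Vec<i64> {
    let mut out: Vec<i64> = rows.iter().copied().filter(|v| *v >= min).collect();
    out.sort();
    out
}

/// Sort ascending, then keep rows >= `min`.
fn sort_then_filter(rows: &[i64], min: i64) -> Vec<i64> {
    let mut sorted = rows.to_vec();
    sorted.sort();
    sorted.into_iter().filter(|v| *v >= min).collect()
}

/// Sort ascending, then take the first `n` rows (ORDER BY ... LIMIT n).
fn sort_then_limit(rows: &[i64], n: usize) -> Vec<i64> {
    let mut sorted = rows.to_vec();
    sorted.sort();
    sorted.into_iter().take(n).collect()
}

/// Take the first `n` rows in source order, then sort them.
/// This is what happens if a limit is pushed below a sort.
fn limit_then_sort(rows: &[i64], n: usize) -> Vec<i64> {
    let mut limited: Vec<i64> = rows.iter().copied().take(n).collect();
    limited.sort();
    limited
}
```

For the rows `[5, 1, 4, 2, 3]`, both filter orderings with `min = 2` give `[2, 3, 4, 5]`, but `sort_then_limit(rows, 2)` gives `[1, 2]` while `limit_then_sort(rows, 2)` gives `[1, 5]`. That difference is why a limit can only be handed to the scan when no sort sits between the limit and the scan in the plan.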

backkem commented 8 months ago

Thank you for the feedback. I'll try to find some time to look into the directions mentioned in apache/arrow-datafusion#7871.

backkem commented 1 week ago

Closing as this was answered. FYI: We created datafusion-contrib/datafusion-federation to explore the full query federation use-case.