pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

`scan_database` feature #9091

Open · lucazanna opened this issue 1 year ago

lucazanna commented 1 year ago

Problem description

I have been reading about Ponder recently, and about people's interest in querying a database / data warehouse using a Python-based library instead of SQL.

There are multiple reasons for preferring a Python library to SQL.

Adding a `scan_database` to Polars would allow Polars to do the same, with the advantages of Polars on top of it.

I imagine that this would be a very big undertaking as it would mean converting Polars expressions to multiple SQL dialects. However, even starting with only one dialect would allow Polars to help companies using that particular dialect.

I imagine that focusing on 3-4 SQL dialects would cover a sizable part of the market: BigQuery, PostgreSQL, SQL Server (or Azure SQL), and Snowflake.
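
As a rough sketch, assuming a hypothetical `scan_database` function (neither the function nor this signature exists in polars today), the user-facing API could look something like this:

```python
import polars as pl

# Hypothetical: the function name, parameters, and URI below are
# illustrative only; nothing like this exists in polars yet.
lf = pl.scan_database(
    table="orders",
    connection_uri="postgresql://user:pass@host:5432/db",
)

# Ideally the optimizer would push the projection and predicate below
# down into the generated SQL instead of filtering client-side.
df = (
    lf.select(["customer_id", "amount"])
      .filter(pl.col("amount") > 100)
      .collect()
)
```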

ritchie46 commented 1 year ago

We might push some optimizations into a SQL query in the future. I am only thinking about predicates, selections and limits here.

Translating a whole polars plan to SQL seems like a whole can of worms that is far out of scope for this project. We might work with substrait one day, but also that is not something we should be focusing on IMO.
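
To make the scope concrete, here is a minimal sketch of the limited translation described above: only the projection, predicate, and limit are rendered into SQL, and everything else still executes in polars. The `plan_to_sql` helper is invented for illustration and is not part of polars.

```python
# Hypothetical helper: render the three pushable pieces of a plan as SQL.
def plan_to_sql(table, columns, predicate=None, limit=None):
    sql = "SELECT {} FROM {}".format(", ".join(columns), table)
    if predicate is not None:
        sql += " WHERE " + predicate
    if limit is not None:
        sql += " LIMIT {}".format(limit)
    return sql

# A lazy plan like .select(["a", "b"]).filter(pl.col("x") > 1).head(10)
# could then reach the database as:
print(plan_to_sql("t", ["a", "b"], predicate="x > 1", limit=10))
# SELECT a, b FROM t WHERE x > 1 LIMIT 10
```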

universalmind303 commented 1 year ago

I've done some work on using the AnonymousScan trait to connect to both postgres & mongo, but to ritchie's point, the current execution engine is optimized for file reads, not for database queries. While the aforementioned pushdowns work fine for simple queries, the optimizer isn't currently capable of pushing down more complex operations such as joins.

I suppose that we could extend the functionality of the optimizer & AnonymousScan to allow for custom optimization rules & pushdowns. That'd at least open the door to supporting this kind of functionality.

@ritchie46 if you want, I already have a working implementation of reading from postgres that supports predicate, projection & slice pushdowns. I'd be happy to submit a PR for it under a feature flag, or even a separate crate.

lucazanna commented 1 year ago

If @ritchie46 agrees, having predicate, projection, and slice pushdown for Postgres would be a great addition.

henryharbeck commented 12 months ago

Hi @ritchie46 and @alexander-beedie, from my understanding, if/once a visitor is exposed on the Python side, the predicates, selections, and limits could be pulled out of a lazy query plan? Is that correct? From there, would that open the door to pushing things down to the database?

If that's accurate, then this issue would be the enabler.

If we get to the stage where you have something ready to push down to a database, what are your thoughts on using sqlglot as a tool to translate SQL between dialects? With this tool, once you have "polars SQL", you could have them "all" (that is, all dialects supported by sqlglot).
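
For example, sqlglot's `transpile` function rewrites a statement from one dialect into another, so polars would only need to emit a single canonical dialect (the statement below is illustrative):

```python
import sqlglot

# One input statement, rendered for two different target dialects.
sql = "SELECT user_id, CAST(ts AS DATE) AS day FROM events LIMIT 10"
print(sqlglot.transpile(sql, read="duckdb", write="snowflake")[0])
print(sqlglot.transpile(sql, read="duckdb", write="bigquery")[0])
```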

If this was something you and the polars dev team were open to, then I'd be happy to work on an implementation / PR on their side.

In terms of adoption, sqlglot does have a few projects using it (and I think dbt-duckdb will be added there soon too).

Obviously the counter to this is that it adds another optional dependency, but there are already a fair few in `read_database`/`write_database`. I've not worked with sqlglot before, so it would potentially need to be evaluated for performance too.

Interested to hear your thoughts. Thanks

mutecamel commented 11 months ago

> I am only thinking about predicates, selections and limits here.

Projection/predicate pushdown into columnar databases like DuckDB or ClickHouse would greatly reduce disk-read workloads; that is the real strength of columnar data. I hate manually writing raw SQL strings in my Python scripts to select and filter, especially after being spoiled by Polars expressions. Looking forward to this great feature, thanks!
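
For DuckDB specifically, a workaround exists today: its relational API performs the projection and predicate pushdown itself, and `.pl()` hands the result to polars (the table and column names below are examples):

```python
import duckdb

con = duckdb.connect("data.duckdb")
# DuckDB plans the filter and projection itself, reading only the
# columns and row groups it needs, then returns a polars DataFrame.
df = (
    con.table("lineitem")
       .filter("l_quantity > 40")
       .project("l_orderkey, l_quantity")
       .pl()
)
```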

jonashaag commented 10 months ago

It would be really handy to have streaming support for reading query results. My use case:

```python
pl.scan_database(...).stuff().sink_parquet("stuff.parquet")
```
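
In the meantime, something close can be approximated by batching, assuming a DB-API connection `conn` and a polars version whose `read_database` supports `iter_batches`:

```python
import polars as pl
import pyarrow.parquet as pq

# Stream the query result in chunks and append each chunk to the
# Parquet file, so the full result never sits in memory at once.
writer = None
for batch in pl.read_database(
    "SELECT * FROM stuff", conn, iter_batches=True, batch_size=50_000
):
    tbl = batch.to_arrow()
    if writer is None:
        writer = pq.ParquetWriter("stuff.parquet", tbl.schema)
    writer.write_table(tbl)
if writer is not None:
    writer.close()
```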

rupurt commented 5 months ago

Would also love to have scan_database with sink support. It would be super useful for exporting from ODBC

fdcastel commented 4 months ago

> I imagine that this would be a very big undertaking as it would mean converting Polars expressions to multiple SQL dialects.

Just passing by to add: with SQLAlchemy, this is a solved problem.

We "just" need to convert polars expressions to SQLAlchemy queries. To quote the documentation:

> The SQL Expression Language is a toolkit (...) which provides a system of constructing SQL expressions represented by composable objects, which can then be “executed” against a target database within the scope of a specific transaction, returning a result set.

Several examples of the expressive power of this language (with JOINs, GROUP BY, CTEs) can be found in the SQLAlchemy tutorial.
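
A small illustration of the idea: one SQLAlchemy Core expression compiled for two different dialects. Converting polars expressions into these composable objects would be the (currently hypothetical) missing piece:

```python
from sqlalchemy import column, select, table
from sqlalchemy.dialects import mssql, postgresql

# A lightweight table definition and a query built from composable objects.
t = table("orders", column("id"), column("amount"))
stmt = select(t.c.id).where(t.c.amount > 100).limit(10)

# The same expression renders differently per target dialect.
print(stmt.compile(dialect=postgresql.dialect()))
print(stmt.compile(dialect=mssql.dialect()))
```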

douglas-raillard-arm commented 3 months ago

If I understand correctly, having support for projection/slice/predicate pushdown in a scan_database() function would allow efficient reading of remote Parquet files once combined with something like this: https://github.com/voltrondata/flight-sql-server-example
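
A hedged sketch of how that can be consumed from Python today via the ADBC Flight SQL driver (the URI, query, and table are examples; without pushdown, the filtering has to live in the SQL string):

```python
import polars as pl
from adbc_driver_flightsql import dbapi as flight_sql

# Connect to a Flight SQL server over gRPC through ADBC, then read the
# result into polars. A future scan_database could generate this SQL
# from the lazy plan instead of requiring it to be handwritten.
conn = flight_sql.connect("grpc://localhost:31337")
df = pl.read_database("SELECT id, value FROM remote_table LIMIT 100", conn)
```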

pickfire commented 1 month ago

Would be great to have a `sink_database` too; one use case is streaming from Parquet to a database, as in https://github.com/adriangb/pgpq
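
Until then, a batched approximation is possible with what exists today: read the Parquet file in chunks and append each chunk to the table (the URI and table name are examples, and the `if_table_exists` parameter may be named differently on older polars versions):

```python
import polars as pl
import pyarrow.parquet as pq

uri = "postgresql://user:pass@localhost:5432/db"  # example URI
pf = pq.ParquetFile("stuff.parquet")
for batch in pf.iter_batches(batch_size=100_000):
    # Convert each Arrow batch to polars and append it to the table,
    # so the whole file is never materialized in memory at once.
    pl.from_arrow(batch).write_database(
        "target_table", uri, if_table_exists="append"
    )
```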