modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0

Steps for new SQL Engine: DuckDB #2589

Open Alex-Monahan opened 3 years ago

Alex-Monahan commented 3 years ago

Hi folks!

Have you happened to hear about DuckDB? It's an embedded database like SQLite, but designed for OLAP/analytics. It is frequently ~20x faster than SQLite in a single thread, and it also supports multiple threads. Another nice feature is that it can execute SQL directly on a Pandas or Arrow data frame without having to insert it into DuckDB first. It uses the Postgres query parser, so its SQL dialect is much closer to other DBs than SQLite's is, and it supports most SQL operations (joins, group-bys, recursive WITH, arbitrary subqueries, window functions, etc.). If you are looking for a lightweight but fast SQL backend, it could work well!

Along those lines, what is the work required to allow Modin to work with a SQL backend? Is your initial target SQLite? I see some mention of OmniSci DB also.

Thanks! -Alex

devin-petersohn commented 3 years ago

Hi Alex, thanks for posting!

SQLite is one of the APIs that we target, given its use in the Python community, but it is not currently planned as an execution engine. We do have an ongoing effort for OmniSciDB, which works on CPU and GPU.

I am aware of DuckDB; it is an interesting project. We would not put engineering effort into it until there is real demand for it. As it stands, there are (relatively) few production workloads on DuckDB given how new it is, so targeting it as an execution engine does not currently make sense from an engineering-resources standpoint. We aim to target execution/SQL engines that are used in production because we want to bootstrap the existing engines used in production and let data scientists use their preferred API on top of them.

That said, we will be watching DuckDB because of its promise, and as always pull requests are welcome if you have the cycles to contribute it yourself!

Alex-Monahan commented 1 year ago

It's fantastic that Ponder now supports DuckDB as a back end!

I am not very familiar, so what are the main differences between the Ponder library and Modin? Will the DuckDB support be open sourced, or how does the licensing model work for Ponder?

It's also great to see Intel Capital investing in Ponder!

devin-petersohn commented 1 year ago

@Alex-Monahan thanks and welcome back!

> I am not very familiar, so what are the main differences between the Ponder library and Modin?

I'll try to be brief here. Ponder is an extension to Modin that adds more backends, including DuckDB (data warehouses are supported as well!). It leverages Modin's architecture, but adds the capability to generate pandas-compatible SQL and manage metadata, plus some other magic.

> Will the DuckDB support be open sourced, or how does the licensing model work for Ponder?

Right now it's closed source, but completely free. We may open source it in the future.

> It's also great to see Intel Capital investing in Ponder!

Intel is also a big contributor to Modin 😄. cc @Garra1980 @aregm