asbisen opened 2 years ago
Parquet is definitely already supported, see here or in the package tests! :) https://github.com/kimmolinna/DuckDB.jl/issues/6#issuecomment-1011489703
Arrow support would allow for much better overall performance. DuckDB already provides the ability to create relations from an Arrow table.
This should be possible (in some way) already if duckdb supports it. Would you consider making a PR with this feature?
I don't think I have the technical chops for that (calling c interfaces from Julia), but I can certainly give it a shot and see how it goes.
My understanding is that at this point you can fetch data into an Arrow.Table, but if you look at the DuckDB header file, there is no way to do what this Python code does:
# Transforms Arrow Table -> DuckDB Relation
rel_from_arrow = duckdb.arrow(arrow_table)
But I would say that @hannesmuehleisen is the right person to point us in the right direction with this...
@asbisen what did you mean with "better overall performance"? Can you point to a specific case?
@hannes I am assuming that if we can pass results in Arrow directly to Arrow.Table() then we can avoid materializing the results in DataFrames.jl and allocating more memory. The Arrow.Table can then be queried and transformed using any library that works with the Tables.jl interface, for example Query.jl.
And if it all works with an mmap'd Arrow object, then there is potential to work with datasets larger than memory.
Why would you use Arrow Table for queries when you can use DuckDB? Quote from arrow.apache.org: "DuckDB’s state-of-the-art optimizer can push down filters and projections directly into Arrow scans. As a result, only relevant columns and partitions will be read, allowing the system to e.g., take advantage of partition elimination in Parquet files. This significantly accelerates query execution."
Yeah I don't get that either yet.
@kimmolinna @hannes I guess I used the wrong analogy. I use Query.jl extensively, mostly for feature engineering, where I can use custom functions written in Julia; that becomes cumbersome with SQL, whereas an Arrow dataset provides the Tables.jl interface.
Here is a list of all the packages that support the Tables.jl interface.
So these workflows become efficient if I can pass Arrow results from DuckDB. Currently this is possible if I export Parquet from DuckDB's query; but Parquet support is not great in Julia, so CSV for now. And then read it back from disk to pass the data through other libraries.
DuckDB (Query/Subset) => Query.jl (Feature Engineering) => Impute.jl
DuckDB (Query/Subset) => TimeSeries.jl
This would make DuckDB.jl more composable with the Julia ecosystem. I thought this would be simple because in Python we can do this. Alternatively, if we get Tables.jl compatibility, that would be great too :)
@asbisen What would be the most common way to do this? DBInterface.jl or Tables.jl?
FWIW, I think there's a slight misconception here, but correct me if I'm missing something:
You can *always* 'hop into' the Tables.jl ecosystem by taking an object and turning it into a DataFrame.
So if you take your DuckDB.jl query result and pop it into a DataFrame (like here), you are home free and can use the full Tables.jl ecosystem.
But...this isn't to say there aren't gains from implementing both DBInterface.jl and Tables.jl; it will just take programming and maintenance time and effort to get there (which is why it isn't available right now).
Just to spell out the specific use / advantages so we are clear about the chunk of work:
DBInterface.jl would ensure that the database connection API used by DuckDB is aligned and consistent with other DBInterface.jl-supporting packages.
Tables.jl would ensure that DuckDB.jl supports data object manipulation without always pushing things into a materialized DataFrame (which itself satisfies Tables.jl) first.
Note: for implementing a Tables.jl interface, I'd look at SQLite.jl, particularly here.
I have unintentionally managed to divert attention from exposing an Arrow pointer to the Tables.jl interface. I see big value in exposing the results from DuckDB as a pointer to Arrow results. It would save on data conversion and memory.
@asbisen just to probe this a bit further to check understanding... what other (Julia) packages would be able to take this Arrow pointer and work with it? Because otherwise, it seems hard to avoid data conversion post-DuckDB.
Arrow.jl exposes the Tables.jl interface too. So once Arrow.jl gets the result from DuckDB, those tables can be accessed by any package/program. For example, I can use Dagger.jl to process the resulting dataset in parallel using distributed computing. Dagger's DTable can consume any table which exposes the Tables.jl interface.
Ok, cool! So maybe what you are suggesting is that instead of DuckDB.jl having a function like todataframe which materializes things into an object, it might be better if DuckDB.jl did NOT do this, and instead passed along a pointer / had a toarrow method which materialized an Arrow object (reference). This way DuckDB.jl can handle the querying, etc., and DuckDB.jl can piggyback on the Arrow.jl Tables.jl implementation, etc.?
We are working on the Tables.jl/DBInterface.jl integration for DuckDB
@jeremiahpslewis correct. If you can hand off a pointer to Arrow results from DuckDB to Arrow.jl, you can use the Tables.jl interface on those tables via Arrow.jl. I feel Julia has one of the most composable ecosystems.
Ok, then may I suggest the following? (Roughly, perhaps this isn't quite it.) With the goal of having specific, manageable goals for each issue, and splitting up the Julia/API-level change (2.) and the internals/optimization (3.)?
Is it possible to perform queries on Arrow files? Julia has a fully featured Arrow.jl library with which we can pass an Arrow table to DuckDB. Something like this would be great, allowing queries on multiple Arrow files and returning the result as an Arrow.Table.
Python example