kimmolinna / DuckDB.jl

MIT License
29 stars 5 forks source link

reading arrow files #11

Open asbisen opened 2 years ago

asbisen commented 2 years ago

Is it possible to perform query on arrow files? Julia has a fully featured Arrow.jl library using which we can pass a Arrow Table to DuckDB .

Something like this would be great, allowing to query on multiple arrow files and returning the result as an ArrowTable.

arrow_files = ["a1.arrow", "a2.arrow", "a3.arrow"]
tbl = Arrow.Table(arrow_files)

db = DuckDB.open(":memory")
con = DuckDB.connect(db)

q = "SELECT count(*) from tbl;"
r = DuckDB.execute(con, q)

arrow_result = DuckDB.toArrowTable(r)

Python example

arrow_table = pq.read_table('integers.parquet')
con = duckdb.connect()

print(con.execute('SELECT SUM(data) FROM arrow_table WHERE data > 50').fetchone())
con.execute("SELECT * FROM arrow_table").fetch_arrow_table()
jeremiahpslewis commented 2 years ago

Parquet is definitely already supported, see here or in the package tests ! :) https://github.com/kimmolinna/DuckDB.jl/issues/6#issuecomment-1011489703

asbisen commented 2 years ago

Arrow support would allow for much better overall performance. DuckDB already provides the ability to create relations from Arrow Table

jeremiahpslewis commented 2 years ago

This should be possible (in some way) already if duckdb supports it. Would you consider making a PR with this feature?

asbisen commented 2 years ago

I don't think I have the technical chops for that (calling c interfaces from Julia), but I can certainly give it a shot and see how it goes.

kimmolinna commented 2 years ago

I have understood that at this point you could fetch data to an ArrowTable but if you look DuckDB-header file there isn't possibility to do like this Python code does

# Transforms Arrow Table -> DuckDB Relation
rel_from_arrow = duckdb.arrow(arrow_table)
kimmolinna commented 2 years ago

But I would say that @hannesmuehleisen is a right person to point us to right direction with this...

hannes commented 2 years ago

@asbisen what did you mean with "better overall performance"? Can you point to a specific case?

asbisen commented 2 years ago

@hannes I am assuming that if we can pass results in Arrow directly to Arrow.Table() then we can avoid materializing the results in DataFrames.jl and allocating more memory. Then the Arrow.Table can be queried and transformed using any library that works with Tables.jl interface. Example Query.jl.

And if all works with mmap'd arrow object then there is a potential to work with dataset larger than memory.

kimmolinna commented 2 years ago

Why would you use Arrow Table for queries when you can use DuckDB? Quote from arrow.apache.org: "DuckDB’s state-of-the-art optimizer can push down filters and projections directly into Arrow scans. As a result, only relevant columns and partitions will be read, allowing the system to e.g., take advantage of partition elimination in Parquet files. This significantly accelerates query execution."

hannes commented 2 years ago

Yeah I don't get that either yet.

asbisen commented 2 years ago

@kimmolinna @hannes I guess I used a wrong analogy, I use Query.jl extensively mostly for feature engineering where I can use custom functions written in Julia. Which becomes cumbersome with SQL; whereas Arrow dataset provides Tables.jl interface.

Here is a list of all the packages that supports Tables.jl interface.

So these workflow becomes efficient if I can pass Arrow results from DuckDB. Currently this is possible if I export parquet out from DuckDB's query; But parquet support is not great in Julia so CSV for now. And then read it again from disk for passing the data through other libraries.

DuckDB (Query/Subset) => Query.jl (Feature Engineering) => Impute.jl DuckDB (Query/Subset) => TimeSeries.jl

This would make DuckDB.jl more composable with the Julia ecosystem. I thought this would be simple because in Python we can do this. Alternatively if we get Tables.jl compatibility that would be great too :)

kimmolinna commented 2 years ago

@asbisen What would be the most common way to this? DBInterfaces.jl or Tables.jl?

jeremiahpslewis commented 2 years ago

FWIW, I think there's a slight misconception here, but correct me if I'm missing something:

You can !always! 'hop into' the Tables.jl ecosystem by taking an object and turning it into a DataFrame.

So if you take your DuckDB.jl query and pop it into a dataframe (like here) you are home free and can use the full tables.jl ecosystem

But...this isn't to say there aren't gains from implementing both DBInterfaces.jl and Tables.jl, but it will take programming & maintaining time and effort to get there (which is why it isn't available right now).

Just to spell out the specific use / advantages so we are clear about the chunk of work:

Note: for implementing a Tables.jl interface, I'd look at SQLite.jl, particularly here

asbisen commented 2 years ago

I have unintentionally managed to divert the attention from exposing Arrow pointer to Tables.jl interface. I see a big value in exposing the results from DuckDB as pointer to Arrow results. Would save on data conversion and memory.

jeremiahpslewis commented 2 years ago

@asbisen just to probe this a bit further to check understanding...what other (Julia) packages would be able to take this arrow pointer and work with it? Because otherwise, it seems hard to avoid data conversion post DuckDB

asbisen commented 2 years ago

Arrow.jl exposes Tables.jl interface too. So once Arrow.jl get's the result from DuckDB those tables can be accessed by any package/program. For example I can use Dagger.jl to process the resultant dataset in parallel using distributed computing. Dagger's DTable can consume any table which exposes Tables.jl interface.

jeremiahpslewis commented 2 years ago

Ok, cool! So maybe what you are suggesting is instead of DuckDB.jl having a function like todataframe which materializes things to an object, it might be better if DuckDB.jl was NOT able to do this, and instead passed along a pointer / had a toarrow method which materialized an arrow object (reference). This way DuckDB.jl can handle the querying, etc., and DuckdDB.jl can piggyback on the Arrow.jl tables.jl implementation, Etc?

hannes commented 2 years ago

We are working on the Tables.jl/DBInterface.jl integration for DuckDB

asbisen commented 2 years ago

@jeremiahpslewis correct. If you can hand off pointer to arrow results from DuckDB to Arrow.jl you can use Tables.jl interface on those tables via Arrow.jl. I feel Julia has one of the most composable ecosystem.

jeremiahpslewis commented 2 years ago

Ok, then may I suggest the following?

(Roughly, perhaps this isn't quite it)

  1. Close this issue
  2. Open an issue 'replace DataFrames.jl support with Tables.jl'
  3. Open a separate issue 'refactor Tables.jl support to use Arrow.jl'

With the goal to have specific, manageable goals with each issue and to split up the Julia/api-level change (2.) and the internals / optimization (3.)?