techascent / tech.ml.dataset

A Clojure high performance data processing system
Eclipse Public License 1.0

Support duckdb (columnar OLAP "sqlite" with Arrow API) #241

Closed · damianr99 closed this 2 years ago

damianr99 commented 3 years ago

I'm curious whether you've looked at enabling DuckDB as a backend for tmd; it seems like a perfect match. DuckDB (http://duckdb.org) is an in-process SQL OLAP engine that is heavily vectorized and provides a native Arrow format for efficient transfer of large datasets (e.g. into numpy / pandas / R).
It looks like you could use tech.jna to bypass the JDBC library and use the underlying Arrow APIs (see https://github.com/duckdb/duckdb/blob/4e15532eab54e2b1b008bb24e2804c4eb48d1f66/test/api/test_arrow.cpp as an example of using the Arrow API).
Does this look like it could be a good fit? I'm thinking of using a pipeline of duckdb -> techascent -> neanderthal -> deep diamond -> GPU to have an efficient path from SSD to GPU training with minimal format conversions needed. Exciting :)

cnuernber commented 3 years ago

Wow, yes, looks like a good area to research a bit. Agreed that it is potentially lots of awesome. Thanks for bringing this up.

cnuernber commented 3 years ago

One way this could work is to build a fat jar using JavaCPP to bind to the C++ API of duckdb and to package the os-specific native dependency inside the jar. JavaCPP is difficult to learn but it works very well, and duckdb is exactly the sort of system it is designed for.

It doesn't really seem necessary for your use case, however. Have you tried building out your pipeline with plain tmd/tablecloth API calls and then saving to Arrow? The cost of conversion is fairly minimal - at worst it is O(n) in your dataset size, while your algorithms are often O(n^2) or worse. So in my experience it can be tough to see major gains even with zero-copy Arrow pathways, which TMD is capable of but, for instance, pandas is not.
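A minimal sketch of that approach - do the work with ordinary tmd calls, then pay the one O(n) conversion at the boundary by writing Arrow. Namespace and function names below are from memory of the tmd API and should be checked against the current documentation:

```clojure
;; Sketch: build/transform a dataset with plain tmd calls, then save it
;; as an Arrow stream file for downstream consumers (numpy/pandas/R).
;; Function names should be verified against current tmd docs.
(require '[tech.v3.dataset :as ds]
         '[tech.v3.libs.arrow :as arrow])

(def pipeline-result
  (-> (ds/->dataset "input.csv")
      ;; ...whatever filtering/aggregation the pipeline needs...
      (ds/filter-column :score #(> % 0.5))))

;; Single O(n) conversion at the boundary; downstream tools read Arrow directly.
(arrow/dataset->stream! pipeline-result "output.arrow")
```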

TMD relies a lot on some specific aspects of the JVM and its performance is quite good - note that in my benchmarks I beat everything, including Julia and Pandas, by significant factors across both Parquet and Arrow - so I wouldn't use duckdb as a backend for TMD specifically. I could, however, provide accelerated pathways into/out of duckdb for TMD, and we as a community could provide a Clojure library that makes it easy to use duckdb, as I agree it is very powerful and interesting.

If you are interested in building out duckdb with javacpp and Clojure I would be down to help with that as I have a lot of javacpp experience.

cnuernber commented 3 years ago

Or, another (probably better) way is to build a C layer on top of duckdb and make it installable. Then you could use the FFI layer in dtype-next - which supersedes tech.jna - to access it. In addition, R, Julia, and various other communities could use duckdb in the same way.
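A rough sketch of what such an FFI binding could look like with dtype-next. The C functions shown are hypothetical at this point (a flat C layer over duckdb doesn't exist yet), and the `define-library-interface` usage is from memory of the dtype-next docs:

```clojure
;; Sketch: binding a hypothetical flat C wrapper over duckdb through the
;; dtype-next FFI layer.  Function names and type signatures here are
;; illustrative only - consult the dtype-next ffi docs for the exact API.
(require '[tech.v3.datatype.ffi :as dt-ffi])

(dt-ffi/define-library-interface
  {:duckdb_open  {:rettype  :int32
                  :argtypes [['path   :string]
                             ['out-db :pointer]]}
   :duckdb_query {:rettype  :int32
                  :argtypes [['conn       :pointer]
                             ['sql        :string]
                             ['out-result :pointer]]}})

;; Once the shared library is loaded, the generated fns are callable
;; like ordinary Clojure functions, and the same C layer is usable
;; from R, Julia, etc.
```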

damianr99 commented 3 years ago

Thanks for the response (and sorry for the delay in replying). It looks like only the Python client library takes advantage of DuckDB's Arrow support today. I'm wondering if I could avoid writing any new code by leveraging libpython-clj (https://github.com/clj-python/libpython-clj) and the existing duckdb Python client, but I haven't had a chance to try that out yet. If it works, I'll report back.
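That zero-new-code route could look roughly like the sketch below, driving the duckdb Python client from Clojure via libpython-clj. This assumes the duckdb pip package is installed; names should be checked against the current libpython-clj and duckdb docs:

```clojure
;; Sketch: use libpython-clj to call the duckdb Python client, which
;; already exposes Arrow.  Assumes `pip install duckdb`; verify names
;; against current libpython-clj2 and duckdb documentation.
(require '[libpython-clj2.require :refer [require-python]]
         '[libpython-clj2.python :as py])

(require-python '[duckdb :as duckdb])

(def conn (duckdb/connect ":memory:"))

;; fetch_arrow_table returns a pyarrow Table, which could then be handed
;; to tmd's Arrow support on the JVM side.
(def tbl (py/py. (py/py. conn execute "select 42 as answer")
                 fetch_arrow_table))
```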

cnuernber commented 3 years ago

I am interested in the results; libpython-clj includes a lot of easy ways to transfer data from Python to Java, especially if you convert what you want to a numpy object. I will be very interested to see whether you get better performance for a given workload using duckdb than you do just using tech.ml.dataset.

cnuernber commented 3 years ago

Closing this until there is reason to go further. I don't think there is a strong advantage to using duckdb over tmd at this point for nearly any operation, and what we really need is a JDBC extension pathway that takes/returns sequences of Arrow datasets. That is way beyond the scope of what I think our community can achieve.

cnuernber commented 2 years ago

Duckdb now has a c interface. Reopening - game on :-).

cnuernber commented 2 years ago

https://github.com/cnuernber/tmducken

Arrow isn't necessary - the result set of the C interface already gives you the columns in raw pointer-to-data format. I actually think that for most users the JDBC pathway will be just fine, and it already works with the existing SQL bindings, as I state in the tmducken readme. The C interface doesn't include a way to append an Arrow file to a table or to create a table from an Arrow file; both of those pieces are missing.
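For reference, usage along the lines of the tmducken readme looks roughly like the following. Function names are from memory of that readme and should be verified against the repository:

```clojure
;; Sketch of tmducken usage - names are from memory of the tmducken
;; readme; see https://github.com/cnuernber/tmducken for the
;; authoritative example.  Requires the duckdb shared library.
(require '[tmducken.duckdb :as duckdb]
         '[tech.v3.dataset :as ds])

(duckdb/initialize!)                ;; load the duckdb shared library

(def db   (duckdb/open-db))         ;; in-memory database
(def conn (duckdb/connect db))

(def stocks (ds/->dataset "stocks.csv" {:dataset-name "stocks"}))

(duckdb/create-table! conn stocks)
(duckdb/insert-dataset! conn stocks)

;; Query results come back as a tech.ml.dataset dataset - no Arrow
;; round-trip needed.
(duckdb/sql->dataset conn
  "select symbol, avg(price) from stocks group by symbol")
```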