techascent / tech.ml.dataset

A Clojure high performance data processing system
Eclipse Public License 1.0

Arrow, writing nested types. #389

Open archaic opened 7 months ago

archaic commented 7 months ago

Hi, my work requires me to implement writing nested types in Arrow format. Currently I use tech.ml.dataset to convert Clojure columnar data into the Arrow format for processing in C++. I need to implement the writing of nested vectors and maps in particular (in arrow/dataset->stream!). Is this something you are interested in me contributing to this project? If so, any advice is appreciated; otherwise I will have to do something like maintain a private fork.
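For concreteness, a minimal sketch of the pathway I mean (the flat case works today; the nested column is the case I need to support, and the file names are just examples):

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.libs.arrow :as arrow])

;; Plain columnar data serializes fine today.
(def flat (ds/->dataset {:a [1 2 3] :b [1.0 2.0 3.0]}))
(arrow/dataset->stream! flat "flat.arrow")

;; A column of nested vectors is the unsupported case I need;
;; maps-as-values would be the analogous situation.
(def nested (ds/->dataset {:a [1 2 3]
                           :xs [[1 2] [3 4 5] [6]]}))
(arrow/dataset->stream! nested "nested.arrow")
```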

cnuernber commented 7 months ago

Definitely interested in contributions here - we have read-only support for a subset of this - lists, I believe. Let me know if you want to discuss details of the design or anything else. If not, and you are fine just researching and implementing it, that works for us too.

cnuernber commented 6 months ago

@archaic - is this still an issue for you? I would rather see this contribution in tmducken but arrow also makes sense.

archaic commented 4 months ago

I wasn't able to come up with a good solution. I would love to have this functionality, as most of the datasets I work with are annoying enough to have small segments of nested data that are difficult to wrangle into columns. I think it is a difficult problem to handle generically - does the schema get inferred? That is hard in itself.

I ended up falling back to using metosin/malli (for columns with complex schemas) to define schemas for columns, then the raw Java Arrow library to convert the malli schemas to arrow datatypes and column writers. However, this feels like a regression compared to just being able to use a dataset and arrow/write! automatically.
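Roughly, the fallback looks like this - a sketch only; my real mapping covers more types, and the function name here is illustrative:

```clojure
(require '[malli.core :as m])
(import '[org.apache.arrow.vector.types.pojo Field FieldType ArrowType$List]
        '[org.apache.arrow.vector.types Types$MinorType])

(defn malli->arrow-field
  "Translate a small subset of malli forms into an Arrow Field."
  [col-name schema]
  (case (m/type schema)
    ;; width selection from :min/:max elided here
    :int    (Field/nullable col-name (.getType Types$MinorType/BIGINT))
    :double (Field/nullable col-name (.getType Types$MinorType/FLOAT8))
    :maybe  (malli->arrow-field col-name (first (m/children schema)))
    :sequential
    (Field. col-name
            (FieldType/nullable (ArrowType$List.))
            [(malli->arrow-field "item" (first (m/children schema)))])))

;; e.g. (malli->arrow-field "keys" [:sequential [:int {:min 1 :max 160}]])
```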

cnuernber commented 4 months ago

I think we can break the problem up a bit.

Are you writing the small portions of nested data into arrow as its generic map type or its struct type?

Structs are more arrow-friendly, I would think, but I would be curious which direction you went.

In any case it would be possible to add a map of column->simple schema type to describe either a generic map or a struct with a defined set of datatypes as members. The writing system would then respect this mapping if provided, else write out nested data using arrow's generic map type. This isn't implemented in the writing layer yet and would be the new addition.
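Sketching the option shape I have in mind - to be clear, the :nested-types key and the writer behavior behind it are hypothetical and not implemented yet:

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.libs.arrow :as arrow])

(def nested-ds
  (ds/->dataset {:foo/keys [[42 86 95] [88 104 23]]
                 :foo/vals [93.2 0.3]}))

;; Hypothetical: the :nested-types key and its handling do not exist yet.
(arrow/dataset->stream!
 nested-ds "nested.arrow"
 {:nested-types {:foo/keys [:list :uint8]
                 :foo/vals [:maybe :float32]}})
;; Nested columns without an entry here would fall back to arrow's
;; generic map type.
```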

Then it seems like the same pathway would apply where you could use malli to detect the schema type and then just provide some data definition of the schema type in the options to the write method. I think relying on malli to do this is totally reasonable and first class - we are big fans of malli here - and it could be an optional dependency with an optional method that is available if malli is on the classpath.
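One way to keep it optional - a sketch using runtime resolution, assuming malli's schema inference via malli.provider/provide:

```clojure
(def malli-provide
  ;; nil when malli is not on the classpath
  (try (requiring-resolve 'malli.provider/provide)
       (catch Throwable _ nil)))

(defn infer-column-schema
  "Infer a malli schema from sample column values; nil when malli is absent."
  [sample-values]
  (when malli-provide
    (malli-provide sample-values)))
```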

So if you have some simple examples and test cases and could upload those, along with some of the code - specifically the usage of malli and the mapping into arrow-schema-land - I could take it from there and get the low-level read/write work done.

It would be good to have this done in a solid, minimal way; then we can provide similar pathways for duckdb so that at least the interface, and potentially the datatype detection layers, are shared between tmducken and tmd's arrow pathways.

archaic commented 4 months ago

I will write a more detailed response and provide some code over the next few days, but essentially I want to be able to write arbitrary Clojure data that has some structure into arrow (similar to what xtdb v2 does).

For example, a column with {[42 86 95] 93.2, [88 104 23] 0.3 ...} as {uint8{3} float32, ...} or {[45 86 95] [103 32], [991 42 58] [88 14], ...} as {uint16{3} uint8{2}, ...} etc. I played around with using both Map and Struct but actually ended up creating separate vectors for the keys and values and adding appropriate metadata (perhaps structs containing "keys" and "vals" would be more appropriate ...; from memory, I don't think the arrow Map interface worked well for arbitrary data).
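Concretely, the layout I ended up with looks like this - illustrative data, and the :paired-columns metadata key is just a name I made up:

```clojure
(require '[tech.v3.dataset :as ds])

;; One logical column {[42 86 95] 93.2, [88 104 23] 0.3} split into two
;; physical columns, with metadata tying them back together.
(def paired
  (-> (ds/->dataset
       {:foo/keys [[42 86 95] [88 104 23]]
        :foo/vals [93.2 0.3]})
      (vary-meta assoc :paired-columns {:foo [:foo/keys :foo/vals]})))
```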

This would be the type of schema I have in malli: {:foo/keys [:sequential [:int {:min 1 :max 160}]] :foo/vals [:maybe [:double {:min 0}]]}

Then the keys would be dispatched to the appropriate uint8 writer. I also found malli works well with :maybe for representing nil values.
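e.g. picking the width from the malli bounds - a sketch with an illustrative helper name:

```clojure
;; Choose the narrowest unsigned datatype that fits malli :int bounds.
(defn int-bounds->dtype
  [{:keys [min max]}]
  (let [max (or max Long/MAX_VALUE)]
    (cond
      (neg? (or min 0))   :int64   ;; signed fallback for negative mins
      (<= max 255)        :uint8
      (<= max 65535)      :uint16
      (<= max 4294967295) :uint32
      :else               :uint64)))

(int-bounds->dtype {:min 1 :max 160}) ;; => :uint8
```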