reproio / columnify

Make record oriented data to columnar format.
Apache License 2.0

Consider whether to use Apache Arrow as the intermediate representation #12

Open syucream opened 4 years ago

syucream commented 4 years ago

Columnify uses Apache Arrow Schema/Record as an intermediate representation between the various input formats and the output (currently only Parquet). Arrow is powerful: it offers fast memory access and a columnar representation. But the Go implementation is not complete yet. For example, the Arrow record type doesn't support some types in its sub-fields, so it is still not applicable to Columnify. Additionally, the Arrow Go implementation doesn't support rich data conversion the way PyArrow does. As a result, only the Arrow Schema is actually required as intermediate data right now.

So we have some options to tackle these problems, such as:

As a trivial side note, gocredits doesn't work with the Go Arrow dependency. https://github.com/reproio/columnify/issues/4

syucream commented 4 years ago

Arrow intermediate records should be memory efficient, which would mitigate memory usage! https://github.com/reproio/columnify/issues/44
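
To illustrate why a columnar intermediate record can be more compact than record-oriented maps, here is a minimal sketch using only the Go standard library. It is not Arrow and not Columnify code: `Int64Column` and `ToInt64Column` are hypothetical names invented for this example. The idea is that one typed slice per field stores values contiguously instead of boxing every value in a per-row map.

```go
package main

import "fmt"

// Int64Column is a hypothetical typed column (an assumption for this
// sketch, not an Arrow type). Values are stored contiguously, and Valid
// marks which positions hold a real value rather than a null.
type Int64Column struct {
	Name   string
	Values []int64
	Valid  []bool
}

// ToInt64Column pivots one field out of record-oriented rows into a column.
// A missing key or a value that is not an int64 becomes a null entry.
func ToInt64Column(name string, rows []map[string]interface{}) Int64Column {
	col := Int64Column{
		Name:   name,
		Values: make([]int64, len(rows)),
		Valid:  make([]bool, len(rows)),
	}
	for i, row := range rows {
		if v, ok := row[name].(int64); ok {
			col.Values[i] = v
			col.Valid[i] = true
		}
	}
	return col
}

func main() {
	rows := []map[string]interface{}{
		{"id": int64(1), "name": "a"},
		{"id": int64(2), "name": "b"},
		{"name": "c"}, // missing id becomes a null
	}
	col := ToInt64Column("id", rows)
	fmt.Println(col.Values, col.Valid) // prints: [1 2 0] [true true false]
}
```

A real Arrow record would use validity bitmaps and typed buffers rather than a `[]bool`, but the layout idea is the same.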

syucream commented 4 years ago

It can also validate input data against a given schema. https://github.com/reproio/columnify/issues/27
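
As a rough sketch of what schema-based validation of input records could look like, assuming records arrive as `map[string]interface{}`: the `Schema`, `Field`, `FieldType`, and `Validate` names below are hypothetical and defined only for this example. They are not the Arrow Go API and not Columnify's actual implementation.

```go
package main

import "fmt"

// FieldType is a hypothetical, deliberately tiny set of supported types.
type FieldType int

const (
	Int64 FieldType = iota
	String
)

// Field and Schema are stand-ins for an intermediate schema (assumption).
type Field struct {
	Name     string
	Type     FieldType
	Nullable bool
}

type Schema []Field

// Validate checks that every schema field is present in the record with the
// expected Go type. Missing or nil values are accepted only for nullable
// fields. Keys in the record that are not in the schema are ignored.
func Validate(s Schema, rec map[string]interface{}) error {
	for _, f := range s {
		v, present := rec[f.Name]
		if !present || v == nil {
			if f.Nullable {
				continue
			}
			return fmt.Errorf("field %q is required", f.Name)
		}
		var ok bool
		switch f.Type {
		case Int64:
			_, ok = v.(int64)
		case String:
			_, ok = v.(string)
		}
		if !ok {
			return fmt.Errorf("field %q has unexpected type %T", f.Name, v)
		}
	}
	return nil
}

func main() {
	schema := Schema{
		{Name: "id", Type: Int64},
		{Name: "name", Type: String, Nullable: true},
	}
	// Valid record: the nullable name may be omitted, so no error is returned.
	fmt.Println(Validate(schema, map[string]interface{}{"id": int64(1)}))
	// Wrong type for id: an error is returned.
	fmt.Println(Validate(schema, map[string]interface{}{"id": "oops"}))
}
```

Validating at the intermediate layer means every input format passes through the same check before anything is written out.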