ironSource / parquetjs

fully asynchronous, pure JavaScript implementation of the Parquet file format
MIT License

ArrowJS and parquetjs interop #84

Closed by trxcllnt 2 years ago

trxcllnt commented 5 years ago

Great work on parquetjs! I'm opening this issue to discuss whether the project would be interested in tighter interop or integration with ArrowJS. I know pyarrow has parquet integration, so it seems natural that we could have the same thing in JS.

I threw together this proof-of-concept repo today to test the waters. It seems to work out of the box with the Arrow Table's row iterator. I'd be curious whether there's a column-oriented parquetjs writer (for non-compressed types) that would allow us to copy directly from Arrow Vectors' underlying buffers.
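To make the row-versus-column distinction concrete, here is a minimal sketch of pivoting row objects into per-column arrays, which is the shape a column-oriented writer would want. It uses neither the ArrowJS nor the parquetjs API; `Cell`, `Row`, `Columns`, and `rowsToColumns` are hypothetical names made up for illustration.

```typescript
// Hypothetical types for illustration only; not part of ArrowJS or parquetjs.
type Cell = number | string | boolean | null;
type Row = Record<string, Cell | undefined>;
type Columns = Record<string, Cell[]>;

// Pivot row objects into one array per named field.
// Fields missing from a row become null so all columns keep the same length.
function rowsToColumns(rows: Row[], fields: string[]): Columns {
  const columns: Columns = {};
  for (const name of fields) {
    columns[name] = rows.map((row) => row[name] ?? null);
  }
  return columns;
}

// Usage: two rows become two columns of length two.
const example = rowsToColumns(
  [{ id: 1, name: "a" }, { id: 2, name: "b" }],
  ["id", "name"],
);
console.log(example);
```

A row iterator forces this kind of per-value copy. Copying directly from a vector's underlying buffer would skip it, assuming the writer accepted whole columns.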

I'm not familiar with Parquet's types, so I also had some questions about how they map to Arrow's data types in the data-type mappings.

ArrowJS also has a rich set of zero-dependency I/O primitives that abstract over Node's and browsers' respective I/O primitives (with fairly comprehensive tests). We'd be happy to discuss ways to leverage those if that's something you'd be interested in.

Best, Paul

kessler commented 4 years ago

@trxcllnt hey, sorry for the EXTREMELY delayed reply. I'm not at all familiar with Apache Arrow. However, I'd be glad to have a quick chat/video with you and Paul (@asmuth) to try and answer your questions.

0xgeert commented 3 years ago

Did something ever come out of this?

ali-habibzadeh commented 3 years ago

Did something ever come out of this?

I guess not :(

alippai commented 3 years ago

This may be relevant to this issue: https://issues.apache.org/jira/browse/ARROW-11593

trxcllnt commented 3 years ago

Sorry to leave everybody hanging. No, @kessler and I never connected offline.

kessler commented 3 years ago

@trxcllnt we could still do that :)

kylebarron commented 2 years ago

Maybe relevant to people in this thread: I put together a basic but functional WebAssembly Parquet reader/writer here, which decodes and encodes into Arrow's IPC format.