hyparam / hyparquet

parquet file parser for javascript
MIT License

Write to parquet file? #6

Open caligo-erik opened 2 months ago

caligo-erik commented 2 months ago

Is it possible to write to parquet file using this library? (quickly checked the code, didn't see any write function).

platypii commented 2 months ago

No plans for writing parquet files at this time.

I could be convinced otherwise, but generally I feel that if you are creating parquet files, you are more likely to be in a backend environment, so it makes sense to use existing parquet libraries in languages like Python, C++, or Rust.

What I really want with this library is to make it easy to view parquet data in the browser, since there was no good library for decoding parquet files in javascript that was lightweight and could handle remote files efficiently.

You might like the work of @kylebarron on parquet-wasm. Hope you find what you need!

kylebarron commented 2 months ago

lightweight and could handle remote files efficiently.

Generally agree that "webassembly" and "lightweight" are not synonyms, but there's no technical blocker to handling remote files efficiently in parquet-wasm. In the latest release you're able to fetch individual row groups or columns from a Parquet file without downloading the entire file. And we could implement something like pyarrow's filters param, I just haven't taken the time to fully implement that yet.

caligo-erik commented 2 months ago

I'm creating an offline application (with local JS server and web application using Electron) that stores transactional data locally in the backend/server, and then uploads it to S3 to be analyzed with cloud-native tools such as Athena, QuickSight etc.

I'm looking for a lightweight library to read/write to Parquet file, and your library ticks all boxes except for the write function. I've checked other libraries but most aren't maintained.

kylebarron commented 2 months ago

You can use WebAssembly in Electron, so parquet-wasm should work out of the box.

caligo-erik commented 2 months ago

I know, but all the data handling happens in the server, which will take care of persisting files and such. The client will be very "stupid" and just display stuff.

platypii commented 2 months ago

Generally agree that "webassembly" and "lightweight" are not synonyms, but there's no technical blocker to handling remote files efficiently in parquet-wasm.

parquet-wasm has 5+ megabytes of wasm file, hyparquet is sub-100k of javascript. Loading can be much faster especially for time to first render.

Because hyparquet is not a compiled wasm blob, there is no need for transferring data across the wasm boundary, and no cold-start time for loading the wasm vm. Also I've done some optimizations for the web like if you are fetching a bunch of columns in a rowgroup, it will fetch the data in just one http request instead of multiple round trips. I'm guessing that parquet-wasm, if you can implement ranged-gets, probably doesn't coalesce the requests to save round trip time?

Huge respect for your work Kyle, I love reading your blog about parquet stuff. Definitely not knocking parquet-wasm! Just pointing out the reasons I built hyparquet. :)

kylebarron commented 2 months ago

Just pointing out the reasons I built hyparquet. :)

That's very fair! I think it's valuable to have a pure-JavaScript implementation!

My own bias is that Parquet is an absolutely perfect place for WebAssembly, because Parquet is such a complex spec with such a long tail of complexities. It's not that I don't want a pure-JS implementation; rather my own conclusion was that implementing a stable pure-JS Parquet implementation that supports all encodings and compressions would be an absolutely massive engineering effort. Most previous JS Parquet implementations were eventually abandoned.

Whereas there are a ton of people building databases in Rust, so the Parquet implementation is stable, fast, and loads into a binary representation. Perhaps it's a use case where the benefits of WebAssembly outweigh the costs.

So take encouragement with a hint of skepticism 🙂. If you're able to implement a stable pure-JS Parquet reader, it'll be really impressive!

parquet-wasm has 5+ megabytes of wasm file, hyparquet is sub-100k of javascript. Loading can be much faster especially for time to first render.

1.2MB brotli-compressed 😉, but yes. We might have different use cases: you might care more about time to first render, whereas I'm more focused on handling large datasets, where 1.2MB is very small compared to the data savings from Parquet.

I'm guessing that parquet-wasm, if you can implement ranged-gets, probably doesn't coalesce the requests to save round trip time?

It does. Multiple ranges are coalesced by default. The coalesce size is currently 1MB and not configurable though.
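To illustrate what coalescing means here, the following is a minimal sketch, not taken from either hyparquet or parquet-wasm. It assumes the 1MB figure acts as a maximum gap between neighbouring byte ranges (that interpretation is a guess), and the names `coalesceRanges` and `toRangeHeader` are hypothetical.

```javascript
// Merge byte ranges whose gap is at most maxGap bytes, so that nearby
// column chunks can be fetched with one HTTP range request instead of several.
// Ranges are { start, end } with end exclusive. All names are hypothetical.
function coalesceRanges(ranges, maxGap) {
  // Sort a copy by start offset so neighbours can be compared in one pass.
  const sorted = [...ranges].sort((a, b) => a.start - b.start)
  const merged = []
  for (const { start, end } of sorted) {
    const last = merged[merged.length - 1]
    if (last && start - last.end <= maxGap) {
      // Close enough to the previous range: extend it.
      last.end = Math.max(last.end, end)
    } else {
      merged.push({ start, end })
    }
  }
  return merged
}

// HTTP Range headers use inclusive end offsets.
function toRangeHeader({ start, end }) {
  return `bytes=${start}-${end - 1}`
}

// Three made-up column chunks: two close together, one far away.
const columnChunks = [
  { start: 4, end: 1000 },
  { start: 1200, end: 5000 },
  { start: 9000000, end: 9500000 },
]
const requests = coalesceRanges(columnChunks, 1 << 20) // assumed 1MB gap
console.log(requests.map(toRangeHeader))
// → [ 'bytes=4-4999', 'bytes=9000000-9499999' ]
```

The trade-off is that a merged request also downloads the bytes in the gap, so a larger threshold saves round trips at the cost of some wasted transfer.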

kylebarron commented 2 months ago

Also note that the people in loaders.gl are also building a pure-TypeScript Parquet implementation, which I think was forked from parquets. It might be worth reaching out to them.

caligo-erik commented 2 months ago

Thanks for the additional information.

The application server - the only project managing data - doesn't know anything about any UI or client libraries, that's why I'd rather stick with a JS/TS library to read/write Parquet files.

severo commented 2 months ago

1.2MB brotli-compressed 😉, but yes. We might have different use cases: you might care more about time to first render, whereas I'm more focused on handling large datasets, where 1.2MB is very small compared to the data savings from Parquet.

it's particularly valuable when we're interested only in reading the metadata.

kylebarron commented 2 months ago

it's particularly valuable when we're interested only in reading the metadata.

Your use case involves reading the metadata only... but not the data?

platypii commented 2 months ago

Oh if we're talking compressed size, then hyparquet is 24.1kb compressed 😉

severo commented 2 months ago

Your use case involves reading the metadata only... but not the data?

Yes, we just launched a Parquet metadata viewer: https://huggingface.co/datasets/HuggingFaceFW/fineweb/tree/main/data/CC-MAIN-2013-20?show_file_info=data%2FCC-MAIN-2013-20%2F000_00000.parquet

It's powered by hyparquet!
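For context on why a metadata-only read is cheap: a Parquet file ends with the Thrift-encoded file metadata, followed by a 4-byte little-endian metadata length and the magic bytes "PAR1". The sketch below shows how the metadata byte range could be located from the last 8 bytes alone. It is an illustration under that layout, not hyparquet's actual code, and `metadataRange` is a hypothetical name.

```javascript
// Given the last 8 bytes of a Parquet file and the total file size, return
// the byte range of the file metadata (end exclusive).
// Tail layout: [metadata][4-byte little-endian metadata length]["PAR1"].
// metadataRange is a hypothetical helper, not part of any library.
function metadataRange(tail, fileSize) {
  if (tail.byteLength !== 8) throw new Error('expected exactly 8 bytes')
  const magic = String.fromCharCode(...tail.subarray(4, 8))
  if (magic !== 'PAR1') throw new Error('not a parquet file')
  const view = new DataView(tail.buffer, tail.byteOffset, tail.byteLength)
  const metadataLength = view.getUint32(0, true) // true = little-endian
  const end = fileSize - 8
  const start = end - metadataLength
  // The file also begins with a 4-byte "PAR1" magic, so metadata cannot start before byte 4.
  if (start < 4) throw new Error('metadata length exceeds file size')
  return { start, end }
}

// Synthetic tail for a made-up 1000-byte file with 300 bytes of metadata.
const tail = new Uint8Array(8)
new DataView(tail.buffer).setUint32(0, 300, true)
tail.set([0x50, 0x41, 0x52, 0x31], 4) // "PAR1"
console.log(metadataRange(tail, 1000))
// → { start: 692, end: 992 }
```

With a remote file, the tail and then the metadata range can each be fetched with an HTTP range request, so only a small fraction of a large file needs to be downloaded. A reader might also fetch a larger tail up front to avoid the second round trip.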