kylebarron / parquet-wasm

Rust-based WebAssembly bindings to read and write Apache Parquet data
https://kylebarron.dev/parquet-wasm/
Apache License 2.0
482 stars 19 forks source link

Object store wasm usage #490

Closed H-Plus-Time closed 2 months ago

H-Plus-Time commented 2 months ago

NB: Starting this a little earlier for design feedback on the object-store-wasm crate itself (primarily around ehttp vs reqwest).

The usage is very straightforward (given all the plumbing is present for ParquetObjectReader), I ended up doing it as a straight swap, leaving the existing HttpFileReader code for posterity/easy switchback.

H-Plus-Time commented 2 months ago

The main question I had was around your experiments with ehttp (as a replacement for reqwest) - did you find it was worthwhile?

I did quite like the lower impedance model (that is, the very fetch-like behaviour), though the lack of header enums was a slight annoyance (the primary reason I've left the reqwest dep in on the object-store-wasm - I'm fairly certain the compiler should just leave in the enums, string conversion logic and strip everything else).

I had hoped that ehttp would be the solution to the streaming problem (having to fake it, that is), but the mpsc channel + spawn_local trick actually ended up being the solve. It looks like it shouldn't be a problem to do the exact same thing with reqwest (I suspect I just didn't see it at the time), so it's really just a question of ergonomics (for me, and any other maintainers), how much appetite there is for porting the changes over to the OG object-store.

H-Plus-Time commented 2 months ago

Ah, also, given it does change the deps (hopefully not significantly), I'll get the delta generation working again in a separate PR (before this one's ready).

kylebarron commented 2 months ago

I ended up doing it as a straight swap, leaving the existing HttpFileReader code for posterity/easy switchback.

🙏 I'm happy to have both in the the codebase for a while as we figure out which design makes more sense.

The main question I had was around your experiments with ehttp (as a replacement for reqwest) - did you find it was worthwhile?

I think I just got to a point where I didn't fully understand Rust-async-streams or something like that, and there was existing code using reqwest that I could look at and figure out how to make work. So I don't think I got to the point where I formed a concrete opinion over whether to use ehttp or reqwest.

I'll get the delta generation working again in a separate PR (before this one's ready).

That would be awesome 🙏

jdoig commented 2 months ago

This is great! I actually sat down last night trying to use parquet-wasm to make my own wasm object-store replacement for Datafusion... It's great to see someone else is already on it :partying_face:

github-actions[bot] commented 2 months ago

Asset Sizes

AssetUncompressed SizeCompressed Size
async_full/parquet_wasm_bg.wasm5.45MB $\color{red}\textbf{+173KB +3\%}$1.28MB $\color{red}\textbf{+63.2KB +5\%}$
slim/parquet_wasm_bg.wasm3.46MB $\color{red}\textbf{+3.92KB +0\%}$544KB $\color{red}\textbf{+1.03KB +0\%}$
sync/parquet_wasm_bg.wasm4.74MB $\color{green}\textbf{-328B -0\%}$1.04MB $\color{red}\textbf{+84B +0\%}$
kylebarron commented 2 months ago

How are you doing on this? Is this ready for a review?

H-Plus-Time commented 2 months ago

Yeah, I've published the MVP version of the crate (didn't feel comfortable contributing a git-pinned dep), I think it's good to go.

kylebarron commented 2 months ago

Can you merge in main? Then I think we should be good to go

kylebarron commented 2 months ago

Amazing, thank you!