Open jenanwise opened 2 years ago
Hi @jenanwise, and thank you for your analysis. I think it is a great idea to split the flag into multiple subflags.
We may keep `io_parquet_compression = ["io_parquet_compression_zstd", "io_parquet_compression_snappy", ...]`
since, when reading, users usually want to support all compressions.
@jenanwise, would you like to work on this? It is a good first issue.
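A minimal sketch of how the split described above might look in the library's `Cargo.toml`. Only the two subflag names quoted in the comment come from the thread; the `parquet2/zstd` and `parquet2/snappy` forwarding targets are assumptions about how the underlying crate names its features, not a confirmed layout.

```toml
[features]
# Hypothetical per-codec subflags. Each one forwards to an assumed
# feature of the parquet2 dependency.
io_parquet_compression_zstd = ["parquet2/zstd"]
io_parquet_compression_snappy = ["parquet2/snappy"]

# Umbrella flag kept for users who want every codec available when reading.
io_parquet_compression = [
    "io_parquet_compression_zstd",
    "io_parquet_compression_snappy",
    # ... remaining codecs would follow the same pattern
]
```

With this shape, existing users of `io_parquet_compression` see no change, while users who need a single codec can enable just that subflag.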
@jorgecarleitao Happily!
+1, this is also needed for me to depend on arrow2 and datafusion simultaneously: they both use zstd-sys, but at different versions, which creates a conflict.
- advise folks who want finer-grained control to skip the `io_parquet_compression` flag, and instead directly put a `parquet2` dependency in their `Cargo.toml`, as I'm doing above. This seems rather fragile
Just a note that this works fine for my needs 🤷♂️
https://github.com/kylebarron/parquet-wasm/blob/93c498484f997a85a97c049cf4c1cbacce04fab8/Cargo.toml#L24-L29. I turn on each individual parquet2 dependency as needed.
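A hedged sketch of that workaround in a downstream `Cargo.toml`, relying on Cargo's feature unification. The version numbers are placeholders and the `zstd` feature name on `parquet2` is an assumption; check both against the crates actually in use.

```toml
[dependencies]
# Enable parquet support in arrow2 without the umbrella
# io_parquet_compression flag. The version is a placeholder.
arrow2 = { version = "0.12", default-features = false, features = ["io_parquet"] }

# Depend on parquet2 directly and turn on only the codec that is needed.
# The version must be semver-compatible with the one arrow2 resolves to;
# otherwise Cargo builds two copies and this feature never reaches the
# copy that arrow2 uses.
parquet2 = { version = "0.13", default-features = false, features = ["zstd"] }
```

The version coupling is what makes this approach fragile: an arrow2 upgrade that moves to an incompatible `parquet2` release silently drops the codec until the direct dependency is bumped as well.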
@kylebarron thanks!
Apologies. I took a brief look: the implementation is trivial, but I didn't get around to creating a test set. However, I'm no longer using `arrow2`, so the issue is up for grabs.
@jorgecarleitao Btw, it looks like this issue was addressed by #1207
Right now, `arrow2` exposes the `io_parquet_compression` feature flag to opt in to the compression formats for parquet. However, this enables all of `lz4`, `zstd`, `snappy`, `gzip`, and `brotli`, each of which can be fairly heavyweight to build. I imagine most folks will only want to use one of the above; e.g., I am only using `zstd`.

On my laptop in a bare repository with this `Cargo.toml`:

`cargo +nightly build --release --timings`

Switching to this `Cargo.toml`:

It seems like there might be two solutions:
- advise folks who want finer-grained control to skip the `io_parquet_compression` flag, and instead directly put a `parquet2` dependency in their `Cargo.toml`, as I'm doing above. This seems rather fragile.
- add `io_parquet_compression_zstd` etc. flags.

Thoughts?