jorgecarleitao / arrow2

Transmute-free Rust library to work with the Arrow format
Apache License 2.0
1.06k stars 222 forks source link

Separate flags for parquet2 compression formats? #920

Open jenanwise opened 2 years ago

jenanwise commented 2 years ago

Right now, arrow2 exposes the io_parquet_compression feature flag to opt-in to the compression formats for parquet. However, this enables all of lz4, zstd, snappy, gzip, and brotli, each of which can be fairly heavyweight to build. I imagine most folks will only want to use one of the above — e.g., I am only using zstd.

On my laptop in a bare repository with thisCargo.toml:

[dependencies]
arrow2 = { version = "0.10.1", features = ["io_parquet"], default-features = false }

cargo +nightly build --release --timings

Unit Total Codegen
1. arrow2 v0.10.1 15.1s 11.1s (73%)
2. bitpacking v0.8.4 14.9s 7.9s (53%)
3. zstd-sys v1.6.3+zstd.1.5.2 build script (run) 12.5s
4. brotli v3.3.3 10.4s 8.0s (77%)
5. parquet-format-async-temp v0.2.0 8.3s 5.6s (67%)  
6. arrow-format v0.4.0 5.8s 4.3s (74%)
7. serde_derive v1.0.136 5.6s
8. parquet2 v0.10.3 4.9s 4.1s (83%)
9. syn v1.0.89 4.1s 1.9s (46%)
10. futures-util v0.3.21 4.0s 0.7s (16%)
11. brotli-decompressor v2.3.2 3.9s 3.3s (85%)
12. serde v1.0.136 3.2s 0.4s (11%)
13. lz4-sys v1.9.3 build script (run) 2.9s  

Switching to this Cargo.toml:

[dependencies]
arrow2 = { version = "0.10.1", features = ["io_parquet"], default-features = false }
parquet2 = { version = "0.10", default_features = false, features = ["zstd"] }
Unit Total Codegen
1. arrow2 v0.10.1 13.5s 11.1s (82%)
2. bitpacking v0.8.4 12.5s 5.0s (40%)
3. zstd-sys v1.6.3+zstd.1.5.2 build script (run) 8.4s
4. parquet-format-async-temp v0.2.0 6.4s 3.9s (61%)
5. serde_derive v1.0.136 5.0s
6. futures-util v0.3.21 4.2s 1.0s (25%)
7. arrow-format v0.4.0 3.7s 2.5s (66%)
8. serde v1.0.136 3.2s 0.5s (16%)
9. syn v1.0.89 2.9s 1.3s (46%)
10. chrono v0.4.19 1.8s 1.1s (62%)
11. futures-macro v0.3.21 1.7s
12. parquet2 v0.10.3 1.6s 1.1s (67%)
13. async-trait v0.1.52 1.6s

It seems like there might be two solutions:

  1. advise folks who want finer-grained control to skip the io_parquet_compression flag, and instead directly put a parquet2 dependency in their Cargo.toml, as I'm doing above. This seems rather fragile.
  2. add individual io_parquet_compression_zstd etc flags.

Thoughts?

jorgecarleitao commented 2 years ago

Hi @jenanwise and thank you for your analysis. I think it is a great idea to split the flag in multiple subflags.

We may keep io_parquet_compression = ["io_parquet_compression_zstd", "io_parquet_compression_snappy", ...] since when reading, users usually want to support all compressions.

jorgecarleitao commented 2 years ago

@jenanwise , would you like to work on this? It is a good first issue.

jenanwise commented 2 years ago

@jorgecarleitao Happily!

ozgrakkurt commented 2 years ago

+1, this is also needed for me to depend on arrow2 and datafusion simultaneously since they both use zstd-sys but they use different versions so it creates a conflict

kylebarron commented 2 years ago
  1. advise folks who want finer-grained control to skip the io_parquet_compression flag, and instead directly put a parquet2 dependency in their Cargo.toml, as I'm doing above. This seems rather fragile

Just a note that this works fine for my needs 🤷‍♂️
https://github.com/kylebarron/parquet-wasm/blob/93c498484f997a85a97c049cf4c1cbacce04fab8/Cargo.toml#L24-L29. I turn on each individual parquet2 dependency as needed.

ozgrakkurt commented 2 years ago

@kylebarron thanks!

jenanwise commented 1 year ago

Apologies. I took a brief look — the implementation is trivial but I didn't get around to creating a test set. However, I'm no longer using arrow2, so the issue is up for grabs.

MostlyAmiable commented 11 months ago

@jorgecarleitao Btw, it looks like this issue was addressed by #1207