pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Allow on-the-fly decompression when reading CSV/JSON data? #8323

Open baggiponte opened 1 year ago

baggiponte commented 1 year ago

Problem description

pandas can decompress JSON and CSV files on the fly while reading them. For example:

import pandas as pd

# Read a gzipped, line-delimited JSON file in chunks of 100 rows, stopping after 1000 rows.
with pd.read_json("path/to/compressed.json.gz", lines=True, chunksize=100, nrows=1000) as reader:
    for chunk in reader:
        ...

At first I assumed polars could not (or should not) do this, because I thought pandas had to decompress the whole file under the hood before reading it. However, read_json allows reading data in chunks, and I noticed that only the specified chunksize is decompressed at a time.

The bad news is that this only happens when the engine parameter is set to ujson: with pyarrow, decompression is significantly slower, and I assumed this is because pyarrow has to decompress the whole file (it might simply be slower overall, but I couldn't tell).
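For reference, a minimal sketch of the two engine paths described above (assuming pandas >= 2.0, where the engine parameter was added; the pyarrow engine only works with lines=True):

    import pandas as pd

    # Default engine (ujson): supports reading the compressed file in chunks.
    with pd.read_json("path/to/compressed.json.gz", lines=True, chunksize=100) as reader:
        for chunk in reader:
            ...

    # pyarrow engine: reads the whole (decompressed) file in one go.
    df = pd.read_json("path/to/compressed.json.gz", lines=True, engine="pyarrow")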

Do you think that Rust's arrow2 might support that? Or that polars should have this feature?

Would love to help, but unfortunately I am only proficient on the Python side.
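In the meantime, one possible user-side workaround is to stream the decompression in Python and hand fixed-size chunks to polars. This is only a sketch, assuming the data is line-delimited JSON (NDJSON); the helper name is made up for illustration and this is not a built-in polars feature:

    import gzip
    import io
    import itertools

    import polars as pl

    # Hypothetical helper: stream-decompress a gzipped NDJSON file and parse it
    # in fixed-size chunks, so only one chunk of decompressed text is held in
    # memory at a time.
    def read_ndjson_gz_in_chunks(path: str, chunksize: int = 100):
        with gzip.open(path, "rt") as f:
            while True:
                lines = list(itertools.islice(f, chunksize))
                if not lines:
                    break
                yield pl.read_ndjson(io.BytesIO("".join(lines).encode()))

    for chunk in read_ndjson_gz_in_chunks("path/to/compressed.ndjson.gz"):
        ...  # each chunk is a small polars DataFrame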

lucifermorningstar1305 commented 3 days ago

Has this been integrated into Python now?

ohanf commented 3 days ago

As of 1.7.1 I added support in pl.read_json; please confirm your use case is being met. There was already support via both of the NDJSON read/scan functions before that release. Note that, as far as I can tell, we do have to decompress the whole file into memory first for both file formats. I don't think we can easily adopt the chunking suggestion with the current implementation, although I only got familiar enough with the code to add what was missing, so I could be missing something.
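For anyone landing here, a minimal sketch of the behaviour described above (assuming polars >= 1.7.1; per the comment, the compressed input is decompressed fully in memory before parsing):

    import polars as pl

    # Compressed JSON: supported in pl.read_json as of 1.7.1 (per the comment above).
    df = pl.read_json("path/to/compressed.json.gz")

    # Compressed NDJSON: reportedly supported earlier via the read/scan functions.
    df_nd = pl.read_ndjson("path/to/compressed.ndjson.gz")
    lf = pl.scan_ndjson("path/to/compressed.ndjson.gz")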