baggiponte opened 1 year ago
Has this been integrated into Python now?
As of 1.7.1 I added support in `pl.read_json`; please confirm your use case is being met. There was actually support via both of the ndjson read/scan functions before that release. Note that I believe we do have to decompress the whole file into memory first for both file formats. I don't think we can easily adopt the chunking suggestion with the current implementation, although I only got familiar enough with the code to add what was missing, so I could be missing something.
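For reference, a minimal sketch of the usage described above, assuming polars >= 1.7.1 (the file names are illustrative gzip-compressed inputs, and per the comment the whole file is still decompressed into memory first):

```python
import polars as pl

# JSON decompression support landed in pl.read_json as of 1.7.1
# (per the comment above); "data.json.gz" is a hypothetical file.
df = pl.read_json("data.json.gz")

# The NDJSON read/scan functions already handled compressed inputs
# before that release:
df_nd = pl.read_ndjson("data.ndjson.gz")
lf = pl.scan_ndjson("data.ndjson.gz")
```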
### Problem description
`pandas` can decompress JSON and CSV files before reading. As an example:
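A minimal sketch of the kind of call being described, assuming gzip-compressed files (the file names here are illustrative, not from the original report):

```python
import pandas as pd

# pandas infers the compression from the file extension by default
# (compression="infer"), so .gz inputs are decompressed transparently.
df_csv = pd.read_csv("data.csv.gz")
df_json = pd.read_json("data.json.gz")
```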
At first I thought `polars` could/would/should not do it, because I thought that, in order to decompress a file, `pandas` would have to decompress the whole file under the hood. However, `read_json` allows reading data in chunks, and I noticed that only the specified `chunksize` is decompressed.

The bad news is that this only happens when the `engine` parameter is set to `ujson`: when using `pyarrow`, I noticed that decompression is significantly slower, and I assumed this happened because `pyarrow` has to decompress the whole file (it might be that it simply is slower, but I couldn't tell).
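For concreteness, a hedged sketch of the chunked read described above (assuming pandas >= 2.0 for the `engine` parameter; the file name is an illustrative line-delimited JSON file):

```python
import pandas as pd

# chunksize requires lines=True and returns an iterator of DataFrames
# instead of a single frame; compression is again inferred from the
# extension, and only the requested chunk appeared to be decompressed.
with pd.read_json(
    "data.ndjson.gz",
    lines=True,
    chunksize=10_000,
    engine="ujson",  # the default; engine="pyarrow" was noticeably slower
) as reader:
    for chunk in reader:
        print(chunk.shape)  # each chunk is a regular DataFrame
```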
Do you think that Rust's `arrow2` might support that? Or that `polars` should have this feature?

Would love to help, but unfortunately I am only proficient on the Python side.