pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

reading a gzip ndjson file handle causes a panic #14593

Open llimllib opened 7 months ago

llimllib commented 7 months ago

Reproducible example

$ printf '{"alpha": "beta"}\n{"gamma": "delta"}' | gzip > test.log.gz

$ gzcat test.log.gz
{"alpha": "beta"}
{"gamma": "delta"}

$ cat parse.py
import gzip
import polars as pl

print(f"polars version: {pl.__version__}")

with gzip.open("test.log.gz") as f:
    df = pl.read_ndjson(f)

Log output

$ POLARS_VERBOSE=1 python parse.py
polars version: 0.20.10
/private/tmp/polarsbug/parse.py:7: UserWarning: Polars found a filename. Ensure you pass a path to the file instead of a python file object when possible for best performance.
  df = pl.read_ndjson(f)
thread '<unnamed>' panicked at /Users/runner/work/polars/polars/crates/polars-io/src/mmap.rs:80:37:
called `Option::unwrap()` on a `None` value
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/private/tmp/polarsbug/parse.py", line 7, in <module>
    df = pl.read_ndjson(f)
  File "/Users/llimllib/.local/share/asdf/installs/python/3.10.12/lib/python3.10/site-packages/polars/io/ndjson.py", line 49, in read_ndjson
    return pl.DataFrame._read_ndjson(
  File "/Users/llimllib/.local/share/asdf/installs/python/3.10.12/lib/python3.10/site-packages/polars/dataframe/frame.py", line 1066, in _read_ndjson
    self._df = PyDataFrame.read_ndjson(
pyo3_runtime.PanicException: called `Option::unwrap()` on a `None` value

Issue description

  1. The polars documentation suggests that passing a file object is valid, so it seems to me that this ought to succeed, or at least raise a useful error instead of panicking if it's not supported.
  2. It works as expected if you replace f with f.read().
  3. The point for me, though, is that I would prefer not to read the file into memory. Is there any way to read a gzip-compressed ndjson file without loading the whole thing into memory?
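
One possible workaround for point 3, as a minimal sketch rather than a confirmed fix: stream-decompress the archive to a temporary file and hand polars a real path instead of a file object. The helper names `decompress_to_tempfile` and `read_gzip_ndjson` are hypothetical and not part of polars. This costs disk space for the decompressed copy, and `pl.read_ndjson` still materializes the resulting DataFrame in memory.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def decompress_to_tempfile(src: str) -> Path:
    """Stream-decompress a gzip file to a temporary .ndjson file on disk."""
    with gzip.open(src, "rb") as compressed:
        with tempfile.NamedTemporaryFile(suffix=".ndjson", delete=False) as out:
            # copyfileobj copies in fixed-size chunks, so the decompressed
            # data is never held in Python memory all at once.
            shutil.copyfileobj(compressed, out)
    return Path(out.name)


def read_gzip_ndjson(src: str):
    """Hypothetical helper: decompress to disk, then pass polars a path."""
    import polars as pl  # imported lazily; assumes polars is installed

    path = decompress_to_tempfile(src)
    try:
        return pl.read_ndjson(path)
    finally:
        path.unlink()  # remove the temporary file once the frame is built
```

With the file from the reproducible example, this would be called as `read_gzip_ndjson("test.log.gz")`. Passing the decompressed path to `pl.scan_ndjson` instead may allow lazy processing, though that is untested here.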

Expected behavior

I expect the script to work, or at the very least to fail with a useful error instead of panicking.

Installed versions

```
--------Version info---------
Polars: 0.20.10
Index type: UInt32
Platform: macOS-14.2.1-arm64-arm-64bit
Python: 3.10.12 (main, Jun 22 2023, 22:46:35) [Clang 14.0.3 (clang-1403.0.22.14.1)]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fsspec: 2023.12.2
gevent:
hvplot:
matplotlib: 3.8.2
numpy: 1.24.4
openpyxl:
pandas: 2.1.2
pyarrow: 15.0.0
pydantic: 2.5.3
pyiceberg:
pyxlsb:
sqlalchemy: 2.0.19
xlsx2csv:
xlsxwriter:
```

rongcuid commented 3 months ago

I encountered the same issue with an lzma file:

# %%
import lzma
import polars as pl

# %%
with lzma.open("time.json.xz", "rt") as f:
    pl.read_ndjson(f)