pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

`scan_ndjson` ignores `n_rows` and `head` and scans entire file #13825

Open · openmodlogs opened 9 months ago

openmodlogs commented 9 months ago

Checks

Reproducible example


```python
import time
import polars as pl

# Write a 10-million-line NDJSON file (one JSON object per line).
test_data = '\n'.join(f'{{"id":{i}, "name": "{i}"}}' for i in range(10_000_000))
with open('big_test_data.json', 'w+') as outfile:
    outfile.write(test_data)

# Write a 100-line NDJSON file for comparison.
test_data = '\n'.join(f'{{"id":{i}, "name": "{i}"}}' for i in range(100))
with open('small_test_data.json', 'w+') as outfile:
    outfile.write(test_data)

start = time.time()
df = pl.scan_ndjson('small_test_data.json', n_rows=10).collect()
print(time.time() - start)

start = time.time()
df = pl.scan_ndjson('big_test_data.json', n_rows=10).collect()
print(time.time() - start)

start = time.time()
df = pl.scan_ndjson('big_test_data.json').head(10).collect()
print(time.time() - start)

# 0.0070285797119140625  (small file, n_rows=10)
# 6.309820175170898      (big file,   n_rows=10)
# 6.420231819152832      (big file,   .head(10))
```

Log output

No response

Issue description

`scan_ndjson` scans the entire file no matter which arguments are passed. The same thing happens when calling `.head(10)` before `.collect()`.

Expected behavior

I expect `scan_ndjson` to "Stop reading from JSON file after reading `n_rows`" when a value for `n_rows` is provided. I'd expect the same result when using `.head(n)` before `.collect()`, or when using `.fetch(n)`.

Installed versions

```
--------Version info---------
Polars:              0.20.4
Index type:          UInt32
Platform:            Windows-10-10.0.19045-SP0
Python:              3.12.0 (tags/v3.12.0:0fb18b0, Oct 2 2023, 13:03:39) [MSC v.1935 64 bit (AMD64)]
----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fsspec:
gevent:
hvplot:
matplotlib:          3.8.2
numpy:               1.26.2
openpyxl:
pandas:              2.1.4
pyarrow:             14.0.2
pydantic:            2.5.3
pyiceberg:
pyxlsb:
sqlalchemy:
xlsx2csv:            0.8.2
xlsxwriter:
```
openmodlogs commented 9 months ago

I should note that I am using polars-lts-cpu, and I get the same results with versions 0.20.4 and 0.20.5.

taki-mekhalfa commented 9 months ago

I can't reproduce

```
In [2]: %timeit pl.scan_ndjson('big_test_data.json', n_rows=10).collect()
160 µs ± 829 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [3]: %timeit pl.scan_ndjson('big_test_data.json').collect()
410 ms ± 14.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

polars: 0.20.5, CPU: M1