pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

DataFrame.write_json does not convert all rows #14672

Open gvwilson opened 6 months ago

gvwilson commented 6 months ago

birds.csv

Reproducible example

In the same directory as birds.csv (attached):

import json
import polars as pl
import sys

original = pl.read_csv("birds.csv")
df = original.filter(pl.col("year") == 2021)
print(f"{len(df)} rows after filtering")

as_json = df.write_json(pretty=True, row_oriented=True)
roundtrip = json.loads(as_json)
print(f"{len(roundtrip)} rows after round trip")

restored = pl.DataFrame(roundtrip)
print(f"{len(restored)} restored rows")

produces:

588 rows after filtering
144 rows after round trip
144 restored rows

To confirm this is an actual problem:

grep ,2021, birds.csv | wc
     588     588   34368
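Because `grep ,2021,` matches that text in any column, a stricter cross-check is to parse the file and compare the `year` field exactly. The sketch below uses only the standard library. It assumes the CSV has a header row containing a `year` column (as the filter above implies); the helper name `count_rows_for_year` is made up for illustration.

```python
import csv


def count_rows_for_year(lines, year="2021", column="year"):
    """Count CSV records whose `column` field equals `year` exactly.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Values are compared as strings because csv does no type conversion.
    """
    reader = csv.DictReader(lines)
    return sum(1 for record in reader if record[column] == year)


# Hypothetical usage against the attached file:
#     with open("birds.csv", newline="") as handle:
#         print(count_rows_for_year(handle))
```

If this count agrees with the 588 reported after filtering, the rows are being lost after the filter step rather than during the read.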

I am using Polars 0.20.10 and Python 3.12.1.

Log output

python find_missing_birds.py
avg line length: 58.158203
std. dev. line length: 2.7567902
initial row estimate: 2750
no. of chunks: 8 processed by: 8 threads.
dataframe filtered

Issue description

Performing the equivalent read-filter-convert-roundtrip operation with Pandas 2.2.1 produces the correct result (588 rows).
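For reference, the pandas comparison could look roughly like the sketch below. It is not the exact script used for this report: the function name `roundtrip_counts` is made up for illustration, and it assumes only that the CSV has a `year` column. It uses `DataFrame.to_json(orient="records")` as the closest pandas counterpart to row-oriented JSON output.

```python
import json

import pandas as pd


def roundtrip_counts(csv_source, year=2021):
    """Read, filter, convert to row-oriented JSON, and restore.

    Returns the row count at each stage so they can be compared:
    (after filtering, after JSON round trip, after restoring).
    """
    original = pd.read_csv(csv_source)
    df = original[original["year"] == year]

    # orient="records" emits a JSON array with one object per row.
    as_json = df.to_json(orient="records", indent=2)
    roundtrip = json.loads(as_json)

    restored = pd.DataFrame(roundtrip)
    return len(df), len(roundtrip), len(restored)


# Hypothetical usage against the attached file:
#     print(roundtrip_counts("birds.csv"))
```

All three counts should match if no rows are dropped during the conversion.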

Expected behavior

The output should be 588 rows, but the JSON produced by write_json includes only the first 144. By inspection, I cannot see anything in the dataset that would cause the conversion to stop prematurely: all characters are 7-bit ASCII, and while some num values (the last column of the CSV) are missing, they occur well before the point where conversion stops, and the JSON correctly represents them as null.

Installed versions

```
--------Version info---------
Polars:               0.20.10
Index type:           UInt32
Platform:             macOS-14.2.1-arm64-arm-64bit
Python:               3.12.1 | packaged by Anaconda, Inc. | (main, Jan 19 2024, 09:45:58) [Clang 14.0.6 ]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fsspec:
gevent:
hvplot:
matplotlib:
numpy:                1.26.4
openpyxl:
pandas:               2.2.1
pyarrow:
pydantic:             2.6.1
pyiceberg:
pyxlsb:
sqlalchemy:
xlsx2csv:
xlsxwriter:
```

cmdlineluser commented 6 months ago

birds.csv does not seem to be attached.

gvwilson commented 6 months ago

Apologies, I just tried again and the link now appears in the original issue.