pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Dangerous default behavior for json_decode if first 100 elements are 'null' #13061

Open severinh opened 9 months ago

severinh commented 9 months ago


Reproducible example

>>> import polars as pl
>>> 
>>> df = pl.DataFrame({"json_column": ["null"] * 1000 + ["1.0"]})
>>> df.select(pl.col("json_column").str.json_decode()) # The default behavior silently replaces the 1.0 with null.
shape: (1_001, 1)
┌─────────────┐
│ json_column │
│ ---         │
│ null        │
╞═════════════╡
│ null        │
│ null        │
│ null        │
│ null        │
│ …           │
│ null        │
│ null        │
│ null        │
│ null        │
└─────────────┘
>>> df.select(pl.col("json_column").str.json_decode(infer_schema_length=None)) # Here the decoding is correct.
shape: (1_001, 1)
┌─────────────┐
│ json_column │
│ ---         │
│ f64         │
╞═════════════╡
│ null        │
│ null        │
│ null        │
│ null        │
│ …           │
│ null        │
│ null        │
│ null        │
│ 1.0         │
└─────────────┘

Log output

N/A

Issue description

In our system, we read a column with a JSON value from a database into a DataFrame, and then decode it using Polars' str.json_decode().

We had a production incident where the data in the column was mostly (but not entirely) null (sparse data). In particular, the first 100 elements were null. The default behavior of json_decode is to infer the data type from the first 100 values. So in this case, the inferred data type became null, and as a result, the entire column became null. The few sparse values were dropped on the floor, resulting in wrong calculations downstream.

We solved the problem with an emergency release that set infer_schema_length to None, thereby checking the entire column.
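
For reference, the fix looks like the snippet below. Passing an explicit dtype (when it is known up front) should also sidestep inference entirely, though we went with infer_schema_length=None:

```python
import polars as pl

df = pl.DataFrame({"json_column": ["null"] * 1000 + ["1.0"]})

# Workaround used in our emergency release: infer the schema from the whole column.
fixed_full_scan = df.select(
    pl.col("json_column").str.json_decode(infer_schema_length=None)
)

# Alternative when the schema is known up front: pass an explicit dtype,
# which skips inference entirely.
fixed_explicit_dtype = df.select(
    pl.col("json_column").str.json_decode(pl.Float64)
)
```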

Expected behavior

I believe this is a dangerous default behavior in the presence of such sparse data.

I can think of two solutions:

  1. Set the default value of infer_schema_length to None, since that's the safest option. I realize, though, that this may not be desirable, since it could potentially be very expensive.
  2. Keep the default value of 100, but don't count null values when inferring the schema. This would solve this particular case (a rough sketch of the idea is below).
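
To make option 2 concrete, here is a rough user-level sketch; json_decode_skip_nulls is a made-up helper name, and the case where the sample ends up empty (an all-null column) is not handled:

```python
import polars as pl

def json_decode_skip_nulls(s: pl.Series, sample_size: int = 100) -> pl.Series:
    # Sample the first `sample_size` values that are neither missing nor JSON "null".
    sample = s.filter(s.is_not_null() & (s != "null")).head(sample_size)
    # Infer the dtype from that sample only (not useful if the sample is empty).
    inferred_dtype = sample.str.json_decode(infer_schema_length=sample_size).dtype
    # Decode the full column with the dtype inferred from the non-null sample.
    return s.str.json_decode(inferred_dtype)

df = pl.DataFrame({"json_column": ["null"] * 1000 + ["1.0"]})
decoded = json_decode_skip_nulls(df.get_column("json_column"))  # dtype: Float64
```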

Installed versions

```
--------Version info---------
Polars:              0.19.19
Index type:          UInt32
Platform:            macOS-14.1.1-arm64-arm-64bit
Python:              3.11.6 (main, Oct 2 2023, 13:45:54) [Clang 15.0.0 (clang-1500.0.40.1)]
----Optional dependencies----
adbc_driver_manager: 0.4.0
cloudpickle:
connectorx:          0.3.2
deltalake:
fsspec:
gevent:
matplotlib:          3.7.2
numpy:               1.26.2
openpyxl:            3.1.2
pandas:              2.1.3
pyarrow:             14.0.1
pydantic:            2.4.2
pyiceberg:
pyxlsb:
sqlalchemy:          2.0.16
xlsx2csv:            0.8.1
xlsxwriter:          3.1.2
```
Julian-J-S commented 9 months ago

Or option 3)

Infer the schema from the first 100 rows, but if different datatypes show up after that, raise an error (instead of, as in this case, setting all later values to null, which is, as you mentioned, suuuuuupeeeer dangerous; lucky you even noticed this silent error!)
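
Until the engine can raise on its own, a user-side check along these lines might catch it; json_decode_strict is just a made-up name, and it only detects the failure mode from this report (real values silently coming back as null):

```python
import polars as pl

def json_decode_strict(s: pl.Series, infer_schema_length: int = 100) -> pl.Series:
    # Decode with the usual (possibly too narrow) inferred schema.
    decoded = s.str.json_decode(infer_schema_length=infer_schema_length)
    # Flag values that were real JSON (not missing, not "null") but decoded to null.
    silently_nulled = decoded.is_null() & s.is_not_null() & (s != "null")
    if silently_nulled.any():
        raise ValueError(
            f"json_decode silently turned {silently_nulled.sum()} value(s) into null; "
            "the inferred schema is probably too narrow"
        )
    return decoded

df = pl.DataFrame({"json_column": ["null"] * 1000 + ["1.0"]})
json_decode_strict(df.get_column("json_column"))  # raises ValueError
```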

elliott-omosheye commented 5 months ago

Got burnt by this as well. All my tests passed, but it started causing errors in production. I created sample test cases to try to understand the problem and could not replicate it, since they had fewer than 100 values. This can be a dangerous footgun currently!