pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

`read_ndjson()` and `read_parquet()` behave differently when the input is a list of files with different schemas #18306

Open etiennebacher opened 1 month ago

etiennebacher commented 1 month ago

Reproducible example

import polars as pl

# ndjson
pl.DataFrame({"x": [1]}).write_ndjson("foo.json")
pl.DataFrame({"y": [1]}).write_ndjson("foo2.json")
pl.read_ndjson(["foo.json", "foo2.json"])

# parquet
pl.DataFrame({"x": [1]}).write_parquet("foo.parquet")
pl.DataFrame({"y": [1]}).write_parquet("foo2.parquet")
pl.read_parquet(["foo.parquet", "foo2.parquet"])

Log output

# ndjson

shape: (2, 1)
┌───────────┐
│ a         │
│ ---       │
│ list[i64] │
╞═══════════╡
│ [1, 2]    │
│ null      │
└───────────┘

# parquet

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\etienne\AppData\Roaming\Python\Python311\site-packages\polars\_utils\deprecation.py", line 91, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\etienne\AppData\Roaming\Python\Python311\site-packages\polars\_utils\deprecation.py", line 91, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\etienne\AppData\Roaming\Python\Python311\site-packages\polars\io\parquet\functions.py", line 208, in read_parquet
    return lf.collect()
           ^^^^^^^^^^^^
  File "C:\Users\etienne\AppData\Roaming\Python\Python311\site-packages\polars\lazyframe\frame.py", line 2027, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.SchemaError: schema names differ at index 0: x != y

Issue description

`read_ndjson()` and `read_parquet()` behave differently when the input is a list of files with different schemas: `read_ndjson()` returns a DataFrame, while `read_parquet()` raises a `SchemaError`.

Expected behavior

Both functions should have the same behavior.

Installed versions

```
--------Version info---------
Polars:               1.5.0
Index type:           UInt32
Platform:             Windows-10-10.0.19045-SP0
Python:               3.11.0 (main, Oct 24 2022, 18:26:48) [MSC v.1933 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fastexcel:
fsspec:               2023.6.0
gevent:
great_tables:
hvplot:               0.9.2
matplotlib:           3.7.1
nest_asyncio:         1.5.6
numpy:                1.24.3
openpyxl:
pandas:               2.0.3
pyarrow:              12.0.1
pydantic:             2.6.4
pyiceberg:
sqlalchemy:
torch:
xlsx2csv:
xlsxwriter:
```

ion-elgreco commented 1 month ago

Both should just deepmerge the schemas or have an option to provide the schema