pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Exception thrown when converting a chunked Arrow Table with struct and dictionary columns to a Polars DataFrame #16040

Open reductionnist opened 2 weeks ago

reductionnist commented 2 weeks ago

Checks

Reproducible example

import polars as pl
import pyarrow as pa

table = pa.table({'col1': [1, 2, 3], 'col2': [{'a': 1, 'b': 2}, None, {'a': 3, 'b': 4}], 'col3': pa.array(['A', 'B', 'A'], pa.string()).dictionary_encode()})
table2 = pa.concat_tables([table.slice(0, 1), table.slice(0, 2)])  # yields multi-chunk columns
pl.from_arrow(table2)  # raises

Log output

No response

Issue description

Hi, pl.from_arrow() raises an exception when a chunked Arrow table contains both dictionary and struct columns. This appears to be caused by the logic in arrow_to_pydf, which omits the dictionary columns from the df being constructed whenever there are struct columns. The following replacement code appears to fix the issue:

    if len(dictionary_cols) > 0 or len(struct_cols) > 0:
        df = wrap_df(pydf)
        # Add both groups in one pass so neither is dropped (uses itertools.chain).
        special_cols = itertools.chain(dictionary_cols.values(), struct_cols.values())
        df = df.with_columns([F.lit(s).alias(s.name) for s in special_cols])
        reset_order = True
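
The failure mode described above can be illustrated without polars. The sketch below is a toy model, not the library's code: plain dicts stand in for DataFrames, and the function names `merge_separately` and `merge_together` are hypothetical. It assumes the two column groups are attached in separate steps that each start from the same base frame, which is one reading of the description above. Under that assumption the second step discards what the first one added, and a later reorder by column name cannot find the dictionary column.

```python
# Toy model of the column-merging step; plain dicts stand in for DataFrames.
# All names here are hypothetical and are not polars internals.
from itertools import chain


def merge_separately(base, dictionary_cols, struct_cols):
    """Each branch restarts from `base`, so the second overwrites the first."""
    df = dict(base)
    if dictionary_cols:
        df = {**base, **dictionary_cols}
    if struct_cols:
        df = {**base, **struct_cols}  # dictionary columns are lost here
    return df


def merge_together(base, dictionary_cols, struct_cols):
    """Both groups are added in one step, as in the proposed replacement."""
    df = dict(base)
    if dictionary_cols or struct_cols:
        df.update(chain(dictionary_cols.items(), struct_cols.items()))
    return df


# Same values as the concatenated table in the reproducible example.
base = {"col1": [1, 1, 2]}
struct_cols = {"col2": [{"a": 1, "b": 2}, {"a": 1, "b": 2}, None]}
dictionary_cols = {"col3": ["A", "A", "B"]}
names = ["col1", "col2", "col3"]

broken = merge_separately(base, dictionary_cols, struct_cols)
fixed = merge_together(base, dictionary_cols, struct_cols)

# Columns that a reorder by `names` would fail to find.
missing = [n for n in names if n not in broken]
print(missing)      # ['col3']
print(list(fixed))  # ['col1', 'col3', 'col2']
```

In the toy model, selecting `names` from `fixed` succeeds because every column is present, while `broken` lacks `col3`.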

Expected behavior

It should convert to a DataFrame instead of raising an exception.

Installed versions

```
--------Version info---------
Polars:               0.20.10
Index type:           UInt32
Platform:             Linux-6.8.7-100.fc38.x86_64-x86_64-with-glibc2.37
Python:               3.11.8 | packaged by conda-forge | (main, Feb 16 2024, 20:53:32) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:          3.0.0
connectorx:
deltalake:
fsspec:               2024.2.0
gevent:
hvplot:
matplotlib:           3.8.3
numpy:                1.26.4
openpyxl:
pandas:               2.2.0
pyarrow:              15.0.0
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:           2.0.27
xlsx2csv:
xlsxwriter:
```