pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

ShapeError when applying map_elements on an Object type column #14531

Open mariasolomon opened 8 months ago

mariasolomon commented 8 months ago

Reproducible example

import polars as pl

def create_struct_lists_static(row):
    list_libelle_dict = []

    for lang, label in row.items():
        if lang and label:
            list_libelle_dict.append({"lang": lang, "label": label})

    if not list_libelle_dict:
        list_libelle_dict.append({"lang": "", "label": ""})

    return list_libelle_dict

df = pl.DataFrame([
            {'pe_id': 45456, 'libelle_multilangue': {'id': 'Some brand name'}},
            {'pe_id': 87878, 'libelle_multilangue': {}},
            {'pe_id': 34343, 'libelle_multilangue': {'id': 'Another brand name'}},
            {'pe_id': 767655, 'libelle_multilangue': {'id': 'Anoche another brand name'}}],
            schema= {'pe_id': pl.UInt64, 'libelle_multilangue': pl.Object})

batch = df[:2]
batch = batch.with_columns(pl.col("libelle_multilangue").map_elements(create_struct_lists_static, return_dtype=pl.List(pl.Struct([pl.Field("lang", pl.Utf8), pl.Field("label", pl.Utf8)])))
                        .alias('multilang_label'))

batch.select("multilang_label")

Log output

---------------------------------------------------------------------------
ShapeError                                Traceback (most recent call last)
Cell In[6], line 42
     39 with pl.Config(verbose=True):  
     41     batch = df[:2]
---> 42     batch = batch.with_columns(pl.col("libelle_multilangue").map_elements(create_struct_lists_static, return_dtype=pl.List(pl.Struct([pl.Field("lang", pl.Utf8), pl.Field("label", pl.Utf8)])))
     43                             .alias('multilang_label'))
     45     batch.select("multilang_label")

File ~/project/.tox/py3.9/lib/python3.9/site-packages/polars/dataframe/frame.py:8301, in DataFrame.with_columns(self, *exprs, **named_exprs)
   8155 def with_columns(
   8156     self,
   8157     *exprs: IntoExpr | Iterable[IntoExpr],
   8158     **named_exprs: IntoExpr,
   8159 ) -> DataFrame:
   8160     """
   8161     Add columns to this DataFrame.
   8162 
   (...)
   8299     └─────┴──────┴─────────────┘
   8300     """
-> 8301     return self.lazy().with_columns(*exprs, **named_exprs).collect(_eager=True)

File ~/project/.tox/py3.9/lib/python3.9/site-packages/polars/lazyframe/frame.py:1935, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, no_optimization, streaming, background, _eager)
   1932 if background:
   1933     return InProcessQuery(ldf.collect_concurrently())
-> 1935 return wrap_df(ldf.collect())

ShapeError: unable to add a column of length 4 to a DataFrame of height 2

Issue description

It works fine on any other column type. It also works fine if the DataFrame is built from a dictionary of lists, such as:

df = pl.DataFrame({'pe_id': [45456, 87878, 34343, 767655],
                    'libelle_multilangue': [{'id': 'Some brand name'},
                                            {'id': 'Another brand name'},
                                            {},
                                            {'id': 'Anoche another brand name'}]},
            schema= {'pe_id': pl.UInt64, 'libelle_multilangue': pl.Object})

However, if the DataFrame is built from a list of dictionaries, the map operation seems to be applied to the original column of length 4 rather than to the batch column of length 2:

df = pl.DataFrame([
            {'pe_id': 45456, 'libelle_multilangue': {'id': 'Some brand name'}},
            {'pe_id': 87878, 'libelle_multilangue': {}},
            {'pe_id': 34343, 'libelle_multilangue': {'id': 'Another brand name'}},
            {'pe_id': 767655, 'libelle_multilangue': {'id': 'Anoche another brand name'}}],
            schema= {'pe_id': pl.UInt64, 'libelle_multilangue': pl.Object})

Expected behavior

The map operation should work on an Object-type column of a DataFrame built from a list of dictionaries.
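
To make the expected result concrete, the callback can be checked on its own against plain Python dicts, without Polars. This is only a sketch: the function body is a condensed equivalent of `create_struct_lists_static` from the reproducible example, and `first_two_rows` and `expected` are names introduced here for illustration. It shows the two values that the `multilang_label` column of `batch` would be expected to hold.

```python
def create_struct_lists_static(row):
    # Condensed equivalent of the callback in the reproducible example:
    # one {"lang", "label"} dict per non-empty key/value pair.
    labels = [
        {"lang": lang, "label": label}
        for lang, label in row.items()
        if lang and label
    ]
    # Fall back to a single empty entry when the dict yields nothing.
    return labels or [{"lang": "", "label": ""}]


# The 'libelle_multilangue' values of the first two rows (i.e. df[:2]).
first_two_rows = [{"id": "Some brand name"}, {}]

expected = [create_struct_lists_static(row) for row in first_two_rows]
print(len(expected))  # 2, matching the height of the batch
print(expected)
```

The callback yields exactly one list per input row, so for a batch of height 2 the result should also have length 2, not 4.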

Installed versions

```
--------Version info---------
Polars:               0.20.8
Index type:           UInt32
Platform:             macOS-14.3.1-arm64-arm-64bit
Python:               3.9.6 (default, Aug 11 2023, 19:44:49) [Clang 15.0.0 (clang-1500.0.40.1)]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:           0.3.2
deltalake:
fsspec:               2021.11.1
gevent:
hvplot:
matplotlib:
numpy:                1.26.3
openpyxl:
pandas:               2.2.0
pyarrow:              13.0.0
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:           1.4.27
xlsx2csv:
xlsxwriter:
```

cmdlineluser commented 8 months ago

An attempt at a minimal repro.

Both frames appear to be equal:

a = pl.DataFrame([
        {"A": {"id": "foo"}},
        {"A": {}},
    ],
    schema= {"A": pl.Object}
)

b = pl.DataFrame(
    {"A": [{"id": "foo"}, {}]},
    schema= {"A": pl.Object}
)

a.to_dicts() == b.to_dicts()
# True

The callback does appear to receive the "slice" correctly:

a.slice(0, 1).with_columns(pl.col("A").map_elements(lambda x: [print("[DEBUG]:", x), x][1]))
# [DEBUG]: {'id': 'foo'}
# shape: (1, 1)
# ┌───────────┐
# │ A         │
# │ ---       │
# │ struct[1] │
# ╞═══════════╡
# │ {"foo"}   │
# └───────────┘

b.slice(0, 1).with_columns(pl.col("A").map_elements(lambda x: [print("[DEBUG]:", x), x][1]))
# [DEBUG]: {'id': 'foo'}
# shape: (1, 1)
# ┌───────────┐
# │ A         │
# │ ---       │
# │ struct[1] │
# ╞═══════════╡
# │ {"foo"}   │
# └───────────┘

But if you interact with it in any meaningful way, the length mismatch happens:

a.slice(0, 1).with_columns(pl.col("A").map_elements(lambda x: x.get("id")))
# ShapeError: unable to add a column of length 2 to a DataFrame of height 1

b.slice(0, 1).with_columns(pl.col("A").map_elements(lambda x: x.get("id")))
# shape: (1, 1)
# ┌─────┐
# │ A   │
# │ --- │
# │ str │
# ╞═════╡
# │ foo │
# └─────┘