pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

map_elements applied to dataframe with empty column or batch with empty column returns series with length 0. #17812

Open Baukebrenninkmeijer opened 1 month ago

Baukebrenninkmeijer commented 1 month ago

Reproducible example

import re

import polars as pl

schema = {'id': pl.Int64, 'name': pl.Utf8, 'value': pl.Float64, 'flag': pl.String}


def extract(s: str | None) -> dict:
    # Null input: return a dict with every schema key set to None.
    if s is None:
        return {k: None for k in schema.keys()}
    # Capture the content of the first {...} group in the string.
    pattern = r'\{([^}]+)\}'
    first_bracket_content = re.search(pattern, s)
    if first_bracket_content:
        content = first_bracket_content.group(1)
        # Split the content into key=value pairs and keep only known keys.
        pairs = re.findall(r'(\w+(?:-\w+)*)=([^,}]+)', content)
        return_values = {k: None for k in schema.keys()}
        return_values.update({k: v.strip() for k, v in pairs if k in schema.keys()})
    else:
        return_values = {k: None for k in schema.keys()}
    return return_values

data = {'text_column': [None, None, None, None]}
df = pl.LazyFrame(data)
result = (
    df.select(
        pl.col('text_column').map_elements(extract, return_dtype=pl.Struct).cast(pl.Struct(schema)).alias('struct_col')
    )
    .collect()
)
print(result)
>>>
shape: (0, 1)
┌────────────┐
│ struct_col │
│ ---        │
│ struct[4]  │
╞════════════╡
└────────────┘

data = {
    'text_column': [
        '{id=1, name=test, value=3.14, flag=true}',
        '{id=2, name=example, value=invalid, flag=false}',
        '{}',  # Empty bracket
        'Invalid string',  # Completely invalid input
    ]
}

schema = {'id': pl.Int64, 'name': pl.Utf8, 'value': pl.Float64, 'flag': pl.String}

df = pl.LazyFrame(data)
result2 = (
    df.select(
        pl.col('text_column')
        .map_elements(extract, return_dtype=pl.Struct)
        .cast(pl.Struct(schema))
        .alias('struct_col')
    )
    .collect()
)
print(result2)
>>>
shape: (4, 1)
┌────────────────────────────┐
│ struct_col                 │
│ ---                        │
│ struct[4]                  │
╞════════════════════════════╡
│ {1,"test",3.14,"true"}     │
│ {2,"example",null,"false"} │
│ {null,null,null,null}      │
│ {null,null,null,null}      │
└────────────────────────────┘
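
To make clear what the UDF hands back to Polars, here is a minimal stdlib-only sketch of the same parsing step. The names `FIELDS` and `extract_fields` are illustrative and not part of the original report. It shows that every value comes back as a string (or None), which is why the example above casts to `pl.Struct(schema)` afterwards.

```python
import re

# Same keys as `schema` in the report; the dtypes are not needed for parsing.
FIELDS = ("id", "name", "value", "flag")


def extract_fields(s):
    """Return a dict with every key in FIELDS; keys not found map to None."""
    values = dict.fromkeys(FIELDS)
    if s is None:
        return values
    # Content of the first {...} group, if there is one.
    match = re.search(r"\{([^}]+)\}", s)
    if match:
        pairs = re.findall(r"(\w+(?:-\w+)*)=([^,}]+)", match.group(1))
        values.update({k: v.strip() for k, v in pairs if k in FIELDS})
    return values


print(extract_fields("{id=1, name=test, value=3.14, flag=true}"))
# {'id': '1', 'name': 'test', 'value': '3.14', 'flag': 'true'}
print(extract_fields("{}"))
# {'id': None, 'name': None, 'value': None, 'flag': None}
```

Because the values are plain strings, Polars infers a struct of string fields from the UDF output, not the Int64/Float64 fields declared in `schema`.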

Log output

If run with `with_columns`:

---------------------------------------------------------------------------
ShapeError                                Traceback (most recent call last)
Cell In[9], line 28
     23 data = {'text_column': [None, None, None, None]}
     25 df = pl.LazyFrame(data)
     26 result = df.with_columns(
     27     pl.col('text_column').map_elements(extract, return_dtype=pl.Struct).cast(pl.Struct(schema)).alias('struct_col')
---> 28 ).collect()
     29 print(result)
     31 data = {
     32     'text_column': [
     33         '{id=1, name=test, value=3.14, flag=true}',
   (...)
     37     ]
     38 }

File ~/Developer/ING/PSS Hardware Monitoring/.conda/lib/python3.11/site-packages/polars/lazyframe/frame.py:1942, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, background, _eager, **_kwargs)
   1939 # Only for testing purposes atm.
   1940 callback = _kwargs.get("post_opt_callback")
-> 1942 return wrap_df(ldf.collect(callback))

ShapeError: unable to add a column of length 0 to a DataFrame of height 4

Issue description

When this expression is used inside with_columns, the length mismatch (a Series of length 0 against a DataFrame of height 4) causes the query to fail.

Expected behavior

My expectation would be that, even with skip_nulls=True, the first example still returns a Series with shape (4, 1), filled with nulls.

Installed versions

```
--------Version info---------
Polars:               1.2.1
Index type:           UInt32
Platform:             macOS-14.5-arm64-arm-64bit
Python:               3.11.9 (main, Apr 19 2024, 11:43:47) [Clang 14.0.6 ]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:          3.0.0
connectorx:
deltalake:            0.18.2
fastexcel:
fsspec:               2024.6.1
gevent:
great_tables:
hvplot:               0.10.0
matplotlib:           3.8.4
nest_asyncio:         1.6.0
numpy:                2.0.0
openpyxl:             3.1.2
pandas:               2.2.2
pyarrow:              16.1.0
pydantic:
pyiceberg:
sqlalchemy:
torch:                2.3.1
xlsx2csv:
xlsxwriter:           3.2.0
```

cmdlineluser commented 1 month ago

This appears to be a minimal repro:

df = pl.DataFrame({"text_column": [None] * 2})

df.with_columns(
    pl.all().map_elements(lambda x: {"id": None}, return_dtype=pl.Struct)
      .cast(pl.Struct({"id": pl.Int64}))
)
# ShapeError: unable to add a column of length 0 to a DataFrame of height 2

There seems to be a problem with allowing an empty pl.Struct in return_dtype, e.g. https://github.com/pola-rs/polars/issues/8141 and https://github.com/pola-rs/polars/issues/17181

cmdlineluser commented 1 month ago

Just on the topic of the actual functionality, perhaps it could be done without map_elements:

df.with_columns(
    pl.struct(
        pl.col('text_column')
          .str.extract(r'\{([^}]+)\}')
          .str.extract(rf'{name}=([^,}}]+)')
          .cast(dtype, strict=False)
          .alias(name) 
        for name, dtype in schema.items()
    )
)
Baukebrenninkmeijer commented 1 month ago

@cmdlineluser I'll have a look at this. Previously I was iterating through each column in the with_columns, which worked in eager mode, but in lazy mode all columns got the value of the last column (possibly another bug). I'll see whether your example works; it would be much cleaner at least.

janikkokot commented 1 month ago

I am having the same issue. By specifying the fields, I managed to get it running.

from my_module import parse_parameter

import polars as pl

df = pl.DataFrame(
    {'parameter1': ['[MS, MS:1001477, SpectraST,]', 
                   '[MOD, MOD:00648, "N,O-diacetylated L-serine",]',
                   None],
     'parameter2': [None, None, None]}
)

df.with_columns(
    pl.all().map_elements(
        parse_parameter, 
        return_dtype=pl.Struct(
            fields=[pl.Field(name, pl.String) for name in 'abcd']
        ),
        )
)

For the parameter1 column it works without fields, but for the parameter2 column it does not.

Is it possible somehow to have Nulls in Struct columns? This would be really useful to me since I am validating the Struct against a Pydantic model which can either be Null or should have all fields.

cmdlineluser commented 1 month ago

A question I do not know the answer to is:

What is the correct way to return a schema'd dict from a UDF?

def return_dict(s):
    result = {"a": "1", "b": "two"}
    return result

schema = {"a": pl.Int64, "b": pl.String}

df = pl.DataFrame({"x": 1})
>>> df.with_columns(y = pl.all().map_elements(return_dict, return_dtype=pl.Struct(schema)))
SchemaError: expected output type 'Struct([Field { name: "a", dtype: Int64 }, Field { name: "b", dtype: String }])', got 'Struct([Field { name: "a", dtype: String }, Field { name: "b", dtype: String }])'; set `return_dtype` to the proper datatype

Setting return_dtype=pl.Struct lets you bypass this error, and you can then cast the result to the full schema (should this be allowed?):

>>> df.with_columns(y = pl.all().map_elements(return_dict, return_dtype=pl.Struct).cast(pl.Struct(schema)))
shape: (1, 2)
┌─────┬───────────┐
│ x   ┆ y         │
│ --- ┆ ---       │
│ i64 ┆ struct[2] │
╞═════╪═══════════╡
│ 1   ┆ {1,"two"} │
└─────┴───────────┘

But if there are no non-null values, you hit the ShapeError bug.

If we can modify the UDF, we can get a typed dict from a DataFrame, and set the correct return_dtype:

def return_dict(s):
    result = {"a": "1", "b": "two"}
    return pl.DataFrame(result).cast(schema).to_struct().item()

schema = {"a": pl.Int64, "b": pl.String}

df = pl.DataFrame({"x": 1})
df.with_columns(y = pl.all().map_elements(return_dict, return_dtype=pl.Struct(schema)))
# shape: (1, 2)
# ┌─────┬───────────┐
# │ x   ┆ y         │
# │ --- ┆ ---       │
# │ i64 ┆ struct[2] │
# ╞═════╪═══════════╡
# │ 1   ┆ {1,"two"} │
# └─────┴───────────┘
(pl.DataFrame({"x": [None]})
   .with_columns(y = pl.all().map_elements(return_dict, return_dtype=pl.Struct(schema)))
) 
# shape: (1, 2)
# ┌──────┬─────────────┐
# │ x    ┆ y           │
# │ ---  ┆ ---         │
# │ null ┆ struct[2]   │
# ╞══════╪═════════════╡
# │ null ┆ {null,null} │
# └──────┴─────────────┘

But I'm not sure if this is the way this is supposed to be done?

Baukebrenninkmeijer commented 1 month ago

I would be very surprised if this was the intended way. But it's good to have this as a workaround.

@janikkokot So putting the schema in fields rather than a dict works to circumvent this problem?

Baukebrenninkmeijer commented 1 month ago

@cmdlineluser

df.with_columns(
    pl.struct(
        pl.col('text_column')
          .str.extract(r'\{([^}]+)\}')
          .str.extract(rf'{name}=([^,}}]+)')
          .cast(dtype, strict=False)
          .alias(name) 
        for name, dtype in schema.items()
    )
)

This works flawlessly, thanks for suggesting it!