pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

`map_elements` confusing columns when looping in lazy mode #11384

Open DrMaphuse opened 1 year ago

DrMaphuse commented 1 year ago

Reproducible example

```
import polars as pl

cols = ["a", "b"]
data = pl.DataFrame([["many"], ["no"]], schema=cols).lazy()
values_to_append = {
    "a": "apples",
    "b": "pears",
}

for col, value in values_to_append.items():
    data = data.with_columns(
        pl.col(col).map_elements(lambda x: f"{x}_{value}").alias(col),
    )
print(data.collect())
```

```
shape: (1, 2)
┌────────────┬──────────┐
│ a          ┆ b        │
│ ---        ┆ ---      │
│ str        ┆ str      │
╞════════════╪══════════╡
│ many_pears ┆ no_pears │
└────────────┴──────────┘
```

Note: I know this can be done in a million other and better ways. My actual problem looks different, this example is just the simplest way I could think of to show what the issue is.

Log output

No response

Issue description

When using the `map_elements` function in a loop in lazy mode, the columns seem to get mixed up: the same suffix ends up applied to every column.

Expected behavior

The expected result is a DataFrame where “apples” is appended to all elements in column “a” and “pears” is appended to all elements in column “b”. Instead, "pears" is appended in both columns.

Installed versions

```
--------Version info---------
Polars: 0.19.5
Index type: UInt32
Platform: Linux-5.15.0-79-generic-x86_64-with-glibc2.35
Python: 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 19:20:46) [GCC 9.4.0]
----Optional dependencies----
adbc_driver_sqlite:
cloudpickle: 2.2.1
connectorx:
deltalake:
fsspec: 2023.3.0
gevent:
matplotlib: 3.6.3
numpy: 1.23.2
openpyxl: 3.0.10
pandas: 1.5.2
pyarrow: 9.0.0
pydantic: 1.10.6
pyiceberg:
pyxlsb:
sqlalchemy: 1.4.29
xlsx2csv: 0.8
xlsxwriter: 3.1.4
```
cmdlineluser commented 1 year ago

It's because you're defining a lambda in a loop.

https://docs.python.org/3/faq/programming.html#why-do-lambdas-defined-in-a-loop-with-different-values-all-return-the-same-result

`value` ends up bound to the same thing in every lambda (in this case `"pears"`), because each closure looks the variable up when it is called, not when it is defined.
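The late binding is plain Python behaviour, independent of Polars; a minimal sketch with no DataFrame at all:

```python
# Each lambda closes over the variable `value` itself, not its value at
# definition time. By the time any of them runs, the loop has finished,
# so they all see the final value.
funcs = []
for value in ["apples", "pears"]:
    funcs.append(lambda x: f"{x}_{value}")

print([f("many") for f in funcs])  # → ['many_pears', 'many_pears']
```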

You can pass it as a named argument, as shown in the FAQ example:

```
cols = ["a", "b"]
data = pl.DataFrame([["many"], ["no"]], schema=cols).lazy()

values_to_append = {
    "a": "apples",
    "b": "pears",
}

for col, value in values_to_append.items():
    data = data.with_columns(
        pl.col(col).map_elements(lambda x, value=value: f"{x}_{value}").alias(col),
    )

print(data.collect())

# shape: (1, 2)
# ┌─────────────┬──────────┐
# │ a           ┆ b        │
# │ ---         ┆ ---      │
# │ str         ┆ str      │
# ╞═════════════╪══════════╡
# │ many_apples ┆ no_pears │
# └─────────────┴──────────┘
```
DrMaphuse commented 1 year ago

Interesting. Since the lambda is called immediately in eager mode, the problem does not appear there. This also seems to have behaved differently in earlier versions of polars, since the code worked until a few versions ago.