List.eval giving unexpected results

jackaixin commented 1 month ago

Checks

[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

df = pl.DataFrame({
    'values': [[0], [0, 2], [0, 2, 4], [2, 4, 0], [4, 0, 8]],
    'weights': [[3], [2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]]
})

df.with_columns(
    pl.concat_list('values', 'weights').list.eval(pl.element().slice(0, pl.len() // 2)).alias('values1'),
    pl.concat_list('values', 'weights').list.eval(pl.element().slice(pl.len() // 2, pl.len() // 2)).alias('values2'),
    pl.concat_list('values', 'weights').list.eval(
        pl.element().slice(0, pl.len() // 2).dot(pl.element().slice(pl.len() // 2, pl.len() // 2))
    ).list.first().alias('dot'),
    pl.concat_list('values', 'weights').list.eval(
        (pl.element() * pl.element().shift(pl.len() // 2)).sum()
    ).list.first().alias('dot2'),
    pl.concat_list('values', 'weights').list.eval(
        pl.element().slice(0, pl.element().len() // 2) + pl.element().slice(pl.element().len() // 2, pl.element().len() // 2)
    ).alias('sum'),
    pl.concat_list('values', 'weights').list.eval(
        (pl.element() + pl.element().shift(pl.len() // 2)).slice(pl.len() // 2, pl.len() // 2)
    ).alias('sum2'),
)

Log output

No response

Issue description

I was trying to compute the dot product of values and weights using functions in the list namespace. I couldn't find a built-in list.dot, so I ended up using list.eval in the hacky way above. However, the code above returns:

shape: (5, 8)
┌───────────┬───────────┬───────────┬───────────┬─────┬──────┬────────────┬────────────┐
│ values    ┆ weights   ┆ values1   ┆ values2   ┆ dot ┆ dot2 ┆ sum        ┆ sum2       │
│ ---       ┆ ---       ┆ ---       ┆ ---       ┆ --- ┆ ---  ┆ ---        ┆ ---        │
│ list[i64] ┆ list[i64] ┆ list[i64] ┆ list[i64] ┆ i64 ┆ i64  ┆ list[i64]  ┆ list[i64]  │
╞═══════════╪═══════════╪═══════════╪═══════════╪═════╪══════╪════════════╪════════════╡
│ [0]       ┆ [3]       ┆ [0]       ┆ [3]       ┆ 0   ┆ 0    ┆ [0]        ┆ [3]        │
│ [0, 2]    ┆ [2, 3]    ┆ [0, 2]    ┆ [2, 3]    ┆ 4   ┆ 6    ┆ [0, 4]     ┆ [2, 5]     │
│ [0, 2, 4] ┆ [1, 2, 3] ┆ [0, 2, 4] ┆ [1, 2, 3] ┆ 20  ┆ 16   ┆ [0, 4, 8]  ┆ [1, 4, 7]  │
│ [2, 4, 0] ┆ [1, 2, 3] ┆ [2, 4, 0] ┆ [1, 2, 3] ┆ 20  ┆ 10   ┆ [4, 8, 0]  ┆ [3, 6, 3]  │
│ [4, 0, 8] ┆ [1, 2, 3] ┆ [4, 0, 8] ┆ [1, 2, 3] ┆ 80  ┆ 28   ┆ [8, 0, 16] ┆ [5, 2, 11] │
└───────────┴───────────┴───────────┴───────────┴─────┴──────┴────────────┴────────────┘

values1 and values2 are what I expected from the pl.element().slice operations, but dot and sum appear to operate on the first slice alone (first_slice.dot(first_slice) and first_slice + first_slice) instead of computing first_slice.dot(second_slice) and first_slice + second_slice.
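The mismatch can be checked without Polars. The plain-Python sketch below recomputes the observed dot and sum columns by pairing each values list with itself, which reproduces the numbers in the table above. The helpers `dot` and `add` are hypothetical names used only for illustration.

```python
# Plain-Python check, independent of Polars.
# `dot` and `add` are illustrative helpers, not Polars functions.
values = [[0], [0, 2], [0, 2, 4], [2, 4, 0], [4, 0, 8]]


def dot(a, b):
    """Sum of element-wise products of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def add(a, b):
    """Element-wise sum of two equal-length lists."""
    return [x + y for x, y in zip(a, b)]


# Pair the first slice with itself, as the unexpected columns appear to do.
self_dot = [dot(v, v) for v in values]
self_sum = [add(v, v) for v in values]

print(self_dot)  # [0, 4, 20, 20, 80], same as the `dot` column
print(self_sum)  # [[0], [0, 4], [0, 4, 8], [4, 8, 0], [8, 0, 16]], same as `sum`
```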

Expected behavior

I expect the dot column to match the dot2 column exactly, and the sum column to match sum2.
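For reference, the shift-based expressions behind dot2 and sum2 can be emulated in plain Python. This is only a sketch under stated assumptions: a positive shift moves elements toward the end and fills the front with nulls, and sum skips nulls. The helper `shift` is a hypothetical stand-in, not a Polars function.

```python
# Plain-Python emulation of the shift-based expressions; assumptions are noted inline.
values = [[0], [0, 2], [0, 2, 4], [2, 4, 0], [4, 0, 8]]
weights = [[3], [2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]]


def shift(xs, n):
    """Move elements n positions toward the end, filling the front with None."""
    return [None] * n + xs[: len(xs) - n]


dot2, sum2 = [], []
for v, w in zip(values, weights):
    combined = v + w                  # one row of concat_list('values', 'weights')
    half = len(combined) // 2
    shifted = shift(combined, half)   # values now line up with the weights half

    # Assumption: null products are skipped by sum, so only the second half contributes.
    products = [c * s for c, s in zip(combined, shifted) if s is not None]
    dot2.append(sum(products))

    # Element-wise sum, then keep only the second half (slice(half, half)).
    sums = [None if s is None else c + s for c, s in zip(combined, shifted)]
    sum2.append(sums[half:half + half])

print(dot2)  # [0, 6, 16, 10, 28]
print(sum2)  # [[3], [2, 5], [1, 4, 7], [3, 6, 3], [5, 2, 11]]
```

These values agree with the dot2 and sum2 columns in the output table, which is why those columns are taken as the expected result.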

Installed versions

```
--------Version info---------
Polars:               1.0.0
Index type:           UInt32
Platform:             macOS-14.5-arm64-arm-64bit
Python:               3.11.9 (main, Apr 2 2024, 08:25:04) [Clang 15.0.0 (clang-1500.3.9.4)]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fastexcel:
fsspec:
gevent:
great_tables:
hvplot:
matplotlib:           3.8.2
nest_asyncio:         1.6.0
numpy:                1.26.2
openpyxl:
pandas:               2.1.4
pyarrow:              15.0.0
pydantic:
pyiceberg:
sqlalchemy:
torch:
xlsx2csv:
xlsxwriter:
```

jackaixin commented 1 month ago

Also, please kindly suggest the best way to perform dot product in my example above.

cmdlineluser commented 1 month ago

Yeah, that does not look right:

df = pl.DataFrame({
    'foo': [[1, 2, 3, 4]]
})

df.select(
    pl.col.foo.list.eval(pl.element().slice(0, 2)).alias('x'),
    pl.col.foo.list.eval(pl.element().slice(2, 2)).alias('y'),
    pl.col.foo.list.eval(
        pl.element().slice(0, 2) + pl.element().slice(2, 2)
    ).alias('x + y')
)

# shape: (1, 3)
# ┌───────────┬───────────┬───────────┐
# │ x         ┆ y         ┆ x + y     │
# │ ---       ┆ ---       ┆ ---       │
# │ list[i64] ┆ list[i64] ┆ list[i64] │
# ╞═══════════╪═══════════╪═══════════╡
# │ [1, 2]    ┆ [3, 4]    ┆ [2, 4]    │ # <- ERROR: x + x?
# └───────────┴───────────┴───────────┘

ruoyu0088 commented 1 month ago

> Also, please kindly suggest the best way to perform dot product in my example above.

You can use explode and group_by:

df = pl.DataFrame({
    'values': [[0], [0, 2], [0, 2, 4], [2, 4, 0], [4, 0, 8]],
    'weights': [[3], [2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]]
})

(
    df
    .lazy()
    .with_row_index()
    .explode('values', 'weights')
    .group_by('index', maintain_order=True)
    .agg(
        'values',
        'weights',
        (pl.col.values * pl.col.weights).sum().alias('dot')
    )
    .drop('index')
    .collect()
)

shape: (5, 3)
┌───────────┬───────────┬─────┐
│ values    ┆ weights   ┆ dot │
│ ---       ┆ ---       ┆ --- │
│ list[i64] ┆ list[i64] ┆ i64 │
╞═══════════╪═══════════╪═════╡
│ [0]       ┆ [3]       ┆ 0   │
│ [0, 2]    ┆ [2, 3]    ┆ 6   │
│ [0, 2, 4] ┆ [1, 2, 3] ┆ 16  │
│ [2, 4, 0] ┆ [1, 2, 3] ┆ 10  │
│ [4, 0, 8] ┆ [1, 2, 3] ┆ 28  │
└───────────┴───────────┴─────┘

jackaixin commented 1 month ago

@ruoyu0088 thanks for this. I tried another version with explode:

q2 = (
    df
    .lazy()
    .with_row_index()
    .select(
        'values',
        'weights',
        (pl.col('values').explode() * pl.col('weights').explode()).sum().over('index').alias('dot')
    )
)

q2.collect()

which returns the same result. Performance is similar on the small example above (mine is slightly faster). However, when I applied your version (df.explode.group_by.agg) and mine (explode + over) to a larger DataFrame (~3 million rows), yours was 2x faster than mine. Do you have any idea why that might be?

Another observation about your explode implementation: it seems to consume much more memory than the list.eval version, although the explode.group_by version is faster than list.eval.
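One plausible contributor to the memory difference (an assumption, not something measured in this thread) is that explode materializes one row per list element before grouping, so the intermediate frame is as long as the total number of elements rather than the number of original rows. A plain-Python count for the small example:

```python
# Count rows before and after a hypothetical explode of the example data.
values = [[0], [0, 2], [0, 2, 4], [2, 4, 0], [4, 0, 8]]

rows_before = len(values)                         # one row per list
rows_after_explode = sum(len(v) for v in values)  # one row per list element

print(rows_before, rows_after_explode)  # 5 12
```

With longer lists and millions of rows, that intermediate frame (plus a repeated row index per element) grows accordingly.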
