pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

List.eval giving unexpected results #17370

Open jackaixin opened 1 month ago

jackaixin commented 1 month ago

Reproducible example

import polars as pl

df = pl.DataFrame({
    'values': [[0], [0, 2], [0, 2, 4], [2, 4, 0], [4, 0, 8]],
    'weights': [[3], [2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]]
})

df.with_columns(
    pl.concat_list('values', 'weights').list.eval(pl.element().slice(0, pl.len() // 2)).alias('values1'),
    pl.concat_list('values', 'weights').list.eval(pl.element().slice(pl.len() // 2, pl.len() // 2)).alias('values2'),
    pl.concat_list('values', 'weights').list.eval(
        pl.element().slice(0, pl.len() // 2).dot(pl.element().slice(pl.len() // 2, pl.len() // 2))
    ).list.first().alias('dot'),
    pl.concat_list('values', 'weights').list.eval(
        (pl.element() * pl.element().shift(pl.len() // 2)).sum()
    ).list.first().alias('dot2'),
    pl.concat_list('values', 'weights').list.eval(
        pl.element().slice(0, pl.element().len() // 2) + pl.element().slice(pl.element().len() // 2, pl.element().len() // 2)
    ).alias('sum'),
    pl.concat_list('values', 'weights').list.eval(
        (pl.element() + pl.element().shift(pl.len() // 2)).slice(pl.len() // 2, pl.len() // 2)
    ).alias('sum2'),
)

Log output

No response

Issue description

I was trying to compute the dot product of values and weights using functions in the list namespace. I couldn't find a built-in list.dot, so I ended up using list.eval in the hacky way shown above. However, that code returned:

shape: (5, 8)
┌───────────┬───────────┬───────────┬───────────┬─────┬──────┬────────────┬────────────┐
│ values    ┆ weights   ┆ values1   ┆ values2   ┆ dot ┆ dot2 ┆ sum        ┆ sum2       │
│ ---       ┆ ---       ┆ ---       ┆ ---       ┆ --- ┆ ---  ┆ ---        ┆ ---        │
│ list[i64] ┆ list[i64] ┆ list[i64] ┆ list[i64] ┆ i64 ┆ i64  ┆ list[i64]  ┆ list[i64]  │
╞═══════════╪═══════════╪═══════════╪═══════════╪═════╪══════╪════════════╪════════════╡
│ [0]       ┆ [3]       ┆ [0]       ┆ [3]       ┆ 0   ┆ 0    ┆ [0]        ┆ [3]        │
│ [0, 2]    ┆ [2, 3]    ┆ [0, 2]    ┆ [2, 3]    ┆ 4   ┆ 6    ┆ [0, 4]     ┆ [2, 5]     │
│ [0, 2, 4] ┆ [1, 2, 3] ┆ [0, 2, 4] ┆ [1, 2, 3] ┆ 20  ┆ 16   ┆ [0, 4, 8]  ┆ [1, 4, 7]  │
│ [2, 4, 0] ┆ [1, 2, 3] ┆ [2, 4, 0] ┆ [1, 2, 3] ┆ 20  ┆ 10   ┆ [4, 8, 0]  ┆ [3, 6, 3]  │
│ [4, 0, 8] ┆ [1, 2, 3] ┆ [4, 0, 8] ┆ [1, 2, 3] ┆ 80  ┆ 28   ┆ [8, 0, 16] ┆ [5, 2, 11] │
└───────────┴───────────┴───────────┴───────────┴─────┴──────┴────────────┴────────────┘

values1 and values2 are what I expected from the pl.element().slice operations, but dot and sum seem to operate on the first slice alone (first_slice.dot(first_slice) and first_slice + first_slice) instead of computing first_slice.dot(second_slice) and first_slice + second_slice.

Expected behavior

I expect the dot column to be exactly the same as the dot2 column, and the sum column to be the same as sum2.
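
As a sanity check on the expected numbers, here is a minimal pure-Python sketch (no Polars involved; the helper name `dot` is hypothetical and used only for illustration). It computes values·weights per row and, for comparison, values·values. The second list reproduces the figures reported in the dot column, which is consistent with the first slice being used for both operands.

```python
# Row data copied from the reproducible example above.
values = [[0], [0, 2], [0, 2, 4], [2, 4, 0], [4, 0, 8]]
weights = [[3], [2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]]


def dot(a, b):
    """Plain dot product of two equal-length lists (hypothetical helper)."""
    return sum(x * y for x, y in zip(a, b))


# What the issue expects: values . weights for each row.
expected = [dot(v, w) for v, w in zip(values, weights)]

# What the reported output looks like: values . values for each row.
first_slice_twice = [dot(v, v) for v in values]

print(expected)           # [0, 6, 16, 10, 28]  (same as the dot2 column)
print(first_slice_twice)  # [0, 4, 20, 20, 80]  (same as the reported dot column)
```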

Installed versions

```
--------Version info---------
Polars:               1.0.0
Index type:           UInt32
Platform:             macOS-14.5-arm64-arm-64bit
Python:               3.11.9 (main, Apr 2 2024, 08:25:04) [Clang 15.0.0 (clang-1500.3.9.4)]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fastexcel:
fsspec:
gevent:
great_tables:
hvplot:
matplotlib:           3.8.2
nest_asyncio:         1.6.0
numpy:                1.26.2
openpyxl:
pandas:               2.1.4
pyarrow:              15.0.0
pydantic:
pyiceberg:
sqlalchemy:
torch:
xlsx2csv:
xlsxwriter:
```

jackaixin commented 1 month ago

Also, please kindly suggest the best way to perform dot product in my example above.

cmdlineluser commented 1 month ago

Yeah, that does not look right:

df = pl.DataFrame({
    'foo': [[1, 2, 3, 4]]
})

df.select(
    pl.col.foo.list.eval(pl.element().slice(0, 2)).alias('x'),
    pl.col.foo.list.eval(pl.element().slice(2, 2)).alias('y'),
    pl.col.foo.list.eval(
        pl.element().slice(0, 2) + pl.element().slice(2, 2)
    ).alias('x + y')
)

# shape: (1, 3)
# ┌───────────┬───────────┬───────────┐
# │ x         ┆ y         ┆ x + y     │
# │ ---       ┆ ---       ┆ ---       │
# │ list[i64] ┆ list[i64] ┆ list[i64] │
# ╞═══════════╪═══════════╪═══════════╡
# │ [1, 2]    ┆ [3, 4]    ┆ [2, 4]    │ # <- ERROR: x + x?
# └───────────┴───────────┴───────────┘
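
To make the discrepancy concrete, here is a small plain-Python sketch of the same arithmetic (no Polars; the variable names are illustrative only). Python slices stand in for `slice(offset, length)`, so `slice(2, 2)` becomes `foo[2:4]`.

```python
foo = [1, 2, 3, 4]

x = foo[0:2]  # mirrors pl.element().slice(0, 2) -> [1, 2]
y = foo[2:4]  # mirrors pl.element().slice(2, 2) -> [3, 4]

# Element-wise sum of the two halves, which is what "x + y" should give.
x_plus_y = [a + b for a, b in zip(x, y)]

# Element-wise sum of the first half with itself, for comparison.
x_plus_x = [a + a for a in x]

print(x_plus_y)  # [4, 6]
print(x_plus_x)  # [2, 4]  (the value shown in the "x + y" column above)
```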

ruoyu0088 commented 1 month ago

> Also, please kindly suggest the best way to perform dot product in my example above.

You can use explode and group_by:

df = pl.DataFrame({
    'values': [[0], [0, 2], [0, 2, 4], [2, 4, 0], [4, 0, 8]],
    'weights': [[3], [2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]]
})

(
df
.lazy()
.with_row_index()
.explode('values', 'weights')
.group_by('index', maintain_order=True)
.agg(
    'values', 
    'weights',
    (pl.col.values * pl.col.weights).sum().alias('dot')
)
.drop('index')
.collect()
)
shape: (5, 3)
┌───────────┬───────────┬─────┐
│ values    ┆ weights   ┆ dot │
│ ---       ┆ ---       ┆ --- │
│ list[i64] ┆ list[i64] ┆ i64 │
╞═══════════╪═══════════╪═════╡
│ [0]       ┆ [3]       ┆ 0   │
│ [0, 2]    ┆ [2, 3]    ┆ 6   │
│ [0, 2, 4] ┆ [1, 2, 3] ┆ 16  │
│ [2, 4, 0] ┆ [1, 2, 3] ┆ 10  │
│ [4, 0, 8] ┆ [1, 2, 3] ┆ 28  │
└───────────┴───────────┴─────┘
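
For readers unfamiliar with the pattern, the explode + group_by approach can be sketched in plain Python. This is only an illustration of the logic, not of how Polars executes it: each list element becomes its own record tagged with the original row index, and the products are then summed per index.

```python
from collections import defaultdict

values = [[0], [0, 2], [0, 2, 4], [2, 4, 0], [4, 0, 8]]
weights = [[3], [2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]]

# "with_row_index + explode": one (index, value, weight) record per element.
exploded = [
    (i, v, w)
    for i, (vs, ws) in enumerate(zip(values, weights))
    for v, w in zip(vs, ws)
]

# "group_by('index').agg((values * weights).sum())": sum products per index.
dot_by_index = defaultdict(int)
for i, v, w in exploded:
    dot_by_index[i] += v * w

dots = [dot_by_index[i] for i in range(len(values))]

print(len(exploded))  # 12 records, one per list element
print(dots)           # [0, 6, 16, 10, 28]
```

Note that the exploded form holds one record per list element (12 here versus 5 original rows), each carrying a copy of its row index.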

jackaixin commented 1 month ago

@ruoyu0088 thanks for this. I tried another version with explode:

q2 = (
    df
    .lazy()
    .with_row_index()
    .select(
        'values',
        'weights',
        (pl.col('values').explode() * pl.col('weights').explode()).sum().over('index').alias('dot')
    )
)

q2.collect()

which returns the same result. On the small example above, performance is similar (my version is slightly faster). However, on a larger dataframe (~3m rows), your version (df.explode.group_by.agg) is 2x faster than mine (explode + over). Do you have any idea why that might be the case?

One more observation about your explode implementation: it seems to consume much more memory than the list.eval version, even though the explode.group_by version is faster than list.eval.