pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Allow list.eval to reference named columns #7210

Open lucazanna opened 1 year ago

lucazanna commented 1 year ago

Problem description

I wish list.eval could reference named columns for easier filtering.

Here is an example:

import polars as pl

df = pl.DataFrame({
    'a': [0, 5, 15],
    'a_list': [[0, 5, 15]] * 3,
})

shape: (3, 2)
┌─────┬────────────┐
│ a   ┆ a_list     │
│ --- ┆ ---        │
│ i64 ┆ list[i64]  │
╞═════╪════════════╡
│ 0   ┆ [0, 5, 15] │
│ 5   ┆ [0, 5, 15] │
│ 15  ┆ [0, 5, 15] │
└─────┴────────────┘

# However, filtering the lists based on another column is not possible with list.eval

df.with_columns(
    list_higher_values = pl.col('a_list').list.eval(pl.element().filter(pl.element() > pl.col('a'))),
    list_all_values_except_current = pl.col('a_list').list.eval(pl.element().filter(pl.element() != pl.col('a')))
)

# this returns an error: ComputeError: named columns are not allowed in `list.eval`; consider using `element` or `col("")`

Is it possible to allow referencing of columns in list.eval ?

The other solution is to explode the dataframe, then group it back. However, that adds several extra lines of code.

What are your thoughts?

EDIT (marco): updating syntax

lucazanna commented 1 year ago

I read a Stack Overflow question where referencing other columns in list.eval would make for an easier syntax compared to group_by: https://stackoverflow.com/questions/76037097/polars-element-wise-list-operations-using-another-column

I imagine this might not give any performance increase, but it would be a nice 'quality of life' improvement.

dashdeckers commented 7 months ago

This would be a great addition!

Its absence is currently the only reason I have to convert my fairly large dataset to NumPy for one step of my data pipeline at work.

I have a List[f64] column in which I have to set the values at certain indices to missing (NaN), and these indices depend on some arithmetic involving the corresponding value from another column.

I've been going in circles trying to implement this within the polars API, but keep running into this roadblock. If I have overlooked some way to do this, please do let me know! Otherwise I would be overjoyed if this feature could be implemented someday!

MarcoGorelli commented 7 months ago

> I've been going in circles trying to implement this within the polars API, but keep running into this roadblock

For now, if you're feeling brave, you could try writing a plugin; it'll likely be easier than you think: https://marcogorelli.github.io/polars-plugins-tutorial/lists_in_lists_out/ . There's a "plugins" channel on the Polars Discord where people are happy to help: https://discord.gg/4UfP5cfBE7

dashdeckers commented 7 months ago

That is a fantastic tutorial. Count me inspired, thank you!