pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Implement/Expose List/Array "builder" or "cast"er similar to `pa.LargeListArray.from_arrays` with expressions as inputs #17578

Open deanm0000 opened 4 months ago

deanm0000 commented 4 months ago

Description

This was inspired by this SO question

Here's a simpler example: imagine we have two list columns and we want to multiply them elementwise.

import polars as pl

df = pl.DataFrame({
    'a': [[1, 2, 3], [2, 3, 4], [5, 7]],
    'b': [[2, 3, 4], [7, 8, 9], [1, 2]]
})

One approach is

df.with_columns(
    c=(pl.col('a').explode() * pl.col('b').explode()).implode().over(pl.int_range(pl.len()))
)

but the over is really unnecessary. We can use pyarrow to do this instead

import pyarrow as pa

df.with_columns(
    c=pl.struct('a', 'b').map_batches(lambda s: pl.from_arrow(
        pa.LargeListArray.from_arrays(
            # reuse the offsets of 'a': the elementwise product has the same flat layout
            s.struct.field('a').to_arrow().offsets,
            s.struct.field('a').explode() * s.struct.field('b').explode()
        )
    ))
)
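For reference, on the small frame above both approaches give the same result (worked out by hand; the output list lengths follow column 'a'):

c = [[2, 6, 12], [14, 24, 36], [5, 14]]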

Performance

import numpy as np

# larger benchmark: ~100k rows aggregated into 8 groups of lists
df = pl.DataFrame([
    pl.Series('a', np.random.randint(0, 10, 100000), dtype=pl.Float64),
    pl.Series('b', np.random.randint(0, 10, 100000), dtype=pl.Float64),
    pl.Series('c', np.random.randint(0, 8, 100000), dtype=pl.Float64)
]).group_by('c').agg('a', 'b').drop('c')
%%timeit
df.with_columns(
    c=pl.struct('a', 'b').map_batches(lambda s: pl.from_arrow(
        pa.LargeListArray.from_arrays(
            s.struct.field('a').to_arrow().offsets,
            (s.struct.field('a').explode() * s.struct.field('b').explode()).to_arrow()
            # without the explicit to_arrow above it still works, but it's ~60x slower
        )
    ))
)
2.07 ms ± 312 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df.with_columns(
    c=(pl.col('a').explode() * pl.col('b').explode()).implode().over(pl.int_range(pl.len()))
)
4.4 ms ± 175 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So the pyarrow approach takes roughly half the time.
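As a quick sanity check (a minimal sketch on the benchmark df above), both approaches should produce the same column:

a = df.get_column('a')
b = df.get_column('b')

via_arrow = pl.from_arrow(
    pa.LargeListArray.from_arrays(
        a.to_arrow().offsets,
        (a.explode() * b.explode()).to_arrow()
    )
)
via_over = df.select(
    c=(pl.col('a').explode() * pl.col('b').explode()).implode().over(pl.int_range(pl.len()))
).get_column('c')

print(via_arrow.to_list() == via_over.to_list())  # expected: True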

cmdlineluser commented 4 months ago

Is the Array type intended for these cases? (although it requires the same shape)

df = pl.DataFrame({
    'a': [[1, 2, 3], [2, 3, 4], [5, 7, None]],
    'b': [[2, 3, 4], [7, 8, 9], [1, 2, None]]
}).cast(pl.Array(pl.Int64, 3))

df.with_columns(c=pl.col.a * pl.col.b)
shape: (3, 3)
┌───────────────┬───────────────┬───────────────┐
│ a             ┆ b             ┆ c             │
│ ---           ┆ ---           ┆ ---           │
│ array[i64, 3] ┆ array[i64, 3] ┆ array[i64, 3] │
╞═══════════════╪═══════════════╪═══════════════╡
│ [1, 2, 3]     ┆ [2, 3, 4]     ┆ [2, 6, 12]    │
│ [2, 3, 4]     ┆ [7, 8, 9]     ┆ [14, 24, 36]  │
│ [5, 7, null]  ┆ [1, 2, null]  ┆ [5, 14, null] │
└───────────────┴───────────────┴───────────────┘
deanm0000 commented 4 months ago

@cmdlineluser Sure, as long as the lists are the same size, that's better. I deliberately made the example lists different sizes to illustrate that there's still a need (see the sketch below).
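For illustration, a minimal sketch of why the fixed-width Array route doesn't cover the ragged case (the exact error type and message may vary by polars version):

import polars as pl

ragged = pl.DataFrame({'a': [[1, 2, 3], [5, 7]]})

try:
    # lists of unequal length can't be cast to a fixed-width Array
    ragged.cast(pl.Array(pl.Int64, 3))
except Exception as exc:
    print(type(exc).__name__, exc)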

On the syntax, I'm not sure if it should be a new method, or if it'd be part of cast or implode. Maybe something like

df = pl.DataFrame({'a': [1, 2, 3, 4]})
# to list
df.select(pl.col('a').cast(pl.List(pl.Int64, [0, 2, 4])))
# to array
df.select(pl.col('a').cast(pl.Array(pl.Int64, 2)))

or

df = pl.DataFrame({'a': [1, 2, 3, 4]})
# to list
df.select(pl.col('a').implode([0, 2, 4]))
# to array
df.select(pl.col('a').implode(2))

so if implode gets a list of offsets it becomes a List, and if it gets a scalar width it becomes an Array.

In typing it out, I'm leaning more towards the latter.
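Until something like that exists, the closest thing today is probably going through pyarrow directly; a rough sketch of what the proposed implode([0, 2, 4]) would do:

import polars as pl
import pyarrow as pa

df = pl.DataFrame({'a': [1, 2, 3, 4]})

# slice the flat values into lists at the given offsets
out = pl.from_arrow(
    pa.LargeListArray.from_arrays(
        pa.array([0, 2, 4], type=pa.int64()),
        df['a'].to_arrow()
    )
)
print(out.to_list())  # [[1, 2], [3, 4]]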

deanm0000 commented 4 months ago

Use case: https://github.com/pola-rs/polars/issues/14711

itamarst commented 4 months ago

Adding two Series with lists is implemented in #17823. Is that PR sufficient to close this issue too?

deanm0000 commented 4 months ago

@itamarst no, the point isn't addition, that was just an example.