pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Concatenation of array columns, similar to concat_list #18090

Open adamreeve opened 2 months ago

adamreeve commented 2 months ago

Description

Polars allows concatenation of List typed columns with pl.concat_list. It would be useful to also allow concatenation of Array typed columns.

For example:

import polars as pl

df = pl.DataFrame([
    pl.Series('x', [[0, 1], None, [2, 3]], dtype=pl.Array(pl.Int64, 2)),
    pl.Series('y', [[4, 5, 6], [7, 8, 9], [10, 11, 12]], dtype=pl.Array(pl.Int64, 3)),
])

# pl.concat_array is the proposed function; it does not exist yet
df.with_columns(z=pl.concat_array('x', 'y'))

This should produce a new column equivalent to:

pl.Series('z', [[0, 1, 4, 5, 6], None, [2, 3, 10, 11, 12]], dtype=pl.Array(pl.Int64, 5))

m00ngoose commented 2 months ago

concat_list doesn't do what you think it does! It constructs a new list column where the entries of each list are the input exprs. I would like to do the same, but for arrays. E.g.:

df = pl.DataFrame(
    {
        'a': [1,2,3],
        'b': [4,5,6],
    }
)
df.select(
    pl.concat_list(pl.col('a'), pl.col('b')),
    # this should be just pl.concat_array(pl.col('a'), pl.col('b'))
    pl.Series(df.select('a', 'b').to_numpy(), dtype=pl.Array(pl.Int64, 2)),
)

adamreeve commented 2 months ago

concat_list does do what I think it does and also what you think it does :wink: https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.concat_list.html

concat_array should probably work similarly and allow creating an array from scalars or concatenating existing arrays, or using a mix of arrays and scalars.

m00ngoose commented 2 months ago

Ah fair! I only knew about Expr.list.concat for the other thing. I stand corrected.

I only care about one of those two cases, but as you say, it's probably best to have both if it's going to be named analogously. Thinking about it more, I think concat_[list|array] is a bad name for the "make a list|array" case, and they should be separate APIs. Out of scope though.

cmdlineluser commented 2 months ago

@m00ngoose There has been some discussion of that if it is of interest:

adamreeve commented 2 months ago

Based on the discussion linked above, it looks like we most likely want separate methods for array construction (pl.array) and array concatenation (pl.concat_arrays). That seems much cleaner to me than one function that does both. Further discussion about the method split and naming should probably stay in that issue, but I think it makes sense to keep this issue open for implementing the array methods.

corwinjoy commented 1 month ago

I have added a draft PR to discuss the design of a pl.array function in order to firm up what this would look like and how it should behave. @adamreeve