pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Provide native and fast Series slice assignment (currently slower than Pandas) #17728

Closed Oreilles closed 1 week ago

Oreilles commented 3 months ago

Description

It sometimes happens that we want to update all values of a Series within an index range/slice. In Pandas, this is straightforward and fast using iloc[start:stop]. Polars, however, doesn't allow Series slice assignment, so we have to choose between several other options, all of which are unfortunately at least 10 times slower.

Here is some sample code that benchmarks the Pandas solution and the Polars alternatives:

import timeit

import pandas as pd
import polars as pl

def update_pandas_iloc(df, col, start, stop, value):
    df.iloc[start:stop, df.columns.get_loc(col)] = value
    return df

def update_polars_extend(df, col, start, stop, value):
    return df.with_columns(
        df[0:start, col]
        .extend(pl.repeat(value, stop - start, eager=True, dtype=df[col].dtype))
        .extend(df[stop : len(df), col])
        .alias(col),
    )

def update_polars_between(df, col, start, stop, value):
    return df.with_row_index().with_columns(
        pl.when(pl.col("index").is_between(start, stop, closed="left"))
        .then(value)
        .otherwise(pl.col(col))
        .alias(col),
    )

def update_polars_scatter(df, col, start, stop, value):
    return df.with_columns(df[col].scatter(range(start, stop), value))

def update_polars_update(df, col, start, stop, value):
    return df.with_row_index().update(
        pl.select(
            pl.int_range(start, stop, dtype=pl.UInt32).alias("index"),
            pl.lit(value).alias(col),
        ),
        on="index",
    )

col, start, stop, value = "foo", 300000, 600000, -1
pddf = pd.DataFrame({col: range(int(1e6))})
pldf = pl.DataFrame({col: range(int(1e6))})

results = pl.DataFrame((
    {
        "name": "update_pandas_iloc",
        "time": timeit.timeit(
            lambda: update_pandas_iloc(pddf, col, start, stop, value),
            number=10000,
        ),
    },
    {
        "name": "update_polars_extend",
        "time": timeit.timeit(
            lambda: update_polars_extend(pldf, col, start, stop, value),
            number=10000,
        ),
    },
    {
        "name": "update_polars_between",
        "time": timeit.timeit(
            lambda: update_polars_between(pldf, col, start, stop, value),
            number=10000,
        ),
    },
    {
        "name": "update_polars_scatter",
        "time": timeit.timeit(
            lambda: update_polars_scatter(pldf, col, start, stop, value),
            number=10000,
        ),
    },
    {
        "name": "update_polars_update",
        "time": timeit.timeit(
            lambda: update_polars_update(pldf, col, start, stop, value),
            number=10000,
        ),
    },
))

print(results)
# ┌───────────────────────┬───────────┐
# │ name                  ┆ time      │
# │ ---                   ┆ ---       │
# │ str                   ┆ f64       │
# ╞═══════════════════════╪═══════════╡
# │ update_pandas_iloc    ┆ 0.980486  │
# │ update_polars_extend  ┆ 9.222296  │
# │ update_polars_between ┆ 11.047009 │
# │ update_polars_scatter ┆ 19.065691 │
# │ update_polars_update  ┆ 67.748916 │
# └───────────────────────┴───────────┘

Proposition: allow a slice as an argument to Series.__setitem__, and have it use a faster implementation than the current scatter:

df[start:stop, col] = value
s-banach commented 3 months ago

The pandas code is probably modifying the column in-place, which is not comparable. The comparable code is probably something like this:

pd.options.mode.copy_on_write = True

def update_pandas_iloc(df, col, start, stop, value):
    df = df.copy(deep=False)
    df.iloc[start:stop, df.columns.get_loc(col)] = value
    return df
Oreilles commented 3 months ago

Updating a column by slice is the operation we want to execute, and the pandas example is the recommended way of doing it. The fact that there is no way to do that exact operation in Polars is the reason this issue was created - the alternative Polars code examples are only given to demonstrate the performance penalty that this missing feature incurs.

The rationale behind the feature request is that it is already possible to update a series in place in Polars this way:

df[idx, column] = value

So it's not that far-fetched to also be able to do:

df[start:end, column] = value
s-banach commented 3 months ago

I thought polars dataframes were immutable... Now I find out my whole life has been a lie

s-banach commented 3 months ago

I think even "in-place" modifying a single element in polars will actually copy the whole series:

import pandas as pd
import polars as pl

def inplace_modify(df):
    df[0, "x"] = 42

df = pl.DataFrame({"x": range(1_000_000)})
df_pd = df.to_pandas()
%%timeit
inplace_modify(df)
# 2.92 ms ± 60.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
inplace_modify(df_pd)
# 365 µs ± 3.29 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
cmdlineluser commented 3 months ago

I think it ends up calling Series.__setitem__ and dispatches to .scatter()

https://github.com/pola-rs/polars/blob/8928f7a5ae872c871e595569cfa0028f740d9fce/py-polars/polars/series/series.py#L1253

Oreilles commented 3 months ago

Alright, so I guess that means that Series are indeed immutable... so I can forget about using a "faster" implementation than scatter for this feature request.

However, since df[idx, column] = value delegates to scatter, would you be open to adding support for df[start:end, column] = value by delegating to scatter(range(start, stop), value)?

ritchie46 commented 3 months ago

On a micro scale, mutating in place is faster, but if you write idiomatic Polars it will likely not be in most cases.

If you write df.with_columns(pl.when(..).then(..).otherwise(..)), Polars will be able to parallelize multiple assignments, and actually do mutable assignment under the hood if it can prove that the value isn't used anywhere else.

In-place modification is a very procedural operation that forces single-threaded execution, gives the optimizer no information, and will likely incur a copy in pandas 3.0 as well, since pandas is moving to Arrow and immutable data.

Oreilles commented 3 months ago

I guess that issue can then be closed.

mcrumiller commented 2 months ago

If you write df.with_columns(pl.when(..).then(..).otherwise(..)), Polars will be able to parallelize multiple assignments, and actually do mutable assignment under the hood if it can prove that the value isn't used anywhere else.

@ritchie46 surely this is still slower than an in-place modification of a single Series? With 1 million assignment indexes, that means you have at least 1 million elements, 1 million when/then clauses, and potentially 1 million copies of the array.

ritchie46 commented 1 week ago

If you write df.with_columns(pl.when(..).then(..).otherwise(..)), Polars will be able to parallelize multiple assignments, and actually do mutable assignment under the hood if it can prove that the value isn't used anywhere else.

@ritchie46 surely this is still slower than an in-place modification of a single Series? With 1 million assignment indexes, that means you have at least 1 million elements, 1 million when/then clauses, and potentially 1 million copies of the array.

You should write a replace or a join, not a million when/then clauses.