The pandas code is probably modifying the column in-place, which is not comparable. The comparable code is probably something like this:
import pandas as pd

pd.options.mode.copy_on_write = True

def update_pandas_iloc(df, col, start, stop, value):
    # Shallow copy: under copy-on-write the caller's frame is left untouched
    df = df.copy(deep=False)
    df.iloc[start:stop, df.columns.get_loc(col)] = value
    return df
Updating a column by slice is the operation that we want to execute, and the pandas example is the recommended way of doing it. The fact that there is no way to do that exact operation in Polars is the reason this issue was created. The alternative Polars code examples are only given as a way to demonstrate the performance penalty that this missing feature incurs.
The rationale behind the feature request is that it already is possible to update a series in place in Polars that way:
df[idx, column] = value
So it's not that far-fetched to also be able to do:
df[start:end, column] = value
I thought polars dataframes were immutable... Now I find out my whole life has been a lie
I think even "in-place" modifying a single element in polars will actually copy the whole series:
import pandas as pd
import polars as pl
def inplace_modify(df):
    df[0, "x"] = 42
df = pl.DataFrame({"x": range(1_000_000)})
df_pd = df.to_pandas()
%%timeit
inplace_modify(df)
# 2.92 ms ± 60.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
inplace_modify(df_pd)
# 365 µs ± 3.29 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
I think it ends up calling Series.__setitem__ and dispatches to .scatter()
Alright, so I guess that means that Series are indeed immutable... so I can forget about using a "faster" implementation than scatter for this feature request.
However, since df[idx, column] = value delegates to scatter, would you be open to adding support for df[start:end, column] = value by delegating to scatter(range(start, stop), value)?
On a micro scale, mutating in place is faster, but if you write idiomatic Polars it will likely not be in most cases.
If you write df.with_columns(pl.when(..).then(..).otherwise(..)), Polars will be able to parallelize multiple assignments, and actually do mutable assignment under the hood if it can prove that the value isn't used anywhere else.
The in place modify is a very procedural operation that forces single threaded operations, has no information for the optimizer and... Likely will incur a copy in pandas 3.0 as well as they are moving to arrow and immutable data.
I guess that issue can then be closed.
If you write df.with_columns(pl.when(..).then(..).otherwise(..)), Polars will be able to parallelize multiple assignments, and actually do mutable assignment under the hood if it can prove that the value isn't used anywhere else.
@ritchie46 surely this is still slower than an in-place modification of a single Series? With 1 million assignment indexes, that means you have at least 1 million elements, 1 million when/then clauses, and potentially 1 million copies of the array.
You should write a replace or a join, not a million when/then clauses.
Description
It sometimes happens that we want to update all values of a Series within an index range/slice. In Pandas, it is pretty straightforward and fast using iloc[start:stop]. Polars, however, doesn't allow series slice assignment, so we have to choose between several other options, which unfortunately are all at least 10 times slower. Here is some sample code that benchmarks the Pandas solution and the Polars alternatives:
Proposition: Allow slice as an argument to Series.__getitem__, and have it use a faster implementation than the current scatter