pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

replace / replace_strict is slow for large mapping #17638

Open AtollRewe opened 3 months ago

AtollRewe commented 3 months ago


Reproducible example

import polars as pl
import time

df = pl.DataFrame(
    {
        "a": [1],
    }
)
# A large mapping: ~10 million entries.
replace_dict = {str(i): -i for i in range(1, 10_000_000)}

# 1. Pass the Python dict directly to replace_strict.
before = time.time()
x = df.select(pl.col("a").replace_strict(replace_dict, default=0))
print("Passing mapping directly: ", time.time() - before)

# 2. Pre-build the keys/values as Series and pass those instead.
k, v = pl.Series(replace_dict.keys()), pl.Series(replace_dict.values())
before = time.time()
x = df.select(pl.col("a").replace_strict(k, v, default=0))
print("Passing mapping as Series: ", time.time() - before)

# 3. Use a Python UDF with a plain dict lookup.
before = time.time()
x = df.select(pl.col("a").map_elements(lambda x: replace_dict.get(str(x), 0), return_dtype=pl.Int32))
print("Running a UDF: ", time.time() - before)

Log output

Passing mapping directly:  1.111785888671875
Passing mapping as Series:  0.28866100311279297
Running a UDF:  0.0005211830139160156

Issue description

When only a small number of data points need to be replaced using a large replacement mapping, the recommended replace (or replace_strict) becomes extremely slow: in the example above, about 2000x slower than using map_elements with a plain Python dict lookup.

Expected behavior

I'm not sure one could expect a naive call like the one in my example to always be fast. I assume there is notable overhead because the Python dict needs to be converted into some kind of Rust structure. But I feel there should be a way to define an efficient Polars-native mapping object (backed by a HashMap or something similar internally) that can be pre-computed once and then reused to perform fast replacements with replace, roughly along the lines of the sketch below.
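For illustration, a rough sketch of the kind of reusable mapping object I have in mind. The `PrecomputedMapping` class is hypothetical and not part of polars; with the current API the best it can do is pay the dict-to-Series conversion once and reuse the pre-built Series with replace_strict:

```python
import polars as pl

# Hypothetical helper (not part of polars): convert the Python dict to Series once,
# then reuse those Series for every subsequent replace_strict call.
class PrecomputedMapping:
    def __init__(self, mapping: dict, default=None):
        self.old = pl.Series(list(mapping.keys()))
        self.new = pl.Series(list(mapping.values()))
        self.default = default

    def apply(self, expr: pl.Expr) -> pl.Expr:
        # Only the lookup itself remains per call; no per-call dict conversion.
        return expr.replace_strict(self.old, self.new, default=self.default)


mapping = PrecomputedMapping({i: -i for i in range(1, 1_000)}, default=0)
df = pl.DataFrame({"a": [1]})
out = df.select(mapping.apply(pl.col("a")))
```

Even this only avoids re-converting the Python dict on every call; ideally whatever lookup structure is built on the Rust side could be cached and reused as well.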

At the very least, in my opinion, the documentation of replace should carry a warning like the one on map_elements, pointing out this use case and directing the user to a more suitable function.

Installed versions

```
--------Version info---------
Polars:               1.0.0
Index type:           UInt32
Platform:             macOS-14.2.1-arm64-arm-64bit
Python:               3.11.7 (v3.11.7:fa7a6f2303, Dec 4 2023, 15:22:56) [Clang 13.0.0 (clang-1300.0.29.30)]
----Optional dependencies----
adbc_driver_manager:
cloudpickle:          3.0.0
connectorx:
deltalake:
fastexcel:
fsspec:               2024.6.1
gevent:
great_tables:
hvplot:
matplotlib:           3.9.1
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:
pandas:               2.2.2
pyarrow:              15.0.2
pydantic:             2.8.2
pyiceberg:
sqlalchemy:           2.0.31
torch:
xlsx2csv:
xlsxwriter:
```
ritchie46 commented 3 months ago

This will always be slower than a single-element dict lookup. We have to go to Rust and do checks on that side.

However, there is still much to gain. Our implementation is rather naive and does some checks needlessly (e.g. keys coming from a dict already carry a uniqueness guarantee). Will look at this a bit later.

AtollRewe commented 3 months ago

I understand that a simple Python dict lookup will always be faster than involving Polars in some way. That's why I suggest adding a warning about this case to the documentation of replace and replace_strict, and/or offering some kind of pre-computed mapping (i.e. doing what you describe as "go to Rust and do checks" only once).

To elaborate on my use case: I first need to run some data processing on a large DataFrame once. After that, I need to run the same processing on single-row DataFrames many times. What I'm doing now as a workaround is branching, using replace for DataFrames with more than 10 rows and map_elements otherwise (roughly the sketch below), but that's just very hacky. So it'd be great if there were a uniform approach that can be used efficiently in both cases.
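For reference, this is roughly what that workaround looks like. The function name and the threshold of 10 rows are arbitrary, and the mapping's keys are assumed to match the column's dtype:

```python
import polars as pl

# Hypothetical wrapper around the branching workaround described above.
def replace_small_or_large(df: pl.DataFrame, col: str, mapping: dict, default=0) -> pl.DataFrame:
    if df.height > 10:
        # Larger frames: the vectorized path amortizes the cost of converting the dict.
        return df.select(pl.col(col).replace_strict(mapping, default=default))
    # Tiny frames: a plain Python dict lookup per element is far cheaper.
    return df.select(
        pl.col(col).map_elements(lambda x: mapping.get(x, default), return_dtype=pl.Int64)
    )
```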