pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Left join large memory usage regression #18106

Open CHDev93 opened 3 months ago

CHDev93 commented 3 months ago

Reproducible example

```python
# polars_join_bug_mwe.py

import time

import numpy as np
import polars as pl

#@profile
def get_index_with_time(idx_df: pl.DataFrame, data_df: pl.DataFrame) -> pl.DataFrame:
    time_coords_df = data_df.select(pl.col("time").unique().sort())
    return idx_df.join(time_coords_df, how="cross")

#@profile
def process_data(time_idx_df, data_df):
    tmp_df = time_idx_df.join(
        data_df,
        how="left",
        left_on=["sensor", "date", "time"],
        right_on=["_sensor", "_date", "time"],
        coalesce=True,
    )
    time.sleep(2)

    # _sensor, _date, time, _marker, 0, 1, 2, 3
    data_df = tmp_df.drop("sensor", "date", "time").fill_null(0)
    return data_df

#@profile
def get_data():

    sensor_df = pl.DataFrame({"sensor": np.arange(1000)})
    date_df = pl.DataFrame({"date": np.arange(250)})

    # sensor, date
    idx_df = sensor_df.join(date_df, how="cross")

    id_df = pl.DataFrame({"_sensor": np.arange(1000)})
    day_df = pl.DataFrame({"_date": np.arange(250)})
    times_df = pl.DataFrame({"time": np.arange(150)})

    # _sensor, _date, time, _marker, 0, 1, 2, 3
    data_df = (
        id_df.join(day_df, how="cross")
        .join(times_df, how="cross")
        .with_columns(
            pl.lit(0).alias("_marker"),
            pl.lit(0.0).cast(pl.Float32).alias("0"),
            pl.lit(1.0).cast(pl.Float32).alias("1"),
            pl.lit(2.0).cast(pl.Float32).alias("2"),
            pl.lit(3.0).cast(pl.Float32).alias("3"),
        )
    )

    return idx_df, data_df

#@profile
def main():
    start = time.perf_counter()
    idx_df, data_df = get_data()

    time_idx_df = get_index_with_time(idx_df, data_df)
    print(f"Augment index: {time.perf_counter() - start:.3f}s")

    data_df = process_data(time_idx_df, data_df)
    print(f"Process data: {time.perf_counter() - start:.3f}s")

    return data_df

def sleepy():
    time.sleep(5)

if __name__ == "__main__":
    df = main()

    # I get weird KeyboardInterrupts when running polars through memory-profiler without some sleeps
    sleepy()
```

I installed memory-profiler to measure peak memory usage while running the script.

Log output

```
join parallel: true
CROSS join dataframes finished
join parallel: true
CROSS join dataframes finished
join parallel: true
CROSS join dataframes finished
join parallel: true
CROSS join dataframes finished
Augment index: 1.441s
join parallel: true
LEFT join dataframes finished
Process data: 7.592s
```

Issue description

I'm doing a join of two tables on a compound key of [int, int, int]. In newer versions of polars (including polars==1.4.1) it uses much more memory than I'd expect. I confirmed this by rolling back to polars==0.19.19, which used significantly less memory.

Expected behavior

I'd expect a left join for this problem to use roughly 2x the space of the left table, as it did in 0.19.19. I ran the script with Python's memory-profiler package, using the following commands:

```
# Run the profiler:
mprof run --python -o polars_141_small.prof -M --include-children python polars_join_bug_mwe.py
# Plot the results:
mprof plot polars_141_small.prof --title polars_1_4_1 -o polars_141_small.png -w 0,12
```
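To make the "2x the space of the left table" expectation concrete, here is a back-of-envelope size estimate for the left table in the MWE (my own arithmetic, assuming 64-bit integer keys, which is the `np.arange` default on Linux):

```python
# Rough size estimate for the left table (time_idx_df) in the MWE:
# the cross join of 1000 sensors x 250 dates x 150 times.
n_sensors, n_dates, n_times = 1000, 250, 150
n_rows = n_sensors * n_dates * n_times  # 37,500,000 rows

# Three int64 key columns (sensor, date, time) at 8 bytes each.
bytes_per_row = 3 * 8
left_table_bytes = n_rows * bytes_per_row

print(f"rows: {n_rows:,}")                             # rows: 37,500,000
print(f"left table: {left_table_bytes / 1e9:.1f} GB")  # left table: 0.9 GB
```

So a join peaking at around 2x this (plus the payload columns) would match the 0.19.19 behavior the report describes.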

(two mprof memory-usage plots: polars 1.4.1 vs 0.19.19)

Installed versions

```
--------Version info---------
Polars:               1.4.1
Index type:           UInt32
Platform:             Linux-5.15.0-91-generic-x86_64-with-glibc2.35
Python:               3.10.12 (main, Mar 22 2024, 16:50:05) [GCC 11.4.0]
----Optional dependencies----
adbc_driver_manager:
cloudpickle:          3.0.0
connectorx:
deltalake:
fastexcel:
fsspec:               2024.2.0
gevent:
great_tables:
hvplot:
matplotlib:           3.8.3
nest_asyncio:         1.6.0
numpy:                1.23.5
openpyxl:
pandas:               1.5.3
pyarrow:              11.0.0
pydantic:             2.6.3
pyiceberg:
sqlalchemy:
torch:                2.1.0+cu118
xlsx2csv:
xlsxwriter:
```
CHDev93 commented 3 months ago

Other things of note

corwinjoy commented 3 days ago

OK, I can confirm that we see this regression with the latest dev version of polars. The latest version performs significantly worse in memory, though run time is comparable. I ran the following script with @profile enabled. I had to remove coalesce=True on line 17 of the script, since it is not supported in 0.19.19. @adamreeve
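One way to avoid hand-editing the script when switching between 0.19.19 and 1.x is to gate the `coalesce` argument on the installed version. A small sketch (the helper name and the major-version heuristic are my own, not part of the original script):

```python
def join_kwargs_for(polars_version: str) -> dict:
    """Extra DataFrame.join() kwargs to pass for this polars version.

    coalesce= is not accepted by join() in 0.19.19, so only pass it on
    newer releases. Checking the major version is a heuristic, not an
    exact feature probe.
    """
    major = int(polars_version.split(".", 1)[0])
    return {"coalesce": True} if major >= 1 else {}

print(join_kwargs_for("1.12.0"))   # {'coalesce': True}
print(join_kwargs_for("0.19.19"))  # {}
```

The MWE's join could then be written as `time_idx_df.join(data_df, how="left", ..., **join_kwargs_for(pl.__version__))` so the same file runs under both versions.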

```
# dev version: 1.12.0
source .venv/bin/activate
make build-dist-release
mprof run --python -o polars_1120_small.prof -M --include-children python polars_join_bug_mwe.py
# Augment index: 0.684s
# Process data: 5.261s
mprof plot polars_1120_small.prof --title polars_1_12_0 -o polars_1_12_0_small.png -w 0,12

# Compare with 0.19.19
pip uninstall polars
pip install --force-reinstall -v "polars==0.19.19"
mprof run --python -o polars_1919_small.prof -M --include-children python polars_join_bug_mwe.py
# Augment index: 0.859s
# Process data: 5.787s
mprof plot polars_1919_small.prof --title polars_0_19_19 -o polars_0_19_19_small.png -w 0,12
```
corwinjoy commented 3 days ago

(mprof plots: polars_0_19_19_small, polars_1_12_0_small)

ritchie46 commented 3 days ago

I cannot reproduce. First bump is 1.14, second bump is 0.19.19


Performance difference:

1.14: 4.899s
0.19.19: 5.804s
CHDev93 commented 2 days ago

@corwinjoy the plot for 0.19.19 shows process_data being essentially instant. Are you sure the same thing is being run in both cases? The graph for polars 1.12.0 looks far too slow compared with 0.19.19

corwinjoy commented 2 days ago

@CHDev93 The same script was run in both cases, but I agree that the process_data time from mprof looked odd. I think the sleep calls were throwing off the timings there. Re-running 0.19.19 with the sleep calls removed gives more reasonable timings. However, the memory issue remains, as discussed on Discord; it looks like @ritchie46 has some ideas there.

(mprof plot: polars_0_19_19_small)

The mprof command outputs (with sleep removed, using 0.19.19):

```
Augment index: 0.879s
Process data: 3.900s
```
ritchie46 commented 1 day ago

Yes, we now go into the row encoding for the group-by. The new algorithm is faster once the data no longer fits in your cache. You must increase the dataset size (depending on the beefiness of your machine) to see the effect. Latest Polars is 1.3-1.5x faster for me, but does indeed require more memory (~1.5x).
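To illustrate what row encoding means here (a toy sketch of the general technique, not Polars' actual implementation): the compound key columns are packed into a single byte string whose byte-wise lexicographic order matches the tuple order, so the engine can hash and compare one flat key instead of three separate columns per row:

```python
import struct

def encode_key(sensor: int, date: int, time: int) -> bytes:
    # Big-endian unsigned 32-bit fields: byte-wise lexicographic
    # comparison then agrees with numeric tuple comparison.
    return struct.pack(">III", sensor, date, time)

a = encode_key(1, 2, 3)
b = encode_key(1, 2, 4)
assert a < b              # same ordering as (1, 2, 3) < (1, 2, 4)
assert len(a) == 12       # fixed size: 3 fields x 4 bytes each
```

The extra memory comes from materializing these encoded keys alongside the original columns; a fixed-size encoding (as mentioned below) keeps that overhead predictable per row.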

The old code had to go, though. The memory requirements will improve with the new streaming engine, and with the fixed-size row encoding we plan to add as well.

#19929 will also improve this if we land it.

In any case, it isn't a bug but the cost of the new algorithm. We have to be able to remove old code branches when they hurt us, and sometimes that comes with a different memory footprint.