pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Left join large memory usage regression #18106

Open CHDev93 opened 1 month ago

CHDev93 commented 1 month ago


Reproducible example

```python
# polars_join_bug_mwe.py

import time

import numpy as np
import polars as pl


#@profile
def get_index_with_time(idx_df: pl.DataFrame, data_df: pl.DataFrame) -> pl.DataFrame:
    time_coords_df = data_df.select(pl.col("time").unique().sort())
    return idx_df.join(time_coords_df, how="cross")


#@profile
def process_data(time_idx_df: pl.DataFrame, data_df: pl.DataFrame) -> pl.DataFrame:
    tmp_df = time_idx_df.join(
        data_df,
        how="left",
        left_on=["sensor", "date", "time"],
        right_on=["_sensor", "_date", "time"],
        coalesce=True,
    )
    time.sleep(2)

    # _sensor, _date, time, _marker, 0, 1, 2, 3
    data_df = tmp_df.drop("sensor", "date", "time").fill_null(0)
    return data_df


#@profile
def get_data():
    sensor_df = pl.DataFrame({"sensor": np.arange(1000)})
    date_df = pl.DataFrame({"date": np.arange(250)})

    # sensor, date
    idx_df = sensor_df.join(date_df, how="cross")

    id_df = pl.DataFrame({"_sensor": np.arange(1000)})
    day_df = pl.DataFrame({"_date": np.arange(250)})
    times_df = pl.DataFrame({"time": np.arange(150)})

    # _sensor, _date, time, _marker, 0, 1, 2, 3
    data_df = (
        id_df.join(day_df, how="cross")
        .join(times_df, how="cross")
        .with_columns(
            pl.lit(0).alias("_marker"),
            pl.lit(0.0).cast(pl.Float32).alias("0"),
            pl.lit(1.0).cast(pl.Float32).alias("1"),
            pl.lit(2.0).cast(pl.Float32).alias("2"),
            pl.lit(3.0).cast(pl.Float32).alias("3"),
        )
    )

    return idx_df, data_df


#@profile
def main():
    start = time.perf_counter()
    idx_df, data_df = get_data()

    time_idx_df = get_index_with_time(idx_df, data_df)
    print(f"Augment index: {time.perf_counter() - start:.3f}s")

    data_df = process_data(time_idx_df, data_df)
    print(f"Process data: {time.perf_counter() - start:.3f}s")

    return data_df


def sleepy():
    time.sleep(5)


if __name__ == "__main__":
    df = main()

    # I get weird KeyboardInterrupts when running polars through memory-profiler without some sleeps
    sleepy()
```

I installed memory-profiler to generate the memory profiles shown below.

Log output

```
join parallel: true
CROSS join dataframes finished
join parallel: true
CROSS join dataframes finished
join parallel: true
CROSS join dataframes finished
join parallel: true
CROSS join dataframes finished
Augment index: 1.441s
join parallel: true
LEFT join dataframes finished
Process data: 7.592s
```

Issue description

I'm doing a join of two tables on a compound key of [int, int, int]. Newer versions of polars (including polars==1.4.1) use much more memory than I'd expect. I confirmed this by rolling back to polars==0.19.19, which uses significantly less memory for the same script.

Expected behavior

I'd expect a left join for this problem to use something like 2x the space of the left table, as it did in 0.19.19. I ran the script with python's memory-profiler package using the following commands:

```
# Running
mprof run --python -o polars_141_small.prof -M --include-children python polars_join_bug_mwe.py
# Plotting
mprof plot polars_141_small.prof --title polars_1_4_1 -o polars_141_small.png -w 0,12
```
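As a rough cross-check that doesn't depend on memory-profiler, the process's peak RSS can be read from the standard library after the join finishes (Unix only; on Linux `ru_maxrss` is in KiB, on macOS in bytes). A minimal sketch, with a plain numpy allocation standing in for the join:

```python
import resource

import numpy as np

# Allocate ~80 MB so the peak is clearly visible above the interpreter baseline.
buf = np.ones(10_000_000)  # 10M float64 ≈ 80 MB

# Peak resident set size of this process so far.
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS: {peak / 1024:.1f} MiB (Linux units)")
```

Printing this at the end of `main()` gives a single peak number per polars version without any profiler overhead.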

[mprof memory profile plots comparing the polars versions]

Installed versions

```
--------Version info---------
Polars:               1.4.1
Index type:           UInt32
Platform:             Linux-5.15.0-91-generic-x86_64-with-glibc2.35
Python:               3.10.12 (main, Mar 22 2024, 16:50:05) [GCC 11.4.0]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:          3.0.0
connectorx:
deltalake:
fastexcel:
fsspec:               2024.2.0
gevent:
great_tables:
hvplot:
matplotlib:           3.8.3
nest_asyncio:         1.6.0
numpy:                1.23.5
openpyxl:
pandas:               1.5.3
pyarrow:              11.0.0
pydantic:             2.6.3
pyiceberg:
sqlalchemy:
torch:                2.1.0+cu118
xlsx2csv:
xlsxwriter:
```
CHDev93 commented 1 month ago

Other things of note