modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0
9.91k stars, 653 forks

PERF-#7299: Avoid using `synchronize_labels` for `combine` function #7300

Status: Closed (anmyachev closed 5 months ago)

anmyachev commented 5 months ago

What do these changes do?

Perf gain: ~15% over the main branch (Ray engine, 8 cores) on the following benchmark:

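from time import time

import numpy as np

import modin.pandas as pd
from modin.utils import execute

# Small left frame (10k rows) and large right frame (1M rows), 100 integer columns each.
df1 = pd.DataFrame(np.random.randint(1_000_000, size=(10_000, 100)))
df2 = pd.DataFrame(np.random.randint(1_000_000, size=(1_000_000, 100)))

for _ in range(5):
    start = time()
    # Set the private `_deferred_index` flag on df2's internal frame before each run
    # (internal attribute, kept exactly as in the original benchmark).
    df2._query_compiler._modin_frame._deferred_index = True
    # Join on the column labeled 3.
    res = df1.merge(df2, on=3)
    # Wait for the computation to finish so the timing covers the whole merge.
    execute(res)
    print(f"merge time: {time() - start}")

YarShev commented 5 months ago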

This is great @anmyachev!

I think a good medium-term goal would be to support a fully lazy index object, adding it as a first-class citizen to the query compiler (and adding `index.py` to `modin/pandas`).
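
To make the "fully lazy index" idea concrete, here is a minimal sketch in plain Python. It is only an illustration of the concept: `LazyIndex`, `expensive_labels`, and the `length` hint are hypothetical names, and this is not Modin's `ModinIndex` or any existing Modin API. The point is that cheap metadata (such as the length) can be answered without computing the labels, and the labels are computed at most once.

```python
from typing import Callable, Optional, Sequence


class LazyIndex:
    """Hypothetical lazy index: stores a callable and materializes labels on first use."""

    def __init__(self, compute: Callable[[], Sequence], length: Optional[int] = None):
        self._compute = compute          # deferred computation of the labels
        self._length = length            # optional length hint, known up front
        self._labels: Optional[list] = None

    @property
    def is_materialized(self) -> bool:
        return self._labels is not None

    def get(self) -> list:
        # Run the deferred computation only once and cache the result.
        if self._labels is None:
            self._labels = list(self._compute())
            self._length = len(self._labels)
        return self._labels

    def __len__(self) -> int:
        # Answer from the hint when possible, without materializing the labels.
        if self._length is not None:
            return self._length
        return len(self.get())


calls = []


def expensive_labels():
    # Stand-in for a costly step, e.g. gathering labels from remote partitions.
    calls.append(1)
    return range(5)


idx = LazyIndex(expensive_labels, length=5)
n = len(idx)                          # uses the length hint only
materialized_before = idx.is_materialized
labels = idx.get()                    # triggers the computation
labels_again = idx.get()              # served from the cache
```

In this sketch, operations that only need the length never trigger the expensive step, which is the kind of saving a lazy index is meant to provide.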

LGTM

@dchigarev, @anmyachev, to what extent does our current ModinIndex perform this task?