modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0

BUG: df[col].replace(dict, inplace=True) is brutally slow, while .apply which does the same is blazing fast #7377

Open Liquidmasl opened 2 months ago

Liquidmasl commented 2 months ago

Modin version checks

Reproducible Example

import modin.pandas as pd
import numpy as np

# create a somewhat large dataframe; Modin partitions it into ~25 parts

df = pd.DataFrame({
    'val': np.random.randint(1, 10001, size=20_000_000)
})

unique = df['val'].unique()
hash_map = {hash_: i for i, hash_ in enumerate(unique)}

# fast:
# df['val'] = df['val'].apply(lambda x: hash_map[x])

# slow:
df['val'].replace(hash_map, inplace=True)

Issue Description

The replace method seems unreasonably slow when given a dict with inplace=True, while an equivalent .apply on the same column is fast.

Expected Behavior

replace with a dict should be about as fast as the equivalent .apply call.
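
To make the comparison easier to reproduce, here is a minimal timing sketch. It is not part of the original report and rests on several assumptions: it uses plain pandas so it runs without a Modin backend (swapping the import for `import modin.pandas as pd` should exercise Modin instead), it uses far fewer rows and distinct values than the example above so it finishes quickly, and the `timed` helper and the extra `.map` variant are my own additions for illustration.

```python
import time

import numpy as np
import pandas as pd  # assumption: replace with `import modin.pandas as pd` to time Modin

# Deliberately small (the report uses 20_000_000 rows and 10_000 distinct values).
N_ROWS = 50_000
N_DISTINCT = 200

rng = np.random.default_rng(0)
df = pd.DataFrame({"val": rng.integers(1, N_DISTINCT + 1, size=N_ROWS)})

# Map every distinct value to a small integer id, as in the report.
unique = df["val"].unique()
hash_map = {value: i for i, value in enumerate(unique)}


def timed(label, func):
    """Hypothetical helper: run func once, print wall-clock time, return its result."""
    start = time.perf_counter()
    result = func()
    print(f"{label}: {time.perf_counter() - start:.3f}s")
    return result


via_apply = timed("apply", lambda: df["val"].apply(lambda x: hash_map[x]))
via_map = timed("map", lambda: df["val"].map(hash_map))
via_replace = timed("replace", lambda: df["val"].replace(hash_map))

# Every value in the column is a key of hash_map, so the three results should match.
same_result = via_apply.equals(via_map) and via_map.equals(via_replace)
print("all three agree:", same_result)
```

Timings will depend on the engine, the partitioning, and the number of dict keys, so the sketch prints them rather than assuming any particular ratio. If `.map(hash_map)` turns out to be as fast as `.apply`, it may be a simpler workaround whenever the dict covers every value in the column.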

Error Logs


Installed Versions

INSTALLED VERSIONS
------------------
commit : 52fca1ccaf8f4623688955f724f504a5e80c332c
python : 3.11.9.final.0
python-bits : 64
OS : Linux
OS-release : 6.8.0-40-generic
Version : #40~22.04.3-Ubuntu SMP PREEMPT_DYNAMIC Tue Jul 30 17:30:19 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

Modin dependencies
------------------
modin : 0.30.1
ray : 2.24.0
dask : 2024.6.0
distributed : 2024.6.0
hdk : None

pandas dependencies
-------------------
pandas : 2.2.2
numpy : 1.26.3
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 69.5.1
pip : 24.1.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.3
IPython : 8.25.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.2.0
gcsfs : None
matplotlib : 3.8.4
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 16.1.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.14.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None