huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

ds.map(f, num_proc=10) is slower than df.apply #7217

Open · lanlanlanlanlanlan365 opened this issue 1 month ago

lanlanlanlanlanlan365 commented 1 month ago

Describe the bug

    import pandas as pd
    from tqdm import tqdm
    from datasets import Dataset

    tqdm.pandas()  # enables .progress_apply

    # df is a pandas DataFrame with columns: song_id, song_name
    ds = Dataset.from_pandas(df)

    def has_cover(song_name):
        if song_name is None or pd.isna(song_name):
            return False
        return 'cover' in song_name.lower()

    df['has_cover'] = df.song_name.progress_apply(has_cover)
    ds = ds.map(lambda x: {'has_cover': has_cover(x['song_name'])}, num_proc=10)

time cost:

  1. df.apply: 100%|██████████| 12500592/12500592 [00:13<00:00, 959825.47it/s]
  2. ds.map: Map (num_proc=10):  31%  3899028/12500592 [00:28<00:38, 222532.89 examples/s]

Steps to reproduce the bug

    import pandas as pd
    from tqdm import tqdm
    from datasets import Dataset

    tqdm.pandas()  # enables .progress_apply

    # df is a pandas DataFrame with columns: song_id, song_name
    ds = Dataset.from_pandas(df)

    def has_cover(song_name):
        if song_name is None or pd.isna(song_name):
            return False
        return 'cover' in song_name.lower()

    df['has_cover'] = df.song_name.progress_apply(has_cover)
    ds = ds.map(lambda x: {'has_cover': has_cover(x['song_name'])}, num_proc=10)

Expected behavior

ds.map with num_proc=10 is expected to be roughly num_proc times faster than df.apply, not slower.

Environment info

pandas: 2.2.2
datasets: 2.19.1

lhoestq commented 1 month ago

Hi ! map() reads all the columns and writes the resulting dataset with all the columns as well, while df.column_name.apply only reads and writes a single column, entirely in memory. So this speed difference is actually expected.

Moreover, using multiprocessing on a dataset that lives in memory (from_pandas uses the same in-memory data as the pandas DataFrame, while load_dataset or from_generator load from disk) requires copying the data to each subprocess, which can also be slow. Data loaded from disk doesn't need to be copied, though, since the memory-mapped Arrow files act as a form of shared memory.
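As a rough sketch of that disk-backed path (the save_to_disk/load_from_disk round trip and the path name here are illustrative assumptions, not something suggested in this thread): saving the dataset once and reloading it makes it memory-mapped, so num_proc workers can read it without copying.

    from datasets import Dataset, load_from_disk

    ds = Dataset.from_pandas(df)          # in-memory: shares the DataFrame's data
    ds.save_to_disk("songs_dataset")      # hypothetical path; written as Arrow files
    ds = load_from_disk("songs_dataset")  # reloaded memory-mapped from disk

    # subprocesses now read the Arrow files directly instead of receiving a copy
    ds = ds.map(lambda x: {'has_cover': has_cover(x['song_name'])}, num_proc=10)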

However, you can make your map() call much faster by making it read and write only the column you need:

    from datasets import concatenate_datasets

    # outputs a dataset with a single has_cover column
    has_cover_ds = ds.map(
        lambda song_name: {'has_cover': has_cover(song_name)},
        input_columns=["song_name"],
        remove_columns=ds.column_names,
    )
    ds = concatenate_datasets([ds, has_cover_ds], axis=1)
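With axis=1, concatenate_datasets joins the two datasets column-wise (side by side), so both must have the same number of rows in the same order.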

And if your dataset is loaded from disk, you can pass num_proc=10 and get a nice speedup as well (no need to copy the data to subprocesses).
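As a further variation (batched mode is a standard datasets idiom, not something suggested in this thread), a sketch combining the single-column trick with batched=True, so the Python function runs once per batch of rows instead of once per row:

    from datasets import concatenate_datasets

    # batched=True: the lambda receives a list of song_name values per batch
    has_cover_ds = ds.map(
        lambda names: {'has_cover': [has_cover(n) for n in names]},
        input_columns=["song_name"],
        remove_columns=ds.column_names,
        batched=True,
        num_proc=10,  # avoids copies when ds is loaded from disk
    )
    ds = concatenate_datasets([ds, has_cover_ds], axis=1)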