huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

ds.map(f, num_proc=10) is slower than df.apply #7217

Open · lanlanlanlanlanlan365 opened this issue 1 month ago

lanlanlanlanlanlan365 commented 1 month ago

Describe the bug

    import pandas as pd
    from tqdm import tqdm
    from datasets import Dataset

    tqdm.pandas()  # enables .progress_apply

    # df is a pandas DataFrame with columns: song_id, song_name
    ds = Dataset.from_pandas(df)

    def has_cover(song_name):
        if song_name is None or pd.isna(song_name):
            return False
        return 'cover' in song_name.lower()

    df['has_cover'] = df.song_name.progress_apply(has_cover)
    ds = ds.map(lambda x: {'has_cover': has_cover(x['song_name'])}, num_proc=10)

time cost:

  1. df.apply: 100%|██████████| 12500592/12500592 [00:13<00:00, 959825.47it/s]
  2. ds.map: Map (num_proc=10):  31%  3899028/12500592 [00:28<00:38, 222532.89 examples/s]

Steps to reproduce the bug

    import pandas as pd
    from tqdm import tqdm
    from datasets import Dataset

    tqdm.pandas()  # enables .progress_apply

    # df is a pandas DataFrame with columns: song_id, song_name
    ds = Dataset.from_pandas(df)

    def has_cover(song_name):
        if song_name is None or pd.isna(song_name):
            return False
        return 'cover' in song_name.lower()

    df['has_cover'] = df.song_name.progress_apply(has_cover)
    ds = ds.map(lambda x: {'has_cover': has_cover(x['song_name'])}, num_proc=10)

Expected behavior

ds.map with num_proc=10 is expected to be roughly num_proc times faster than df.apply, not slower.

Environment info

pandas: 2.2.2
datasets: 2.19.1

lhoestq commented 1 month ago

Hi ! map() reads all the columns and writes the resulting dataset with all the columns as well, while df.column_name.apply only reads and writes a single column, entirely in memory. So this speed difference is actually expected.

Moreover, using multiprocessing on a dataset that lives in memory (from_pandas uses the same in-memory data as the pandas DataFrame, while load_dataset or from_generator load from disk) requires copying the data to each subprocess, which can also be slow. Data loaded from disk doesn't need to be copied, though, since the memory-mapped Arrow files act as a form of shared memory.
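As a rough sketch of that disk-backed path (the save_to_disk/load_from_disk round trip and the path name here are illustrative assumptions, not something suggested in this thread): saving the dataset once and reloading it makes it memory-mapped, so num_proc workers can read it without copying.

    from datasets import Dataset, load_from_disk

    ds = Dataset.from_pandas(df)          # in-memory: shares the DataFrame's data
    ds.save_to_disk("songs_dataset")      # hypothetical path; written as Arrow files
    ds = load_from_disk("songs_dataset")  # reloaded memory-mapped from disk

    # subprocesses now read the Arrow files directly instead of receiving a copy
    ds = ds.map(lambda x: {'has_cover': has_cover(x['song_name'])}, num_proc=10)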

However, you can make your map() call much faster by making it read and write only the column you need:

    from datasets import concatenate_datasets

    # outputs a dataset with a single has_cover column
    has_cover_ds = ds.map(
        lambda song_name: {'has_cover': has_cover(song_name)},
        input_columns=["song_name"],
        remove_columns=ds.column_names,
    )
    ds = concatenate_datasets([ds, has_cover_ds], axis=1)
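With axis=1, concatenate_datasets joins the two datasets column-wise (side by side), so both must have the same number of rows in the same order.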

And if your dataset is loaded from disk, you can pass num_proc=10 and get a nice speedup as well (no need to copy the data to subprocesses).
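As a further variation (batched mode is a standard datasets idiom, not something suggested in this thread), a sketch combining the single-column trick with batched=True, so the Python function runs once per batch of rows instead of once per row:

    from datasets import concatenate_datasets

    # batched=True: the lambda receives a list of song_name values per batch
    has_cover_ds = ds.map(
        lambda names: {'has_cover': [has_cover(n) for n in names]},
        input_columns=["song_name"],
        remove_columns=ds.column_names,
        batched=True,
        num_proc=10,  # avoids copies when ds is loaded from disk
    )
    ds = concatenate_datasets([ds, has_cover_ds], axis=1)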