Open lanlanlanlanlanlan365 opened 1 month ago
Hi ! `map()` reads all the columns and writes the resulting dataset with all the columns as well, while `df.column_name.apply` only reads and writes one column, and does so in memory. So this speed difference is actually expected.
Moreover, using multiprocessing on a dataset that lives in memory (`from_pandas` uses the same in-memory data as the pandas DataFrame, while `load_dataset` or `from_generator` load from disk) requires copying the data to each subprocess, which can also be slow. Data loaded from disk doesn't need to be copied, though, since it works as a form of shared memory thanks to memory mapping.
However, you can make your `map()` call much faster by making it read and write only the column you want:
```python
from datasets import concatenate_datasets

# outputs a dataset with a single 'has_cover' column
has_cover_ds = ds.map(
    lambda song_name: {'has_cover': has_cover(song_name)},
    input_columns=["song_name"],
    remove_columns=ds.column_names,
)
# concatenate_datasets is a module-level function, not a Dataset method
ds = concatenate_datasets([ds, has_cover_ds], axis=1)
```
And if your dataset is loaded from disk, you can also pass `num_proc=10` for a nice speedup (no need to copy the data to subprocesses).
Describe the bug
```python
# pandas columns: song_id, song_name
ds = Dataset.from_pandas(df)

def has_cover(song_name):
    if song_name is None or pd.isna(song_name):
        return False
    return 'cover' in song_name.lower()

df['has_cover'] = df.song_name.progress_apply(has_cover)
ds = ds.map(lambda x: {'has_cover': has_cover(x['song_name'])}, num_proc=10)
```
time cost:
Steps to reproduce the bug
```python
# pandas columns: song_id, song_name
ds = Dataset.from_pandas(df)

def has_cover(song_name):
    if song_name is None or pd.isna(song_name):
        return False
    return 'cover' in song_name.lower()

df['has_cover'] = df.song_name.progress_apply(has_cover)
ds = ds.map(lambda x: {'has_cover': has_cover(x['song_name'])}, num_proc=10)
```
Expected behavior
`ds.map` with `num_proc=10` should be roughly `num_proc`× faster than `df.apply`
Environment info
pandas: 2.2.2
datasets: 2.19.1