huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Slow #0 when using map to tokenize. #2294

Open VerdureChen opened 3 years ago

VerdureChen commented 3 years ago

Hi, datasets is really amazing! I am following run_mlm_no_trainer.py to pre-train BERT, and it tokenizes with multiprocessing like this:

    tokenized_datasets = raw_datasets.map(
        tokenize_function,
        batched=True,
        num_proc=args.preprocessing_num_workers,
        remove_columns=column_names,
        load_from_cache_file=not args.overwrite_cache,
    )

However, I have found that when num_proc > 1, process #0 is much slower than the others. It looks like this:

[screenshot of the map progress bars]

It takes more than 12 hours for #0, while the others take just about half an hour. Could anyone tell me whether this is normal, and whether there is any way to speed it up?
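For reference, the tokenize_function I pass to map follows the script and is roughly like this (simplified; the checkpoint name and text column below are just placeholders, not the script's exact values):

    from transformers import AutoTokenizer

    # Placeholder checkpoint and column name; the script builds these from its arguments.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    text_column_name = "text"

    def tokenize_function(examples):
        # With batched=True, map passes a dict of lists (one batch of rows).
        return tokenizer(examples[text_column_name], return_special_tokens_mask=True)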

lhoestq commented 3 years ago

Hi ! Have you tried other values for preprocessing_num_workers ? Is it always process 0 that is slower ? There is no difference between process 0 and the others, except that it processes the first shard of the dataset.
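For context, when num_proc > 1, map splits the dataset into num_proc contiguous shards and hands one shard to each worker, so process 0 always gets the first rows. The effect is roughly the following sketch (a toy illustration, not the library's internal code):

    from datasets import Dataset

    # Toy dataset just to illustrate how rows are assigned to workers.
    ds = Dataset.from_dict({"text": [f"example {i}" for i in range(10)]})
    num_proc = 5

    # Roughly how map with num_proc=5 splits the work: contiguous shards,
    # with rank 0 always receiving the first rows of the dataset.
    shards = [ds.shard(num_shards=num_proc, index=rank, contiguous=True)
              for rank in range(num_proc)]
    print([len(shard) for shard in shards])  # [2, 2, 2, 2, 2]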

VerdureChen commented 3 years ago

Hi, I have found the reason for it. Before using the map function to tokenize the data, I concatenate Wikipedia and BookCorpus first, like this:

    from datasets import load_dataset, concatenate_datasets

    dataset1 = load_dataset(args.dataset_name1, args.dataset_config_name1, split="train")
    dataset1 = dataset1.remove_columns('title')
    if args.dataset_name2 is not None:
        dataset2 = load_dataset(args.dataset_name2, args.dataset_config_name2, split="train")
        # Both datasets must share the same schema to be concatenated.
        assert dataset1.features.type == dataset2.features.type, str(dataset1.features.type) + ';' + str(dataset2.features.type)
        datasets12 = concatenate_datasets([dataset1, dataset2], split='train')

When I use just one dataset, e.g. Wikipedia, the problem no longer seems to exist:

[screenshot of the map progress bars]

BookCorpus has more rows than Wikipedia; however, each batch of Wikipedia takes much more time to process than a batch of BookCorpus. When we first concatenate the two datasets and then run map on the concatenated dataset with e.g. num_proc=5, process #0 has to process all of the Wikipedia data, which is why #0 takes so much longer to finish the job.

The problem is caused by the different characteristics of the two datasets. One solution might be to run map on the two datasets separately first, then concatenate the tokenized and processed datasets before passing them to the DataLoader, as sketched below.
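A rough sketch of that idea, reusing dataset1 and dataset2 from the snippet above and a tokenize_function like the one shown earlier:

    from datasets import concatenate_datasets

    # Tokenize each dataset on its own so every worker's shard contains
    # only one kind of data, then concatenate the tokenized results.
    tokenized1 = dataset1.map(tokenize_function, batched=True, num_proc=5,
                              remove_columns=dataset1.column_names)
    tokenized2 = dataset2.map(tokenize_function, batched=True, num_proc=5,
                              remove_columns=dataset2.column_names)
    tokenized_datasets = concatenate_datasets([tokenized1, tokenized2])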

lhoestq commented 3 years ago

That makes sense ! You can indeed use map on both datasets separately and then concatenate. Another option is to concatenate, then shuffle, and then map.
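For example, something along these lines (a sketch reusing the same variables as above); shuffling before map spreads rows from both sources across all shards, so no single process ends up with only the slower-to-tokenize data:

    from datasets import concatenate_datasets

    combined = concatenate_datasets([dataset1, dataset2])
    # Shuffle so each contiguous shard handed to a worker mixes both sources.
    shuffled = combined.shuffle(seed=42)
    tokenized_datasets = shuffled.map(tokenize_function, batched=True, num_proc=5,
                                      remove_columns=shuffled.column_names)

Note that shuffle creates an indices mapping, so reads during map are no longer contiguous; if that slows things down, calling flatten_indices() on the shuffled dataset first materializes the new row order.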