sanjari-orb opened 3 weeks ago
@sanjari-orb sure! My only hesitation in doing this is that we've observed occasional hangs when using hf datasets and multiprocessing (https://github.com/huggingface/datasets/issues/6393), but should be fine, especially if we keep it single process by default. Would be happy to accept a PR adding the arg.
Actually, we ended up seeing the same problem of map() hanging while loading ICL evaluations with num_proc>1, and unfortunately it happens frequently enough. Do you have any insights into how this problem was solved in mosaicml?
Unfortunately I have never managed to fully root cause this issue (feel free to comment on the datasets issue, as I don't think they have been able to fix it either). However, I believe it has something to do with multiple processes processing the same data at the same time. As a result, in the main dataloader we have local rank 0 go first, so that all the other ranks are just reading data cached on disk. We could probably apply the same logic in the ICL classes.
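The "local rank 0 goes first" pattern described here could be sketched roughly as follows. This is a minimal illustration, not foundry's actual implementation: a marker file on the shared local disk stands in for the distributed barriers foundry uses, and load_fn stands in for the call to load_dataset().

```python
import time
from pathlib import Path

def load_with_local_rank_zero_first(local_rank, cache_dir, load_fn, timeout=600.0):
    """Have local rank 0 perform the cache-populating load first; the other
    local ranks wait for its signal and then read the data already cached
    on disk, so no two processes process the same data at the same time."""
    marker = Path(cache_dir) / ".rank0_load_complete"
    if local_rank == 0:
        result = load_fn()   # populates the on-disk datasets cache
        marker.touch()       # signal the other local ranks
        return result
    # Non-zero local ranks: wait for rank 0, then load from the warm cache.
    deadline = time.monotonic() + timeout
    while not marker.exists():
        if time.monotonic() > deadline:
            raise TimeoutError("local rank 0 did not finish the load in time")
        time.sleep(0.1)
    return load_fn()
```

In the real code the marker/cleanup step matters too (the links below cover the cleanup), otherwise a stale marker from a previous run would let non-zero ranks race ahead.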
Could you give me a pointer to where this is being handled?
Ah yeah sorry, meant to include the link. https://github.com/mosaicml/llm-foundry/blob/2196d073b296e28c5e84852954f53721cb1cc5e5/llmfoundry/data/finetuning/tasks.py#L831-L837 for the wait, and https://github.com/mosaicml/llm-foundry/blob/2196d073b296e28c5e84852954f53721cb1cc5e5/llmfoundry/data/finetuning/tasks.py#L945-L956 for the cleanup. I added some nicer utils for this to composer also (https://github.com/mosaicml/composer/pull/3396) but haven't updated foundry yet to use them.
We are already doing that here though right? https://github.com/mosaicml/llm-foundry/blob/2196d073b296e28c5e84852954f53721cb1cc5e5/llmfoundry/eval/datasets/in_context_learning_evaluation.py#L265-L268
Not quite. In the code I linked, we have rank 0 go first for the dataset load. In the code you linked, only rank 0 downloads the file, but then all ranks would call load_dataset at the same time.
Ah gotcha. Okay let me try this. Thanks!
🚀 Feature Request

Allow passing a num_proc/num_workers parameter to InContextLearningDataset so that dataset preparation can use more than one process.

Motivation

When loading bigger ICL eval datasets, it is desirable to pass num_proc>1 to the following map() call, which preps each example in the dataset: https://github.com/mosaicml/llm-foundry/blob/5571101a50804406ef0fe23e7ea6795b3c4a1bcb/llmfoundry/eval/datasets/in_context_learning_evaluation.py#L173-L181

Can we introduce a num_proc parameter in the InContextLearningDataset constructors so that the example preparation can run in parallel? This greatly increases the speed of loading larger datasets.