mosaicml / llm-foundry

LLM training code for Databricks foundation models
https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm

Allow multiprocessing when preparing ICL dataset #1276

Open · sanjari-orb opened this issue 3 weeks ago

sanjari-orb commented 3 weeks ago

🚀 Feature Request

Allow passing a num_proc/num_workers parameter in InContextLearningDataset so that dataset preparation can use more than one process.

Motivation

When loading larger ICL eval datasets, it is desirable to pass num_proc>1 to the following map call, which preps each example in the dataset:

https://github.com/mosaicml/llm-foundry/blob/5571101a50804406ef0fe23e7ea6795b3c4a1bcb/llmfoundry/eval/datasets/in_context_learning_evaluation.py#L173-L181

Can we introduce a num_proc parameter in the InContextLearningDataset constructors so that the example preparation can instead be done like this:

        self.dataset: HFDataset = self.dataset.map(
            self._prep_example,
            with_indices=True,
            num_proc=num_proc,
            fn_kwargs={
                'num_fewshot': num_fewshot,
                'prompt_string': prompt_string,
                'fewshot_rng': fewshot_rng,
            },
        )

This greatly increases the speed of loading larger datasets.
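For reference, here is a minimal end-to-end sketch of the requested behaviour, with a single process kept as the default (the prep function and helper names below are stand-ins for illustration, not actual llm-foundry code):

    # Sketch only (names below are stand-ins, not llm-foundry code): thread a
    # num_proc argument through to datasets.Dataset.map, keeping one process
    # as the default.
    from datasets import Dataset

    def _prep_example(example, idx, prompt_string):
        # Stand-in for InContextLearningDataset._prep_example.
        example['query'] = f"{prompt_string}{example['text']}"
        return example

    def prepare(dataset: Dataset, num_proc: int = 1) -> Dataset:
        return dataset.map(
            _prep_example,
            with_indices=True,
            num_proc=num_proc,  # >1 enables HF datasets multiprocessing
            fn_kwargs={'prompt_string': 'Q: '},
        )

    if __name__ == '__main__':
        ds = Dataset.from_dict({'text': ['a', 'b', 'c', 'd']})
        print(prepare(ds, num_proc=2)[0])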

dakinggg commented 2 weeks ago

@sanjari-orb sure! My only hesitation in doing this is that we've observed occasional hangs when using hf datasets and multiprocessing (https://github.com/huggingface/datasets/issues/6393), but it should be fine, especially if we keep it single-process by default. Would be happy to accept a PR adding the arg.

sanjari-orb commented 2 weeks ago

Actually, we ended up seeing the same problem of the map() hanging while loading ICL evaluations with num_proc>1, and unfortunately it happens frequently enough to be a problem. Do you have any insights into how this problem was solved at MosaicML?

dakinggg commented 2 weeks ago

Unfortunately I have never managed to fully root cause this issue (feel free to comment on the datasets issue, as I don't think they have been able to fix it either). However, I believe it has something to do with multiple processes processing the same data at the same time. As a result, in the main dataloader we have local rank 0 go first, so that all the other ranks are just reading data cached on disk. We could probably apply the same logic in the ICL classes.
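To illustrate the idea, the pattern could be wrapped in a small helper roughly like the following (a sketch only, using composer's dist.get_local_rank and dist.barrier; the helper name is hypothetical, and the links in the next comment show how the finetuning dataloader actually does it):

    import contextlib

    from composer.utils import dist

    @contextlib.contextmanager
    def local_rank_zero_first():
        # Hypothetical helper (not an existing llm-foundry/composer API): local
        # rank 0 runs the body first; the other local ranks wait at a barrier and
        # then run it themselves, at which point they only read the HF datasets
        # cache that rank 0 already wrote to disk.
        if dist.get_local_rank() != 0:
            dist.barrier()  # wait until local rank 0 has finished the body
        yield
        if dist.get_local_rank() == 0:
            dist.barrier()  # release the waiting ranks once the cache is warm

    # Usage sketch inside the ICL dataset preparation:
    # with local_rank_zero_first():
    #     dataset = dataset.map(prep_example, num_proc=num_proc, with_indices=True)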

sanjari-orb commented 2 weeks ago

Could you give me a pointer to where this is being handled?

dakinggg commented 2 weeks ago

Ah yeah sorry, meant to include the link. https://github.com/mosaicml/llm-foundry/blob/2196d073b296e28c5e84852954f53721cb1cc5e5/llmfoundry/data/finetuning/tasks.py#L831-L837 for the wait, and https://github.com/mosaicml/llm-foundry/blob/2196d073b296e28c5e84852954f53721cb1cc5e5/llmfoundry/data/finetuning/tasks.py#L945-L956 for the cleanup. I added some nicer utils for this to composer also (https://github.com/mosaicml/composer/pull/3396) but haven't updated foundry yet to use them.

sanjari-orb commented 2 weeks ago

We are already doing that here though right? https://github.com/mosaicml/llm-foundry/blob/2196d073b296e28c5e84852954f53721cb1cc5e5/llmfoundry/eval/datasets/in_context_learning_evaluation.py#L265-L268

dakinggg commented 2 weeks ago

Not quite. In the code I linked, we have rank 0 go first for the dataset load. In the code you linked, only rank 0 downloads the file, but then all ranks call load_dataset at the same time.
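Concretely, applying that to the ICL path would mean gating load_dataset itself (and the subsequent map) behind the local-rank-zero-first pattern, roughly like this sketch (helper name and arguments are made up for illustration, assuming composer's dist utilities):

    from composer.utils import dist
    from datasets import load_dataset

    def _identity(example):
        # Stand-in for the real per-example preparation.
        return example

    def load_icl_dataset_rank_zero_first(data_files: str, split: str, num_proc: int = 1):
        # Hypothetical helper: non-zero local ranks block until local rank 0 has
        # finished both load_dataset and map, then repeat the calls, which by then
        # only read the HF datasets cache on disk.
        if dist.get_local_rank() != 0:
            dist.barrier()
        dataset = load_dataset('json', data_files=data_files, split=split)
        dataset = dataset.map(_identity, num_proc=num_proc)
        if dist.get_local_rank() == 0:
            dist.barrier()
        return dataset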

sanjari-orb commented 2 weeks ago

Ah gotcha. Okay let me try this. Thanks!