pbontrager opened this issue 3 months ago
Thanks for pointing this out @pbontrager. I've also found this is an issue with `grammar_dataset`. HF datasets that run remote code may be more prevalent than we thought, so we should prioritize this in the near future. I can look into it.
It might be worth checking what setting `num_proc` in `load_dataset` would do (see the `load_dataset` documentation).
Description
When you load a dataset from HF that uses remote code, the `load_dataset` function prompts the user for permission to run that code. This prompt only happens the first time the user downloads the dataset, but it will cause a crash if the first time the user uses the dataset is in a distributed recipe. This likely happens because `load_dataset` is called on all distributed processes, but the user is only prompted for permission on rank 0. The user will give permission, but all the other ranks will hang waiting for a response, which causes torch.distributed to crash.
Reproduce
This can be reproduced with `cnn_dailymail_articles_dataset`, which runs remote code. First ensure that you don't have the dataset cached, e.g. by deleting the HF datasets cache (by default `~/.cache/huggingface/datasets`; this removes all cached datasets, but it's the only way to guarantee the cached files for this dataset are gone).
Then run any distributed recipe with this dataset; for example, the stripped-down script sketched below mimics what a recipe does at dataset initialization. The run should reach dataset initialization and then ask for permission to run remote code. Whichever response you provide will cause the recipe to crash, since the processes will be out of sync afterwards.
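For reference, here is a minimal, hedged stand-in for the distributed recipe (not torchtune code); it only mimics the dataset-initialization step. It assumes the HF source behind `cnn_dailymail_articles_dataset` is `ccdv/cnn_dailymail` with config `3.0.0` (substitute the actual source if it differs) and that the dataset is not yet cached.

```python
# repro_remote_code_hang.py
# Launch with: torchrun --nproc_per_node=2 repro_remote_code_hang.py
import torch.distributed as dist
from datasets import load_dataset

dist.init_process_group(backend="gloo")

# Every rank calls load_dataset, mirroring what the recipe does. With an
# uncached dataset that ships a loading script, rank 0 blocks on the
# "run remote code?" prompt while the other ranks never see an answer,
# so they hang here and the job eventually dies on a collective timeout.
ds = load_dataset("ccdv/cnn_dailymail", "3.0.0", split="train")

dist.barrier()
dist.destroy_process_group()
```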
Possible Solutions
- Set `trust_remote_code=True` for datasets we know and support, like `cnn_dailymail_articles_dataset`, but we risk making users vulnerable to changes on the HF Hub.
- Prompt the user for permission on rank 0 only, and then broadcast the answer via a torch tensor of size 1 to all the other ranks (see the sketch below).
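A minimal sketch of the second option, assuming torch.distributed has already been initialized by the recipe and a CPU-capable backend (e.g. gloo) for the broadcast; the helper name `load_with_remote_code_prompt` is made up for illustration:

```python
import torch
import torch.distributed as dist
from datasets import load_dataset


def load_with_remote_code_prompt(path: str, **load_kwargs):
    """Prompt for remote-code permission on rank 0 and broadcast the answer."""
    rank = dist.get_rank() if dist.is_initialized() else 0

    # Tensor of size 1 holding the user's answer; with an NCCL backend this
    # would need to live on the current CUDA device instead of CPU.
    answer = torch.zeros(1, dtype=torch.uint8)
    if rank == 0:
        reply = input(f"'{path}' wants to run remote code. Allow it? [y/N] ")
        answer[0] = int(reply.strip().lower() in ("y", "yes"))

    if dist.is_initialized():
        # Every rank blocks here instead of inside load_dataset, so they all
        # receive the same answer and stay in sync.
        dist.broadcast(answer, src=0)

    trust = bool(answer.item())
    if not trust:
        raise RuntimeError(f"Remote code execution declined for '{path}'.")

    # Passing trust_remote_code explicitly prevents the interactive prompt
    # from ever firing inside load_dataset on any rank.
    return load_dataset(path, trust_remote_code=trust, **load_kwargs)
```

If the user declines, every rank raises the same error, so the recipe fails cleanly instead of hanging on a mismatched collective.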