f-hafner opened this issue 1 month ago
Need to be careful when doing DDP with iterable datasets: https://github.com/huggingface/datasets/issues/3423
I also did some research and I think we have a combination of 2 issues: using multiple workers (CPUs) to load the data, and using multiple GPUs to train the model. And this is all specific to torch lightning (https://discuss.pytorch.org/t/using-iterabledataset-with-distributeddataparallel/92589).
That issue is about 2 GPUs and 1 worker. The first answer from awaelchli mentions sharding the data, which I understand as distributing the examples across all GPU devices. The recommended way to do this is with `trainer.global_rank`.

I think this is what is implemented in our current code here, although not with `trainer.global_rank` but with `int(os.environ.get("WORLD_SIZE", 1))` and `int(os.environ.get("LOCAL_RANK", 0))`. World size is the number of processes involved in training = N GPUs. A minimal sketch of how I read this is below.
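To make the idea concrete, here is a minimal sketch of the rank-based sharding as I understand it (class and variable names are mine, not our actual code): each process iterates over the full source but only keeps every world_size-th example, offset by its own rank.

```python
import os

from torch.utils.data import IterableDataset


class RankShardedDataset(IterableDataset):
    """Illustrative sketch: shard an iterable source across DDP processes."""

    def __init__(self, source):
        self.source = source  # any iterable of examples
        # Number of training processes (= number of GPUs) and this process's rank.
        # On a single node LOCAL_RANK equals the global rank; for multi-node
        # training we would want trainer.global_rank (or the RANK env var) instead.
        self.world_size = int(os.environ.get("WORLD_SIZE", 1))
        self.rank = int(os.environ.get("LOCAL_RANK", 0))

    def __iter__(self):
        for i, example in enumerate(self.source):
            # Round-robin assignment: each rank keeps a disjoint slice of the data.
            if i % self.world_size == self.rank:
                yield example
```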
Further down in the issue they discuss exactly our problem with multiple GPUs, multiple workers in the data loader, and torch lightning. There is a link to an example notebook that I think could be very useful for us: https://colab.research.google.com/drive/1OFLZnX9y5QUFNONuvFsxOizq4M-tFvk-?usp=sharing
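If I understand the discussion correctly, the extra complication with `num_workers > 0` is that each GPU process also spawns several DataLoader workers, each with its own copy of the dataset, so the data has to be split again per worker. A hedged sketch of that idea (names are mine, not taken from the notebook), combining the rank sharding above with `torch.utils.data.get_worker_info()`:

```python
import os

from torch.utils.data import IterableDataset, get_worker_info


class RankAndWorkerShardedDataset(IterableDataset):
    """Illustrative sketch: shard across DDP ranks *and* DataLoader workers."""

    def __init__(self, source):
        self.source = source  # any iterable of examples
        self.world_size = int(os.environ.get("WORLD_SIZE", 1))
        self.rank = int(os.environ.get("LOCAL_RANK", 0))

    def __iter__(self):
        worker = get_worker_info()  # None when num_workers == 0
        num_workers = worker.num_workers if worker is not None else 1
        worker_id = worker.id if worker is not None else 0
        # Total number of consumers = processes * workers per process;
        # each (rank, worker) pair gets a disjoint slice of the data.
        num_shards = self.world_size * num_workers
        shard_id = self.rank * num_workers + worker_id
        for i, example in enumerate(self.source):
            if i % num_shards == shard_id:
                yield example
```

If this is right, increasing `num_workers` should not duplicate examples across workers, which (as far as I can tell) is the failure mode the linked issues warn about.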
In sum, I propose to investigate ways to bring GPU utilization as close to 100% as possible and maximize model throughput, focusing on multi-GPU training on a single node.
Collecting some questions from me and @benczaja -- feel free to add/modify