odissei-lifecourse / life-sequencing-dutch


Improve GPU training performance with DDP #74

Open f-hafner opened 1 month ago

f-hafner commented 1 month ago

Investigate ways to bring GPU utilization as close to 100% as possible and maximize model throughput. Focus on multi-GPU training on a single node.

Collecting some questions from me and @benczaja -- feel free to add/modify
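For reference, a minimal sketch of the kind of single-node multi-GPU setup this issue is about, assuming PyTorch Lightning as the training framework (the toy module and data are placeholders, not our actual training script):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import lightning.pytorch as pl  # 'import pytorch_lightning as pl' on older releases


class ToyModule(pl.LightningModule):
    """Tiny model, only here to make the snippet runnable end to end."""

    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(16, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == "__main__":
    data = TensorDataset(torch.randn(1024, 16), torch.randn(1024, 1))
    loader = DataLoader(data, batch_size=32, num_workers=4)  # multiple CPU workers per process

    trainer = pl.Trainer(
        accelerator="gpu",
        devices=2,        # all GPUs on the node
        strategy="ddp",   # one process per GPU, gradients synced across processes
        max_epochs=1,
    )
    trainer.fit(ToyModule(), loader)
```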

benczaja commented 1 month ago

We need to be careful when combining DDP with iterable datasets: https://github.com/huggingface/datasets/issues/3423

https://github.com/huggingface/datasets/issues/5360

https://github.com/huggingface/datasets/pull/5357
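A minimal sketch of the failure mode those issues describe and one possible fix: without sharding, every DDP process iterates over the same stream, so each example is seen once per GPU instead of once overall. This assumes a Hugging Face streaming dataset and a recent `datasets` release that provides `datasets.distributed.split_dataset_by_node`; the dataset name is a placeholder.

```python
import os

from datasets import load_dataset
from datasets.distributed import split_dataset_by_node  # available in recent `datasets` releases

# Placeholder dataset name; any streaming (iterable) dataset has the same problem.
stream = load_dataset("some_org/some_dataset", split="train", streaming=True)

# Shard by rank so each DDP process only sees its own slice of the stream.
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))
stream = split_dataset_by_node(stream, rank=rank, world_size=world_size)
```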

f-hafner commented 3 weeks ago

I also did some research, and I think we have a combination of two issues: using multiple workers (CPUs) to load the data, and using multiple GPUs to train the model. This is all specific to PyTorch Lightning (https://discuss.pytorch.org/t/using-iterabledataset-with-distributeddataparallel/92589).

(a) Using multiple workers for data loading

(b) Using multiple GPUs for training

That issue concerns 2 GPUs and 1 worker. The first answer from awaelchli mentions sharding the data, which I understand as distributing it across all GPU devices. The recommended way to do this is with trainer.global_rank.

I think this is what is implemented in our current code here:

https://github.com/odissei-lifecourse/life-sequencing-dutch/blob/bc4a57b2e5f57217e721ccb841521d8bb5af6d65/pop2vec/llm/src/new_code/load_data.py#L40C20-L40C56

although not with trainer.global_rank but with int(os.environ.get("WORLD_SIZE", 1)) and int(os.environ.get("LOCAL_RANK", 0)). The world size is the number of processes involved in training, i.e. the number of GPUs.
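For illustration, a sketch of that rank-based sharding pattern (the class name is hypothetical and this is not a copy of our load_data.py code):

```python
import os

from torch.utils.data import IterableDataset


class ShardedIterable(IterableDataset):
    """Illustrative sketch of sharding an iterable dataset across DDP processes."""

    def __init__(self, records):
        self.records = records
        # Number of DDP processes (= number of GPUs) and this process's index.
        self.world_size = int(os.environ.get("WORLD_SIZE", 1))
        self.rank = int(os.environ.get("LOCAL_RANK", 0))

    def __iter__(self):
        # Each process keeps every world_size-th record, offset by its own rank.
        for i, record in enumerate(self.records):
            if i % self.world_size == self.rank:
                yield record
```

Note that on a single node LOCAL_RANK coincides with the global rank; for multi-node training we would need RANK (or trainer.global_rank) instead.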

Further down in the thread they discuss exactly our problem: multiple GPUs, multiple workers in the data loader, and PyTorch Lightning. There is a link to an example notebook that I think could be very useful for us: https://colab.research.google.com/drive/1OFLZnX9y5QUFNONuvFsxOizq4M-tFvk-?usp=sharing
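One common pattern for the combined case (I have not verified that the notebook does exactly this) is to shard by both the DDP rank and the DataLoader worker id, so that no example is duplicated across processes or workers. A hedged sketch, with a hypothetical class name:

```python
import os

from torch.utils.data import IterableDataset, get_worker_info


class FullyShardedIterable(IterableDataset):
    """Sketch: shard across DDP ranks *and* DataLoader workers."""

    def __init__(self, records):
        self.records = records

    def __iter__(self):
        world_size = int(os.environ.get("WORLD_SIZE", 1))
        rank = int(os.environ.get("RANK", os.environ.get("LOCAL_RANK", 0)))

        worker = get_worker_info()  # None when num_workers=0
        num_workers = worker.num_workers if worker is not None else 1
        worker_id = worker.id if worker is not None else 0

        # Flatten (rank, worker) into a single shard index:
        # total shards = world_size * num_workers.
        num_shards = world_size * num_workers
        shard_id = rank * num_workers + worker_id

        for i, record in enumerate(self.records):
            if i % num_shards == shard_id:
                yield record
```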

In sum, I propose