odissei-lifecourse / life-sequencing-dutch


Improve GPU training performance with DDP #74

Open f-hafner opened 1 month ago

f-hafner commented 1 month ago

Investigate ways to bring GPU utilization as close to 100% as possible and maximize model throughput. Focus on multi-GPU training on a single node.

Collecting some questions from me and @benczaja -- feel free to add/modify
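For reference, a minimal sketch of the kind of single-node multi-GPU setup this issue is about, assuming PyTorch Lightning as the training framework (the toy module and data are placeholders, not our actual training script):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import lightning.pytorch as pl  # 'import pytorch_lightning as pl' on older releases


class ToyModule(pl.LightningModule):
    """Tiny model, only here to make the snippet runnable end to end."""

    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(16, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == "__main__":
    data = TensorDataset(torch.randn(1024, 16), torch.randn(1024, 1))
    loader = DataLoader(data, batch_size=32, num_workers=4)  # multiple CPU workers per process

    trainer = pl.Trainer(
        accelerator="gpu",
        devices=2,        # all GPUs on the node
        strategy="ddp",   # one process per GPU, gradients synced across processes
        max_epochs=1,
    )
    trainer.fit(ToyModule(), loader)
```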

benczaja commented 1 month ago

We need to be careful when combining DDP with iterable datasets: https://github.com/huggingface/datasets/issues/3423

https://github.com/huggingface/datasets/issues/5360

https://github.com/huggingface/datasets/pull/5357
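A minimal sketch of the failure mode those issues describe and one possible fix: without sharding, every DDP process iterates over the same stream, so each example is seen once per GPU instead of once overall. This assumes a Hugging Face streaming dataset and a recent `datasets` release that provides `datasets.distributed.split_dataset_by_node`; the dataset name is a placeholder.

```python
import os

from datasets import load_dataset
from datasets.distributed import split_dataset_by_node  # available in recent `datasets` releases

# Placeholder dataset name; any streaming (iterable) dataset has the same problem.
stream = load_dataset("some_org/some_dataset", split="train", streaming=True)

# Shard by rank so each DDP process only sees its own slice of the stream.
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))
stream = split_dataset_by_node(stream, rank=rank, world_size=world_size)
```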

f-hafner commented 3 weeks ago

I also did some research, and I think we have a combination of two issues: using multiple workers (CPUs) to load the data, and using multiple GPUs to train the model. This is all specific to PyTorch Lightning (https://discuss.pytorch.org/t/using-iterabledataset-with-distributeddataparallel/92589).

(a) Using multiple workers for data loading

(b) Using multiple GPUs for training

That issue concerns 2 GPUs and 1 worker. The first answer from awaelchli mentions sharding the data, which I understand as distributing it across all GPU devices. The recommended way to do this is with trainer.global_rank.

I think this is what is implemented in our current code here:

https://github.com/odissei-lifecourse/life-sequencing-dutch/blob/bc4a57b2e5f57217e721ccb841521d8bb5af6d65/pop2vec/llm/src/new_code/load_data.py#L40C20-L40C56

although not with trainer.global_rank but with int(os.environ.get("WORLD_SIZE", 1)) and int(os.environ.get("LOCAL_RANK", 0)). The world size is the number of processes involved in training, i.e. the number of GPUs.
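For illustration, a sketch of that rank-based sharding pattern (the class name is hypothetical and this is not a copy of our load_data.py code):

```python
import os

from torch.utils.data import IterableDataset


class ShardedIterable(IterableDataset):
    """Illustrative sketch of sharding an iterable dataset across DDP processes."""

    def __init__(self, records):
        self.records = records
        # Number of DDP processes (= number of GPUs) and this process's index.
        self.world_size = int(os.environ.get("WORLD_SIZE", 1))
        self.rank = int(os.environ.get("LOCAL_RANK", 0))

    def __iter__(self):
        # Each process keeps every world_size-th record, offset by its own rank.
        for i, record in enumerate(self.records):
            if i % self.world_size == self.rank:
                yield record
```

Note that on a single node LOCAL_RANK coincides with the global rank; for multi-node training we would need RANK (or trainer.global_rank) instead.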

Further down in the thread they discuss exactly our problem: multiple GPUs, multiple workers in the data loader, and PyTorch Lightning. There is a link to an example notebook that I think could be very useful for us: https://colab.research.google.com/drive/1OFLZnX9y5QUFNONuvFsxOizq4M-tFvk-?usp=sharing
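One common pattern for the combined case (I have not verified that the notebook does exactly this) is to shard by both the DDP rank and the DataLoader worker id, so that no example is duplicated across processes or workers. A hedged sketch, with a hypothetical class name:

```python
import os

from torch.utils.data import IterableDataset, get_worker_info


class FullyShardedIterable(IterableDataset):
    """Sketch: shard across DDP ranks *and* DataLoader workers."""

    def __init__(self, records):
        self.records = records

    def __iter__(self):
        world_size = int(os.environ.get("WORLD_SIZE", 1))
        rank = int(os.environ.get("RANK", os.environ.get("LOCAL_RANK", 0)))

        worker = get_worker_info()  # None when num_workers=0
        num_workers = worker.num_workers if worker is not None else 1
        worker_id = worker.id if worker is not None else 0

        # Flatten (rank, worker) into a single shard index:
        # total shards = world_size * num_workers.
        num_shards = world_size * num_workers
        shard_id = rank * num_workers + worker_id

        for i, record in enumerate(self.records):
            if i % num_shards == shard_id:
                yield record
```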

In sum, I propose