Hi,
I am training on a system that has a time limit of 10 hours. Every time I restart pre-training from the last checkpoint, I get an OOM error, even though the previous run was running fine with the same configuration:
2022-07-23 15:15:37 | WARNING | fairseq.trainer | OOM: Ran out of memory with exception: CUDA out of memory. Tried to allocate 2.29 GiB (GPU 2; 31.75 GiB total capacity; 28.47 GiB already allocated; 1.09 GiB free; 29.63 GiB reserved in total by PyTorch)
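For context, one thing I can check after a restart is whether processes from the previous (killed) job are still holding GPU memory. A minimal diagnostic sketch (nvidia-smi is standard; the fuser line assumes the usual /dev/nvidia* device nodes exist):

```sh
# Show per-GPU memory usage and any processes still resident on the GPUs.
nvidia-smi

# List processes that still have the NVIDIA device files open (may need root).
fuser -v /dev/nvidia* 2>/dev/null
```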
As a hack, I reduce MAX_TOKENS by 512 every time and then it works. But now I've reached a point where I cannot reduce MAX_TOKENS any further, yet I still need to keep training the model.

I've also noticed that only one GPU goes OOM. From what I've read online, the cause may be Distributed Data Parallel, which loads all the data onto one GPU and then distributes the load to the rest of the GPUs, but I'm not sure how to deal with that.
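For reference, here is a minimal sketch of the trade-off I understand fairseq supports (this is not my actual pretrain.sh; DATA_DIR and the masked-LM task flags are placeholder assumptions): halving --max-tokens while doubling --update-freq should keep the effective tokens per update (max_tokens * update_freq * num_GPUs) roughly constant while lowering per-step memory.

```sh
# Hypothetical sketch, not my real pretrain.sh. The idea: trade a smaller
# per-step batch (--max-tokens) for more gradient accumulation
# (--update-freq), so tokens-per-update stays the same but peak memory drops.
DATA_DIR=data-bin/my_corpus          # placeholder
MAX_TOKENS=2048                      # e.g. half of a previous 4096
UPDATE_FREQ=2                        # doubled to compensate

fairseq-train "$DATA_DIR" \
    --task masked_lm --criterion masked_lm \
    --arch roberta_base \
    --max-tokens "$MAX_TOKENS" \
    --update-freq "$UPDATE_FREQ" \
    --save-dir checkpoints \
    --restore-file checkpoints/checkpoint_last.pt
```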
Resources:
Total GPUs: 8 × Tesla V100-SXM2-32GB (31.749 GB memory each)
My pretrain.sh is as follows:

Please suggest how to deal with this.