I don't have a setup with 8GB so I have to rely on nvidia-smi numbers. The first command uses 13.2GB on GPU0 and 6.5GB on GPU1; the second uses 11.2GB on GPU0 and 10.1GB on GPU1.
Thank you for the sanity check, @sgugger
It is very odd that we get such a discrepancy in memory allocation between the 2 GPUs under DP: 2x the GPU RAM on card 0.
But it explains why it works for me, since I have precisely 24GB + 8GB, so the discrepancy fits just right. It also left it unclear whether the problem was in DP or in DDP.
I will investigate.
With DP the gradients and optimizer states live on only one GPU, which I think is why we see the big difference. With DDP they are replicated on both.
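A minimal sketch of that (toy model and sizes of my own choosing, not the actual finetuning command): under nn.DataParallel the replicas only exist during the forward pass, while the reduced gradients and the optimizer states end up on GPU0, which torch.cuda.memory_allocated makes visible:

```python
# Toy illustration (assumed sizes, not the issue's real workload): with
# nn.DataParallel, gradients are reduced onto the source device (GPU0) and the
# optimizer states are allocated there too, so GPU0 ends up using far more memory.
import torch
import torch.nn as nn

model = nn.Linear(4096, 4096).cuda()      # parameters live on GPU0
dp_model = nn.DataParallel(model)         # per-forward replicas on all visible GPUs

x = torch.randn(8, 4096).cuda()
dp_model(x).sum().backward()              # gradients accumulate on GPU0 only

optimizer = torch.optim.AdamW(model.parameters())
optimizer.step()                          # optimizer states allocated on GPU0 only

for i in range(torch.cuda.device_count()):
    print(f"GPU{i}: {torch.cuda.memory_allocated(i) / 2**20:.1f} MiB allocated")
```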
Oh wow, that's a huge difference. Clearly DP wins here for those with lopsided setups like mine!
OK, so it's by design then. Closing this.
This is a bit of a problem for our memory-metrics reporting, since we only report GPU0. But most users will have symmetrical setups (cards of the same size), and GPU0 consumes the most memory under DP/DDP, so I guess it's OK.
Will have to think about how to extend the metrics for setups where it's critical to know each GPU's allocation, e.g. pipeline or model parallelism.
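Something like this hypothetical helper (not an existing transformers API, just a sketch) could report every visible GPU instead of only gpu0:

```python
# Hypothetical sketch: collect peak allocated memory for each visible GPU,
# which would also cover pipeline / model-parallel setups.
import torch

def per_gpu_peak_memory_mb():
    return {
        f"gpu{i}_peak_mem_mb": torch.cuda.max_memory_allocated(i) / 2**20
        for i in range(torch.cuda.device_count())
    }

print(per_gpu_peak_memory_mb())  # e.g. logged alongside the existing trainer metrics
```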
I think we may have created a memory regression somewhere recently.
I tried with pt-1.7 and pt-1.8 with the same results.
The memory limit on this setup is 8GB.
On transformers master, this takes about 5.5GB/gpu:
(no need to run for more than a few seconds, we are just trying to see that the job can start training)
Switching to DDP immediately OOMs:
Even if I reduce the batch size from 4 to 1, it still goes over 8GB.
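For reference, a minimal self-contained DDP sketch (toy model, assumed master address/port, not the actual finetuning command) that can be used to compare per-GPU allocations between the two modes:

```python
# Toy DDP illustration (assumed sizes and rendezvous settings): each rank keeps
# its own gradients and optimizer states on its GPU, unlike DP where they sit
# only on GPU0.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn

def run(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = nn.Linear(4096, 4096).cuda(rank)
    ddp_model = nn.parallel.DistributedDataParallel(model, device_ids=[rank])
    optimizer = torch.optim.AdamW(ddp_model.parameters())

    x = torch.randn(8, 4096, device=f"cuda:{rank}")
    ddp_model(x).sum().backward()   # gradients live on this rank's GPU
    optimizer.step()                # optimizer states live on this rank's GPU too

    print(f"rank {rank}: {torch.cuda.memory_allocated(rank) / 2**20:.1f} MiB allocated")
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2)
```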
@sgugger, could you please confirm if you're seeing the same?