Closed: MaureenZOU closed this issue 2 months ago.
Hi, do you mean GPU-0 takes up 730MB more memory than other GPUs?
Could you please provide more details about the issue? Then we may figure it out together :)
So it is the following case:
GPU0 memory: 2000M + 730M + 730M + 730M
GPU1 memory: 2000M
GPU2 memory: 2000M
GPU3 memory: 2000M
This looks like a communication issue, since GPU0 holds a constant amount of redundant memory.
Thanks for the reply!
Hi, by default, args.distributed_dataset_storage is true. This flag enables saving/loading all images on GPU-0 and then broadcasting them to the other GPUs, which causes extra GPU memory consumption on GPU-0. However, I did not expect it to be as large as 730MB * 3.
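For context, here is a minimal sketch (not the repo's actual implementation) of the rank-0 load-and-broadcast pattern that args.distributed_dataset_storage enables. It assumes an already-initialized NCCL process group; the image shape and the torch.randn stand-in loader are hypothetical. It illustrates why the staging buffers inflate only GPU-0's footprint.

```python
import torch
import torch.distributed as dist

def load_and_broadcast(local_rank: int, image_shape=(32, 3, 224, 224)):
    # Assumes dist.init_process_group(backend="nccl") has already been called
    # and that this process only touches cuda:{local_rank}.
    device = torch.device(f"cuda:{local_rank}")
    images = torch.empty(image_shape, device=device)
    if dist.get_rank() == 0:
        # Only rank 0 actually loads the data onto its GPU, so any staging
        # buffers it keeps alive show up exclusively on cuda:0.
        images = torch.randn(image_shape, device=device)  # stand-in for real image loading
    # Every rank calls broadcast with a same-shaped tensor on its own GPU;
    # the contents are copied from rank 0 into each rank's local buffer.
    dist.broadcast(images, src=0)
    return images
```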
@MaureenZOU We can schedule a meeting to figure out this issue together if you think it is important to you. My email is hz3496@nyu.edu.
Thanks so much for your helpful reply! I will check whether it is caused by args.distributed_dataset_storage. Since I am only using the standard dataset for now, this is not really a problem at the moment :) Great work BTW!
OK, thank you for your kind words. Let me know if you need any further discussion or help!
This turned out to be caused by a mysterious conda environment bug, which I haven't fixed yet: after torch.distributed init, those zombie processes appear on their own. It is not related to this great code base.
Has anyone run into the case where GPU0 shows several fixed-size memory allocations that the other GPUs do not? e.g. xueyanz(730M) xueyanz(730M) xueyanz(730M) xueyanz(730M) ...
Thanks!
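For anyone hitting the same symptom, a small diagnostic sketch like the one below can help tell apart tensors this process really allocated on cuda:0 (e.g. from the rank-0 dataset storage discussed above) from bare CUDA contexts opened on GPU-0 by other ranks, which occupy a fixed few hundred MB in nvidia-smi but never appear in PyTorch's caching-allocator counters. The function name and tag parameter are illustrative, not from the repo.

```python
import torch
import torch.distributed as dist

def report_per_device_memory(tag: str = ""):
    # Caching-allocator view of every visible GPU, from this process only.
    # CUDA context overhead (the fixed blocks like the 730M shown above in
    # nvidia-smi) is not tracked here, so if nvidia-smi lists this process on
    # GPU-0 while the numbers below stay at 0 MiB, the usage is context
    # overhead rather than tensors allocated by this rank.
    rank = dist.get_rank() if dist.is_initialized() else 0
    for i in range(torch.cuda.device_count()):
        allocated = torch.cuda.memory_allocated(i) / 2**20
        reserved = torch.cuda.memory_reserved(i) / 2**20
        print(f"[{tag}] rank {rank} cuda:{i}: "
              f"allocated={allocated:.0f} MiB reserved={reserved:.0f} MiB")
```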