Open ferrophile opened 3 years ago
Have you solved this problem? My situation is similar to yours.
Sorry, I haven't solved it. I switched to the following repository, which can train with just 1 GPU: https://github.com/lucidrains/lightweight-gan
same issue here
I noticed that when training with multiple GPUs, the spawned processes sometimes also occupy memory on the first GPU for some reason, i.e. memory usage is not distributed evenly.
Below is an example training run with 7 GPUs. All of the processes are occupying some memory on GPU3: GPU3 is using 10 GB of memory while the other GPUs are only using 3 GB each.
(Only the processes labeled "stylegan2-pytorch" are related to this repository. I'm not sure whether the other programs running on the GPUs are related to this issue.)
Is this normal? The program seems to run fine, but I have run into out-of-memory errors on the first GPU while the other GPUs remain underused.
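For anyone hitting the same thing: in my experience this pattern usually appears when each spawned worker touches CUDA (or loads a checkpoint) before it is pinned to its own GPU, so every rank allocates a CUDA context on the default/first visible device. I can't say for sure that this is what the repo's training script does, but here is a minimal sketch of the pattern that avoids it; `run_worker` and the `world_size` handling are hypothetical, not this repo's actual entry point:

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def run_worker(rank: int, world_size: int):
    # Pin this process to its own GPU *before* any other CUDA call,
    # otherwise the CUDA context (several hundred MB) lands on the default device.
    torch.cuda.set_device(rank)

    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:29500",
        rank=rank,
        world_size=world_size,
    )

    device = torch.device(f"cuda:{rank}")

    # When resuming, map the checkpoint onto this rank's device (or onto CPU first).
    # With the default map_location, torch.load restores tensors onto the GPU they
    # were saved from, which is typically GPU 0 -- another common source of the
    # lopsided memory usage described above.
    # state = torch.load("model.pt", map_location=device)  # hypothetical path

    # ... build the model on `device` and train as usual ...

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(run_worker, args=(world_size,), nprocs=world_size)
```

If the extra allocations still show up, another workaround is to launch one process per GPU with `CUDA_VISIBLE_DEVICES` restricted so each rank can only see its own device.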