Memory usage during multi GPU training

ferrophile commented 3 years ago

I noticed that when training with multiple GPUs, the spawned processes sometimes also occupy memory on the first GPU, so memory usage is not distributed evenly across the devices.

Below is an example of training with 7 GPUs. Every process occupies some memory on GPU 3, which ends up using about 10 GB in total, while this training run uses only about 3 GB on each of the other GPUs.

(Only the processes with "stylegan2-pytorch" in their name are related to this repository. I'm not sure whether the other programs running on the GPUs are related to this issue.)

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    3      8852      C   ...onda3/envs/stylegan2-pytorch/bin/python  3219MiB |
|    3      8853      C   ...onda3/envs/stylegan2-pytorch/bin/python  1205MiB |
|    3      8854      C   ...onda3/envs/stylegan2-pytorch/bin/python  1205MiB |
|    3      8855      C   ...onda3/envs/stylegan2-pytorch/bin/python  1205MiB |
|    3      8856      C   ...onda3/envs/stylegan2-pytorch/bin/python  1205MiB |
|    3      8857      C   ...onda3/envs/stylegan2-pytorch/bin/python  1205MiB |
|    3      8858      C   ...onda3/envs/stylegan2-pytorch/bin/python  1205MiB |
|    4      8853      C   ...onda3/envs/stylegan2-pytorch/bin/python  3255MiB |
|    5      8854      C   ...onda3/envs/stylegan2-pytorch/bin/python  3255MiB |
|    6      8855      C   ...onda3/envs/stylegan2-pytorch/bin/python  3255MiB |
|    6     35126      C   python                                      1191MiB |
|    6     41314      C   python                                      1191MiB |
|    7      8856      C   ...onda3/envs/stylegan2-pytorch/bin/python  3255MiB |
|    8      8857      C   ...onda3/envs/stylegan2-pytorch/bin/python  3255MiB |
|    8     36933      C   python                                      1191MiB |
|    8     41629      C   python                                      1191MiB |
|    9      8858      C   ...onda3/envs/stylegan2-pytorch/bin/python  3255MiB |
+-----------------------------------------------------------------------------+
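The 10 GB versus 3 GB figures quoted above can be reproduced by summing the rows of this table per GPU. The following is only a minimal sketch in Python: the helper name `memory_per_gpu`, the regular expression, and the idea of filtering by process name are illustrative assumptions, not part of stylegan2-pytorch or nvidia-smi.

```python
import re
from collections import defaultdict

# Process rows copied verbatim from the nvidia-smi table above.
TABLE = """
|    3      8852      C   ...onda3/envs/stylegan2-pytorch/bin/python  3219MiB |
|    3      8853      C   ...onda3/envs/stylegan2-pytorch/bin/python  1205MiB |
|    3      8854      C   ...onda3/envs/stylegan2-pytorch/bin/python  1205MiB |
|    3      8855      C   ...onda3/envs/stylegan2-pytorch/bin/python  1205MiB |
|    3      8856      C   ...onda3/envs/stylegan2-pytorch/bin/python  1205MiB |
|    3      8857      C   ...onda3/envs/stylegan2-pytorch/bin/python  1205MiB |
|    3      8858      C   ...onda3/envs/stylegan2-pytorch/bin/python  1205MiB |
|    4      8853      C   ...onda3/envs/stylegan2-pytorch/bin/python  3255MiB |
|    5      8854      C   ...onda3/envs/stylegan2-pytorch/bin/python  3255MiB |
|    6      8855      C   ...onda3/envs/stylegan2-pytorch/bin/python  3255MiB |
|    6     35126      C   python                                      1191MiB |
|    6     41314      C   python                                      1191MiB |
|    7      8856      C   ...onda3/envs/stylegan2-pytorch/bin/python  3255MiB |
|    8      8857      C   ...onda3/envs/stylegan2-pytorch/bin/python  3255MiB |
|    8     36933      C   python                                      1191MiB |
|    8     41629      C   python                                      1191MiB |
|    9      8858      C   ...onda3/envs/stylegan2-pytorch/bin/python  3255MiB |
"""

# Columns: GPU index, PID, type, process name, memory usage in MiB.
ROW_RE = re.compile(r"^\|\s+(\d+)\s+(\d+)\s+\S+\s+(\S+)\s+(\d+)MiB\s+\|$")


def memory_per_gpu(table, name_filter=""):
    """Sum memory usage (MiB) per GPU for processes whose name contains name_filter."""
    totals = defaultdict(int)
    for line in table.splitlines():
        match = ROW_RE.match(line.strip())
        if match and name_filter in match.group(3):
            totals[int(match.group(1))] += int(match.group(4))
    return dict(totals)


if __name__ == "__main__":
    # Restrict the sum to the training processes from this repository.
    for gpu, mib in sorted(memory_per_gpu(TABLE, "stylegan2-pytorch").items()):
        print(f"GPU {gpu}: {mib} MiB")
```

With the filter applied, GPU 3 sums to 10449 MiB (one 3219 MiB process plus six 1205 MiB entries belonging to the other ranks), while GPUs 4 to 9 each show a single 3255 MiB process. Calling `memory_per_gpu(TABLE)` without a filter also counts the unrelated `python` processes on GPUs 6 and 8.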

Is this normal? The program seems to run normally, but I have run into the memory limit of the first GPU while the other GPUs remain underused.

mlyarthur commented 3 years ago

Have you solved this problem? My situation is similar to yours.

ferrophile commented 3 years ago

Sorry, I haven't solved it. I switched to the following repository, which can train with only 1 GPU: https://github.com/lucidrains/lightweight-gan

Jiangshuyi0V0 commented 2 years ago

Same issue here.

lucidrains / stylegan2-pytorch

Memory usage during multi GPU training #214