gpleiss / efficient_densenet_pytorch

A memory-efficient implementation of DenseNets
MIT License

Multi-GPU model in pytorch0.3 consumes much more memory than pytorch0.1 version #31

Closed ZhengRui closed 6 years ago

ZhengRui commented 6 years ago

Just tried the new implementation in pytorch0.3, but it consumes much more memory than the old implementation. Some issues:

  1. When the model runs on a single GPU, it still allocates shared storage on all the GPUs. I think the for device_idx in range(torch.cuda.device_count()) loop in _SharedAllocation() needs some modification and optimization (see the sketch after this list).

  2. When the model runs on multiple GPUs, the batch size it can afford is much smaller than the single-GPU batch size times the number of GPUs. From my test it can only afford the same batch size as the single-GPU version.
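
Something along these lines is what I have in mind. This is just a rough sketch, not the actual _SharedAllocation code; the class name and the device_ids argument are made up for illustration, standing in for whichever devices the model is actually placed on:

import torch

class SharedAllocationSketch(object):
    """Rough sketch: allocate the shared buffer only on the devices this
    model instance will actually use, instead of looping over every GPU
    visible to the process with torch.cuda.device_count()."""

    def __init__(self, size, device_ids=None):
        if device_ids is None:
            # single-GPU case: only the current device
            device_ids = [torch.cuda.current_device()]
        self.storages = {}
        for device_idx in device_ids:  # rather than range(torch.cuda.device_count())
            with torch.cuda.device(device_idx):
                self.storages[device_idx] = torch.Storage(size).cuda()

    def resize_(self, size):
        # grow each per-device storage to at least `size` elements
        for storage in self.storages.values():
            if storage.size() < size:
                storage.resize_(size)
        return self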

gpleiss commented 6 years ago

@ZhengRui when I ran the demo script (latest version of master, PyTorch 0.3), it consumed the same amount of memory as the original implementation (within ~100MB). Do you notice the memory increase when you run the demo script? Can you give me specific steps to replicate?

1) "when the model runs on a single gpu, it still allocates shared storage on all the gpus" - if you don't want memory allocated on all GPUs, run the script with CUDA_VISIBLE_DEVICES=___ (see the snippet after this list).

2) "when the model runs on multi gpu, the batch size it can afford is much less than the batch size of single gpu times number of gpu" - again, can you provide specific numbers, and an experiment to reproduce? From my experiments, the multi-GPU efficient model had about the same overhead as a normal multi-GPU model.
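
For example, something like this restricts the process to a single GPU; the GPU index here is just an illustration, pick whichever device you want the process to see:

import os
# must be set before the first CUDA call, so torch never initializes the other GPUs
os.environ['CUDA_VISIBLE_DEVICES'] = '0'   # expose only GPU 0 to this process

import torch
print(torch.cuda.device_count())           # prints 1, so no shared storage lands on the hidden GPUs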

ZhengRui commented 6 years ago

@gpleiss I just ran some more tests. The memory usage is indeed similar in the single-GPU case. When I compare on my desktop with 4 GPUs for the multi-GPU case, the memory usage is quite different:

The old densenet201 model can support batch size 440 per GPU (even in training mode with backprop):

In [1]: import torch; from torch.autograd import Variable
   ...: from densenet_efficient_multi_gpu import DenseNetEfficientMulti
   ...: net = torch.nn.DataParallel(DenseNetEfficientMulti(growth_rate=32, block_config=[6,12,48,32],
   ...:                                                    num_init_features=64, num_classes=128, cifar=False)).cuda()
   ...: net.eval()
In [2]: o = net(Variable(torch.rand(1760,3,224,224), volatile=True).cuda())

memory usage:

Mon Mar 19 23:53:19 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:17:00.0 Off |                  N/A |
| 28%   58C    P8    19W / 250W |   7668MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:65:00.0 Off |                  N/A |
| 24%   56C    P8    19W / 250W |   6532MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:B5:00.0  On |                  N/A |
| 16%   56C    P2    79W / 250W |   7225MiB / 11170MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:B6:00.0 Off |                  N/A |
| 21%   55C    P8    16W / 250W |   6868MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     37915      C   python                                      7657MiB |
|    1     37915      C   python                                      6521MiB |
|    2      1145      G   /usr/lib/xorg/Xorg                           256MiB |
|    2      1991      G   compiz                                        59MiB |
|    2      2326      G   ...-token=C676692CF525BEB157863C635C1C3915    47MiB |
|    2     37915      C   python                                      6857MiB |
|    3     37915      C   python                                      6857MiB |
+-----------------------------------------------------------------------------+

The new densenet201 model cannot support that batch size (even for inference). Tested in both pytorch0.2 and pytorch0.3, with similar memory usage:

In [1]: import torch; from torch.autograd import Variable
   ...: from densenet_efficient_pth3 import DenseNetEfficient
   ...: net = torch.nn.DataParallel(DenseNetEfficient(growth_rate=32, block_config=[6,12,48,32],
   ...:                                               num_init_features=64, num_classes=128, small_inputs=False)).cuda()
   ...: net.eval()
In [2]: o = net(Variable(torch.rand(1700,3,224,224), volatile=True).cuda())

Mon Mar 19 23:59:33 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:17:00.0 Off |                  N/A |
| 37%   61C    P8    20W / 250W |  10686MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:65:00.0 Off |                  N/A |
| 35%   60C    P8    20W / 250W |   9912MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:B5:00.0  On |                  N/A |
| 14%   57C    P2    79W / 250W |  10482MiB / 11170MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:B6:00.0 Off |                  N/A |
| 31%   58C    P8    17W / 250W |  10238MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     39547      C   python                                     10675MiB |
|    1     39547      C   python                                      9901MiB |
|    2      1145      G   /usr/lib/xorg/Xorg                           178MiB |
|    2      2326      G   ...-token=C676692CF525BEB157863C635C1C3915    71MiB |
|    2     39547      C   python                                     10225MiB |
|    2     40103      G   compiz                                         3MiB |
|    3     39547      C   python                                     10227MiB |
+-----------------------------------------------------------------------------+

For the 2nd issue I had above, I think it was my own fault: I added some conditions in _SharedAllocation() to differentiate the single-GPU and multi-GPU cases, and that is probably what caused it.

gpleiss commented 6 years ago

Thanks @ZhengRui for the profiling! I'll look into this later this week.

gpleiss commented 6 years ago

This will be fixed with the PyTorch 0.4 updates in #35
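
For anyone curious about the mechanism: PyTorch 0.4 ships torch.utils.checkpoint, and the rough idea (the details in #35 may differ) is that activations inside a checkpointed function are freed after the forward pass and recomputed during backward, which is what keeps the memory footprint small. A minimal sketch:

import torch
import torch.nn.functional as F
import torch.utils.checkpoint as cp

def bottleneck(x):
    # stand-in for a DenseNet bottleneck; its intermediate activations are
    # not stored when the function is run through checkpoint()
    return F.relu(x * 2.0)

x = torch.randn(8, 64, requires_grad=True)
y = cp.checkpoint(bottleneck, x)   # forward pass without saving intermediates
y.sum().backward()                 # backward re-runs bottleneck to recompute them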

ZhengRui commented 6 years ago

great, will test it soon :grinning:

gpleiss commented 6 years ago

Closed by #35

ZhengRui commented 6 years ago

@gpleiss, can you share some memory comparison results between the new pytorch0.4 model and the old models? I found the pytorch0.4 model consumes way more memory even on a single GPU. Here is a simple test of densenet201 on a 1080Ti (11G): it can only afford a batch size around 100, while the old model can support a batch size around 600:

import torch
from densenet import DenseNet
net = DenseNet(growth_rate=32, block_config=[6,12,48,32], num_init_features=64, num_classes=128, small_inputs=False, efficient=True).cuda()
net.eval()
o = net(torch.rand(100,3,224,224).cuda())

ZhengRui commented 6 years ago

My bad, I should have read the 0.4 migration guide more carefully. In 0.4 volatile is deprecated and net.eval() alone does not stop autograd from keeping intermediate buffers, so the proper inference code should look like:

import torch
from densenet import DenseNet
net = DenseNet(growth_rate=32, block_config=[6,12,48,32], num_init_features=64, num_classes=128, small_inputs=False, efficient=True).cuda()
net.eval()
with torch.no_grad():
    o = net(torch.rand(600,3,224,224).cuda())

Now the batch size can also reach 600 with the pytorch0.4 model, and memory usage is more stable than before. Nice update.