@ZhengRui when I ran the demo script (latest version of master, pytorch 0.3), it consumes the same amount of memory as the original implementation (within ~100MB). Do you notice the memory increase when you run the demo script? Can you give me specific steps to replicate?
1) "when the model runs on a single gpu, it still allocates shared storage on all the gpus" - if you don't want memory allocated on all GPUs, run the script with CUDA_VISIBLE_DEVICES=___
2) "when the model runs on multi gpu, the batch size it can afford is much less than the batch size of single gpu times number of gpu" - again, can you provide specific numbers, and an experiment to reproduce? From my experiments, the multi-GPU efficient model had about the same overhead as a normal multi-GPU model.
@gpleiss I just ran some more tests. The memory difference is indeed small in the single-GPU case. But when I compare the two implementations on my desktop with 4 GPUs, the memory usage is quite different:
The old densenet201 model can support a batch size of 440 per GPU (even in training mode with backprop):
import torch
from torch.autograd import Variable
from densenet_efficient_multi_gpu import DenseNetEfficientMulti

net = torch.nn.DataParallel(DenseNetEfficientMulti(growth_rate=32, block_config=[6, 12, 48, 32],
                                                   num_init_features=64, num_classes=128,
                                                   cifar=False)).cuda()
net.eval()
o = net(Variable(torch.rand(1760, 3, 224, 224), volatile=True).cuda())
memory usage:
Mon Mar 19 23:53:19 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111 Driver Version: 384.111 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:17:00.0 Off | N/A |
| 28% 58C P8 19W / 250W | 7668MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:65:00.0 Off | N/A |
| 24% 56C P8 19W / 250W | 6532MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 108... Off | 00000000:B5:00.0 On | N/A |
| 16% 56C P2 79W / 250W | 7225MiB / 11170MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 108... Off | 00000000:B6:00.0 Off | N/A |
| 21% 55C P8 16W / 250W | 6868MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 37915 C python 7657MiB |
| 1 37915 C python 6521MiB |
| 2 1145 G /usr/lib/xorg/Xorg 256MiB |
| 2 1991 G compiz 59MiB |
| 2 2326 G ...-token=C676692CF525BEB157863C635C1C3915 47MiB |
| 2 37915 C python 6857MiB |
| 3 37915 C python 6857MiB |
+-----------------------------------------------------------------------------+
The new densenet201 model cannot support that batch size (even for inference); tested in both pytorch0.2 and pytorch0.3 with similar memory usage:
import torch
from torch.autograd import Variable
from densenet_efficient_pth3 import DenseNetEfficient

net = torch.nn.DataParallel(DenseNetEfficient(growth_rate=32, block_config=[6, 12, 48, 32],
                                              num_init_features=64, num_classes=128,
                                              small_inputs=False)).cuda()
net.eval()
o = net(Variable(torch.rand(1700, 3, 224, 224), volatile=True).cuda())
Mon Mar 19 23:59:33 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111 Driver Version: 384.111 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:17:00.0 Off | N/A |
| 37% 61C P8 20W / 250W | 10686MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:65:00.0 Off | N/A |
| 35% 60C P8 20W / 250W | 9912MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 108... Off | 00000000:B5:00.0 On | N/A |
| 14% 57C P2 79W / 250W | 10482MiB / 11170MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 108... Off | 00000000:B6:00.0 Off | N/A |
| 31% 58C P8 17W / 250W | 10238MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 39547 C python 10675MiB |
| 1 39547 C python 9901MiB |
| 2 1145 G /usr/lib/xorg/Xorg 178MiB |
| 2 2326 G ...-token=C676692CF525BEB157863C635C1C3915 71MiB |
| 2 39547 C python 10225MiB |
| 2 40103 G compiz 3MiB |
| 3 39547 C python 10227MiB |
+-----------------------------------------------------------------------------+
For the 2nd issue I had above, I think it is my own fault: I added some conditions in _SharedAllocation() to differentiate the single-GPU and multi-GPU cases, which might have caused it.
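Roughly, the kind of branch I added looked like the following (a simplified sketch, not the repository's actual code):
# hypothetical sketch: allocate the shared buffer only on an explicit list of devices
# instead of on every GPU that torch.cuda.device_count() reports
import torch

def make_shared_storage(num_bytes, device_ids=None):
    if device_ids is None:
        device_ids = [torch.cuda.current_device()]  # single-GPU case
    storages = {}
    for device_idx in device_ids:  # multi-GPU case passes DataParallel's device_ids
        with torch.cuda.device(device_idx):
            storages[device_idx] = torch.cuda.ByteTensor(num_bytes)
    return storages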
Thanks @ZhengRui for the profiling! I'll look into this later this week.
This will be fixed with the PyTorch 0.4 updates in #35
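For context, PyTorch 0.4 exposes torch.utils.checkpoint, which recomputes intermediate activations during the backward pass instead of storing them. A minimal sketch of that mechanism (illustrative only, not the exact code in #35):
import torch
import torch.utils.checkpoint as cp

def bottleneck(*inputs):
    # stand-in for the concat + BN + ReLU + conv block whose intermediates dominate memory
    return torch.relu(torch.cat(inputs, dim=1))

features = [torch.randn(8, 32, 56, 56, requires_grad=True) for _ in range(3)]
out = cp.checkpoint(bottleneck, *features)  # intermediates inside are recomputed on backward
out.sum().backward()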
great, will test it soon :grinning:
Closed by #35
@gpleiss, can you share some memory comparison results between the new pytorch0.4 model and the old models? I found the pytorch0.4 model consumes way more memory even on a single GPU. Here is a simple test of densenet201 on a 1080Ti (11G): it can only afford a batch size of around 100, while the old model can support a batch size of around 600:
import torch
from densenet import DenseNet
net = DenseNet(growth_rate=32, block_config=[6,12,48,32], num_init_features=64, num_classes=128, small_inputs=False, efficient=True).cuda()
net.eval()
o = net(torch.rand(100,3,224,224).cuda())
My bad, I should have read the 0.4 migration guide more carefully; the proper inference code should be like:
import torch
from densenet import DenseNet
net = DenseNet(growth_rate=32, block_config=[6,12,48,32], num_init_features=64, num_classes=128, small_inputs=False, efficient=True).cuda()
net.eval()
with torch.no_grad():
    o = net(torch.rand(600,3,224,224).cuda())
Now the batch size can also reach 600 with the pytorch0.4 model, and the memory usage is more stable than before. Nice update.
Just tried the new implementation in pytorch0.3, but it consumes much more memory than the old implementation. Some issues:
1) When the model runs on a single GPU, it still allocates shared storage on all the GPUs. I think the for device_idx in range(torch.cuda.device_count()) part in _SharedAllocation() requires some modification and optimization (a minimal reproduction of the effect is sketched after point 2).
2) When the model runs on multiple GPUs, the batch size it can afford is much less than the single-GPU batch size times the number of GPUs. From my test it can only afford the same batch size as the single-GPU version.
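A minimal reproduction of the pattern in point 1 (hypothetical code, not the library's, just to illustrate the effect): creating one buffer per visible device reserves memory and a CUDA context on every GPU, even if the model only ever runs on one of them.
import torch

shared_storages = []
for device_idx in range(torch.cuda.device_count()):  # loops over ALL visible GPUs
    with torch.cuda.device(device_idx):
        # each allocation pins ~4 MB plus the per-process CUDA context overhead on that GPU
        shared_storages.append(torch.cuda.FloatTensor(1024 * 1024))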