liamcli / darts

Apache License 2.0

pytorch 1.1 OOM #1

Closed VincentChong123 closed 5 years ago

VincentChong123 commented 5 years ago

Hi @liamcli,

Thanks for sharing your work.

NOTE: PyTorch 0.4 is not supported at this moment and would lead to OOM.

Does the OOM mentioned above refer to the error below?

darts/cnn/test.py:86: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
  target = Variable(target, volatile=True).cuda(async=True)
06/17 12:18:42 PM test 000 1.233735e-01 96.875000 100.000000
...
  File "/opt/venv/usr-python/python3.6/tf-nightly-gpu/lib/python3.6/site-packages/torch/nn/functional.py", line 1697, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA out of memory. Tried to allocate 14.00 MiB (GPU 0; 10.73 GiB total capacity; 9.37 GiB already allocated; 6.56 MiB free; 533.84 MiB cached)

My system uses PyTorch 1.1, CUDA 10.0, and cuDNN 7.x.

Thank you.

VincentChong123 commented 5 years ago

Hi @liamcli ,

For your cifar10_model.pt, using PyTorch 1.1 on a GTX 2080 Ti, I managed to run cnn/test.py on CIFAR-10 with batch size 56.

Only a single GPU is required. NOTE: PyTorch 0.4 is not supported at this moment and would lead to OOM.

For training, is it possible to use torch.nn.parallel to avoid the OOM?
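For reference, a minimal sketch of the multi-GPU idea being asked about (this is not code from the darts repo; the model here is a hypothetical stand-in). torch.nn.DataParallel splits each batch across the visible GPUs, so the per-GPU activation memory shrinks roughly by the number of devices:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in model, not darts' actual network.
model = nn.Linear(8, 2)

# Wrap in DataParallel only when more than one GPU is visible;
# each replica then processes a slice of the batch.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model).cuda()

# A batch of 4 is scattered across the replicas (or run as-is on CPU).
out = model(torch.randn(4, 8))
```

Whether this helps depends on where the allocation fails: DataParallel reduces per-GPU activation memory, but each GPU still holds a full copy of the model parameters.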

VincentChong123 commented 5 years ago

Only a single GPU is required. NOTE: PyTorch 0.4 is not supported at this moment and would lead to OOM.

Do you mean the OOM below? I wonder whether PyTorch 0.4 or multiple GPUs could resolve it.

With PyTorch 1.1, CUDA 10.0, and cuDNN 7.x, running train.py --auxiliary --cutout (default batch size 96 for CIFAR-10) at ~80% GPU memory utilization:

06/17 01:31:31 PM param size = 3.349342MB
06/17 01:31:31 PM Model total parameters: 3825768

logging.info('train %03d %e %f %f', step, objs.avg, top1.avg, top5.avg)

06/17 01:31:33 PM train 000 3.308731e+00 9.375000 40.625000
06/17 01:31:49 PM train 050 3.192118e+00 12.357026 57.107841
...
06/17 01:34:20 PM train 500 2.565969e+00 31.686626 82.054638
06/17 01:34:26 PM train_acc 32.043999

  File "/opt/venv/usr-python/python3.6/tf-nightly-gpu/lib/python3.6/site-packages/torch/nn/functional.py", line 1697, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA out of memory. Tried to allocate 14.00 MiB (GPU 0; 10.73 GiB total capacity; 9.35 GiB already allocated; 12.56 MiB free; 554.67 MiB cached)

BTW, thanks for sharing the great talk! https://slideslive.com/38916590/random-search-and-reproducibility-for-neural-architecture-search?locale=cs

liamcli commented 5 years ago

This is a fork of the code available at https://github.com/quark0/darts

If the OOM issue happens during evaluation (inference only), you can look into using torch.no_grad() instead of the volatile argument.

VincentChong123 commented 5 years ago

Thanks @liamcli !