open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

CUDA out of memory in GPU 0, when training in GPU different to 0 #3020

Closed igonro closed 4 years ago

igonro commented 4 years ago

Thanks for your error report and we appreciate it a lot.

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The bug has not been fixed in the latest version.

Describe the bug _I select a GPU other than 0 by exporting CUDA_VISIBLE_DEVICES, and the output of nvidia-smi shows the process starting on that GPU. But when training is about to start, it reports that GPU 0 is out of memory (there is already a training job running on GPU 0). So whenever GPU 0 is busy, I can't train on any other GPU._

Reproduction

  1. What command or script did you run?

    export CUDA_VISIBLE_DEVICES=1
    python tools/train.py .../configs/cascade.py --work-dir .../work_dirs/
  2. Did you make any modifications on the code or config? Did you understand what you have modified? I didn't modify the code. I made some minor changes to the configs and, yes, I understand what I modified.

  3. What dataset did you use? I use a CocoDataset with custom classes.

Environment

  1. Please run python mmdet/utils/collect_env.py to collect necessary environment information and paste it here.
sys.platform: linux
Python: 3.6.10 |Anaconda, Inc.| (default, May  8 2020, 02:54:21) [GCC 7.3.0]
CUDA available: True
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.1, V10.1.105
GPU 0: GeForce GTX 1080 Ti
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.4.0
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - Intel(R) Math Kernel Library Version 2020.0.1 Product Build 20200208 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CUDA Runtime 10.0
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
  - CuDNN 7.6.3
  - Magma 2.5.1
  - Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF, 

TorchVision: 0.5.0
OpenCV: 4.2.0
MMCV: 0.5.9
MMDetection: 2.0.0+c77ccbb
MMDetection Compiler: GCC 7.5
MMDetection CUDA Compiler: 10.1
  2. You may add additional information that may be helpful for locating the problem, such as
    • How you installed PyTorch [e.g., pip, conda, source]: I installed it with conda, following the instructions in INSTALL.md.
    • Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)

Error traceback If applicable, paste the error traceback here.

2020-06-09 08:54:15.283 | ERROR    | __main__:<module>:211 - An error has been caught in function '<module>', process 'MainProcess' (3645), thread 'MainThread' (140071342077760):
Traceback (most recent call last):

> File "stages/run_experiment_mlflow.py", line 211, in <module>
    main()
  File "stages/run_experiment_mlflow.py", line 207, in main
    fire.Fire(run_experiment_mlflow)
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      │     │          │     │                 │        └ 'run_experiment_mlflow.py'
                      │     │          │     │                 └ {}
                      │     │          │     └ Namespace(completion=None, help=False, interactive=False, separator='-', trace=False, verbose=False)
           │         └ <attribute '__name__' of 'function' objects>
           └ <function run_experiment_mlflow at 0x7f63fde316a8>
  File "/opt/miniconda3/envs/mlflow-185b9c781ffde5765f6a22a12949cf7cfa347af5/lib/python3.6/site-packages/fire/core.py", line 672, in _CallAndUpdateTrace
  File "stages/run_experiment_mlflow.py", line 161, in run_experiment_mlflow
    meta=meta)
         └ {'env_info': 'sys.platform: linux\nPython: 3.6.10 |Anaconda, Inc.| (default, May  8 2020, 02:54:21) [GCC 7.3.0]\nCUDA availab...

    │      │   └ [<torch.utils.data.dataloader.DataLoader object at 0x7f644c059940>]
    │      └ <function Runner.run at 0x7f6463f3f400>
    └ <mmcv.runner.runner.Runner object at 0x7f644c0599b0>
  File "/opt/miniconda3/envs/mlflow-185b9c781ffde5765f6a22a12949cf7cfa347af5/lib/python3.6/site-packages/mmcv/runner/runner.py", line 384, in run
    epoch_runner(data_loaders[i], **kwargs)
    │            │            │     └ {}
    │            └ [<torch.utils.data.dataloader.DataLoader object at 0x7f644c059940>]
    └ <bound method Runner.train of <mmcv.runner.runner.Runner object at 0x7f644c0599b0>>
  File "/opt/miniconda3/envs/mlflow-185b9c781ffde5765f6a22a12949cf7cfa347af5/lib/python3.6/site-packages/mmcv/runner/runner.py", line 283, in train
    self.model, data_batch, train_mode=True, **kwargs)
    │    │      │                              └ {}
    │    │      └ {'img_metas': DataContainer([[{'filename': '/media/VA/databases/DRONE_VS_BIRD/crops/DroneVsBird_train_SAMPLED_0.1_CROPS_720/i...
    └ <mmcv.runner.runner.Runner object at 0x7f644c0599b0>
  File "/opt/miniconda3/envs/mlflow-185b9c781ffde5765f6a22a12949cf7cfa347af5/lib/python3.6/site-packages/mmdet/apis/train.py", line 74, in batch_processor
    losses = model(**data)
             │       └ {'img_metas': DataContainer([[{'filename': '/media/VA/databases/DRONE_VS_BIRD/crops/DroneVsBird_train_SAMPLED_0.1_CROPS_720/i...
  File "/opt/miniconda3/envs/mlflow-185b9c781ffde5765f6a22a12949cf7cfa347af5/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
             │    │        │        └ {'img_metas': DataContainer([[{'filename': '/media/VA/databases/DRONE_VS_BIRD/crops/DroneVsBird_train_SAMPLED_0.1_CROPS_720/i...
                   (backbone): ResNeXt(
                     (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2,...
  File "/opt/miniconda3/envs/mlflow-185b9c781ffde5765f6a22a12949cf7cfa347af5/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
           │            │            └ ({'img_metas': [{'filename': '/media/VA/databases/DRONE_VS_BIRD/crops/DroneVsBird_train_SAMPLED_0.1_CROPS_720/images/00_09_30...
                   (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2,...
  File "/opt/miniconda3/envs/mlflow-185b9c781ffde5765f6a22a12949cf7cfa347af5/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
             │    │        │        └ {'img_metas': [{'filename': '/media/VA/databases/DRONE_VS_BIRD/crops/DroneVsBird_train_SAMPLED_0.1_CROPS_720/images/00_09_30_...
  File "/opt/miniconda3/envs/mlflow-185b9c781ffde5765f6a22a12949cf7cfa347af5/lib/python3.6/site-packages/mmdet/core/fp16/decorators.py", line 49, in new_func
    return old_func(*args, **kwargs)
           │         │       └ {'img_metas': [{'filename': '/media/VA/databases/DRONE_VS_BIRD/crops/DroneVsBird_train_SAMPLED_0.1_CROPS_720/images/00_09_30_...
           │         └ (CascadeRCNN(
           │             (backbone): ResNeXt(
           │               (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False...
  File "/opt/miniconda3/envs/mlflow-185b9c781ffde5765f6a22a12949cf7cfa347af5/lib/python3.6/site-packages/mmdet/models/detectors/base.py", line 148, in forward
           │    │             └ tensor([[[[ 0.7591,  0.7591,  0.7419,  ..., -2.1179, -2.1179, -2.1179],
           │    │                         [ 0.7591,  0.7591,  0.7419,  ..., -2.1179, ...
           │    └ <function TwoStageDetector.forward_train at 0x7f640086bea0>
    x = self.extract_feat(img)
        │    │            └ tensor([[[[ 0.7591,  0.7591,  0.7419,  ..., -2.1179, -2.1179, -2.1179],
        │    │                        [ 0.7591,  0.7591,  0.7419,  ..., -2.1179, ...
        │    └ <function TwoStageDetector.extract_feat at 0x7f640086bd90>
        └ CascadeRCNN(
        │             └ tensor([[[[ 0.7591,  0.7591,  0.7419,  ..., -2.1179, -2.1179, -2.1179],
        │                         [ 0.7591,  0.7591,  0.7419,  ..., -2.1179, ...
        └ CascadeRCNN(
  File "/opt/miniconda3/envs/mlflow-185b9c781ffde5765f6a22a12949cf7cfa347af5/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
             │    │        │        └ {}
             │    │        └ (tensor([[[[ 0.7591,  0.7591,  0.7419,  ..., -2.1179, -2.1179, -2.1179],
             │    │                    [ 0.7591,  0.7591,  0.7419,  ..., -2.1179,...
  File "/opt/miniconda3/envs/mlflow-185b9c781ffde5765f6a22a12949cf7cfa347af5/lib/python3.6/site-packages/mmdet/models/backbones/resnet.py", line 601, in forward
    x = res_layer(x)
        │         └ tensor([[[[0.0000e+00, 0.0000e+00, 0.0000e+00,  ..., 0.0000e+00,
        │                      0.0000e+00, 0.0000e+00],
        │                     [0.0000e+00, 0...
        └ ResLayer(
              (bn1): BatchN...
  File "/opt/miniconda3/envs/mlflow-185b9c781ffde5765f6a22a12949cf7cfa347af5/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
             │    │                     0.0000e+00, 0.0000e+00],
             │    │                    [0.0000e+00, ...
             │    └ <function Sequential.forward at 0x7f648a31fb70>
             └ ResLayer(
                 (0): Bottleneck(
                   (conv1): Conv2d(1024, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
                   (bn1): BatchN...
            │      └ tensor([[[[0.9162, 1.3605, 1.4720,  ..., 0.9338, 1.0213, 1.5810],
            │                  [0.4824, 0.0918, 0.1627,  ..., 0.0000, 0.0462, 0....
            └ Bottleneck(
                (conv1): Conv2d(2048, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
                (bn1): BatchNorm2d(1024, eps=1e-05...
  File "/opt/miniconda3/envs/mlflow-185b9c781ffde5765f6a22a12949cf7cfa347af5/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
             │    │        │        └ {}
             │    │        └ (tensor([[[[0.9162, 1.3605, 1.4720,  ..., 0.9338, 1.0213, 1.5810],
             │    │                    [0.4824, 0.0918, 0.1627,  ..., 0.0000, 0.0462, 0...
             │    └ <function Bottleneck.forward at 0x7f6400d616a8>
             └ Bottleneck(
                 (conv1): Conv2d(2048, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
                 (bn1): BatchNorm2d(1024, eps=1e-05...
  File "/opt/miniconda3/envs/mlflow-185b9c781ffde5765f6a22a12949cf7cfa347af5/lib/python3.6/site-packages/mmdet/models/backbones/resnet.py", line 280, in forward
    out = _inner_forward(x)
          │              └ tensor([[[[0.9162, 1.3605, 1.4720,  ..., 0.9338, 1.0213, 1.5810],
          │                          [0.4824, 0.0918, 0.1627,  ..., 0.0000, 0.0462, 0....
          └ <function Bottleneck.forward.<locals>._inner_forward at 0x7f6453a82598>
  File "/opt/miniconda3/envs/mlflow-185b9c781ffde5765f6a22a12949cf7cfa347af5/lib/python3.6/site-packages/mmdet/models/backbones/resnet.py", line 258, in _inner_forward
    out = self.norm2(out)
          │    │     └ tensor([[[[-7.8127e-03, -8.7999e-03, -8.6583e-03,  ..., -7.1499e-03,
          │    │                  -7.6180e-03, -4.3470e-03],
          │    │                 [-7.3230...
          │    └ <property object at 0x7f6400d55db8>
          └ Bottleneck(
              (conv1): Conv2d(2048, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
              (bn1): BatchNorm2d(1024, eps=1e-05...
  File "/opt/miniconda3/envs/mlflow-185b9c781ffde5765f6a22a12949cf7cfa347af5/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
             │    │        │        └ {}
             │    │        └ (tensor([[[[-7.8127e-03, -8.7999e-03, -8.6583e-03,  ..., -7.1499e-03,
             │    │                     -7.6180e-03, -4.3470e-03],
             │    │                    [-7.323...
             │    └ <function _BatchNorm.forward at 0x7f648a2ed620>
             └ BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  File "/opt/miniconda3/envs/mlflow-185b9c781ffde5765f6a22a12949cf7cfa347af5/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 107, in forward
    exponential_average_factor, self.eps)
    │                           │    └ 1e-05
    │                           └ BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    └ 0.1
  File "/opt/miniconda3/envs/mlflow-185b9c781ffde5765f6a22a12949cf7cfa347af5/lib/python3.6/site-packages/torch/nn/functional.py", line 1670, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
    │         │         │    │     │        │     └ <torch.backends.ContextProp object at 0x7f648a2369e8>
    │         │         │    │     │        └ <module 'torch.backends.cudnn' from '/opt/miniconda3/envs/mlflow-185b9c781ffde5765f6a22a12949cf7cfa347af5/lib/python3.6/site-...
    │         │         │    │     └ <module 'torch.backends' from '/opt/miniconda3/envs/mlflow-185b9c781ffde5765f6a22a12949cf7cfa347af5/lib/python3.6/site-packag...
    │         │         │    └ <module 'torch' from '/opt/miniconda3/envs/mlflow-185b9c781ffde5765f6a22a12949cf7cfa347af5/lib/python3.6/site-packages/torch/...
    │         │         └ 1e-05
    │         └ 0.1
    └ False

RuntimeError: CUDA out of memory. Tried to allocate 18.00 MiB (GPU 0; 10.76 GiB total capacity; 9.81 GiB already allocated; 19.12 MiB free; 9.96 GiB reserved in total by PyTorch)
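The figures in this message can be pulled apart with a short sketch. It assumes the message format shown above (as printed by PyTorch 1.4); the helper name `parse_oom_message` is hypothetical and not part of PyTorch. The "already allocated" and "reserved" numbers are reported by PyTorch's caching allocator for the current process, so a value close to the total capacity suggests the memory is held by this training run rather than by another job.

```python
import re

# Error text copied from the traceback above.
MESSAGE = (
    "CUDA out of memory. Tried to allocate 18.00 MiB (GPU 0; 10.76 GiB total "
    "capacity; 9.81 GiB already allocated; 19.12 MiB free; 9.96 GiB reserved "
    "in total by PyTorch)"
)

_UNITS_IN_MIB = {"MiB": 1.0, "GiB": 1024.0}


def parse_oom_message(message):
    """Hypothetical helper: extract the sizes (in MiB) from an OOM message."""

    def size(pattern):
        # Each pattern captures a number and its unit (MiB or GiB).
        match = re.search(pattern, message)
        return float(match.group(1)) * _UNITS_IN_MIB[match.group(2)]

    return {
        "device": int(re.search(r"GPU (\d+);", message).group(1)),
        "requested": size(r"Tried to allocate ([\d.]+) (MiB|GiB)"),
        "total": size(r"([\d.]+) (MiB|GiB) total capacity"),
        "allocated": size(r"([\d.]+) (MiB|GiB) already allocated"),
        "free": size(r"([\d.]+) (MiB|GiB) free"),
        "reserved": size(r"([\d.]+) (MiB|GiB) reserved"),
    }


info = parse_oom_message(MESSAGE)
share = info["allocated"] / info["total"]
print(f"this process holds {share:.0%} of the device's memory")
```

Here the process itself accounts for roughly nine tenths of the card, which points toward batch size or model size rather than a second job on the same device.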

Bug fix If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated! I haven't identified the reason, but if you can give me some hints about what the problem could be, I would be glad to help and send a PR if I can fix it.

igonro commented 4 years ago

Sorry, I think this was my mistake. Although the message says GPU 0 is out of memory, it appears to refer to the correct GPU: with a lower batch size the error goes away. I'm closing this issue because it is not a bug.
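For anyone who hits the same confusion: when CUDA_VISIBLE_DEVICES is set, CUDA renumbers the visible devices from 0 inside the process, so with `CUDA_VISIBLE_DEVICES=1` the physical GPU 1 shows up as "GPU 0" in PyTorch error messages. A minimal sketch of that mapping follows. `physical_gpu_id` is a hypothetical helper, not a PyTorch or MMDetection function, and it assumes the variable holds a plain comma-separated list of integer indices (not UUIDs). It also assumes CUDA's device order matches the one shown by nvidia-smi, which is not guaranteed unless `CUDA_DEVICE_ORDER=PCI_BUS_ID` is set.

```python
import os


def physical_gpu_id(logical_index, visible_devices=None):
    """Hypothetical helper: map a process-local index to a physical GPU id.

    Assumes CUDA_VISIBLE_DEVICES is a comma-separated list of integer ids.
    """
    if visible_devices is None:
        visible_devices = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible_devices is None:
        # Variable unset: all GPUs are visible and indices are unchanged.
        return logical_index
    ids = [int(part) for part in visible_devices.split(",") if part.strip()]
    return ids[logical_index]


# With the export from the reproduction steps, "GPU 0" is physical GPU 1.
print(physical_gpu_id(0, visible_devices="1"))
```

With `CUDA_VISIBLE_DEVICES=2,3`, the same helper would map logical index 1 to physical GPU 3.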