Thanks for your error report; we appreciate it a lot.
Checklist
1. I have searched related issues but could not get the expected help.
2. The bug has not been fixed in the latest version.
Describe the bug
_I select a GPU other than 0 by exporting CUDA_VISIBLE_DEVICES. The output of nvidia-smi shows the process starting up on that GPU, but as soon as training actually starts, it reports that GPU 0 is out of memory (because another training job is already running on GPU 0). So whenever GPU 0 is busy, I cannot train on any other GPU._
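For concreteness, here is a minimal sketch of the selection step described above (the GPU index 1 is illustrative; setting the variable in the shell before launching the script is equivalent):

```python
# Equivalent to `export CUDA_VISIBLE_DEVICES=1` in the shell before launching
# stages/run_experiment_mlflow.py (GPU index 1 is illustrative):
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # must happen before CUDA is initialized

import torch

print(torch.cuda.device_count())  # -> 1: only the selected GPU is visible
```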
Reproduction
What command or script did you run?
Did you make any modifications to the code or config? Did you understand what you have modified?
I didn't modify the code. I made only minor changes to the configs, and yes, I understand what I modified.
What dataset did you use?
I used a CocoDataset with custom classes.
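For context, a hedged sketch of one common way custom classes are wired into a CocoDataset in mmdetection 1.x (the class name and label set here are hypothetical; my actual configs differ):

```python
# Hedged sketch, mmdet 1.x-style registry: register a small subclass of
# CocoDataset that overrides the class names.
from mmdet.datasets import CocoDataset
from mmdet.datasets.registry import DATASETS


@DATASETS.register_module
class DroneVsBirdDataset(CocoDataset):
    CLASSES = ('drone',)  # hypothetical label set
```

The config then refers to it via `dataset_type = 'DroneVsBirdDataset'`.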
Environment
Please run python mmdet/utils/collect_env.py to collect the necessary environment information and paste it here (a rough sketch of what that step reports follows this list).
You may add additional information that may be helpful for locating the problem, such as
How you installed PyTorch [e.g., pip, conda, source]
I installed it with conda, following the instructions in INSTALL.md.
Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)
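As a rough, hand-rolled approximation of what the collect_env step reports (the real script is mmdet/utils/collect_env.py; this sketch only covers the basics):

```python
# Hand-rolled approximation of collect_env output for quick reference:
import sys

import torch

print("sys.platform:", sys.platform)
print("Python:", sys.version)
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU 0:", torch.cuda.get_device_name(0))
```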
Error traceback
If applicable, paste the error traceback here.
2020-06-09 08:54:15.283 | ERROR | __main__:<module>:211 - An error has been caught in function '<module>', process 'MainProcess' (3645), thread 'MainThread' (140071342077760):
Traceback (most recent call last):
> File "stages/run_experiment_mlflow.py", line 211, in <module>
main()
File "stages/run_experiment_mlflow.py", line 207, in main
fire.Fire(run_experiment_mlflow)
component_trace = _Fire(component, args, parsed_flag_args, context, name)
│ │ │ │ │ └ 'run_experiment_mlflow.py'
│ │ │ │ └ {}
│ │ │ └ Namespace(completion=None, help=False, interactive=False, separator='-', trace=False, verbose=False)
│ └ <attribute '__name__' of 'function' objects>
└ <function run_experiment_mlflow at 0x7f63fde316a8>
File "/opt/miniconda3/envs/mlflow-185b9c781ffde5765f6a22a12949cf7cfa347af5/lib/python3.6/site-packages/fire/core.py", line 672, in _CallAndUpdateTrace
File "stages/run_experiment_mlflow.py", line 161, in run_experiment_mlflow
meta=meta)
└ {'env_info': 'sys.platform: linux\nPython: 3.6.10 |Anaconda, Inc.| (default, May 8 2020, 02:54:21) [GCC 7.3.0]\nCUDA availab...
│ │ └ [<torch.utils.data.dataloader.DataLoader object at 0x7f644c059940>]
│ └ <function Runner.run at 0x7f6463f3f400>
└ <mmcv.runner.runner.Runner object at 0x7f644c0599b0>
File "/opt/miniconda3/envs/mlflow-185b9c781ffde5765f6a22a12949cf7cfa347af5/lib/python3.6/site-packages/mmcv/runner/runner.py", line 384, in run
epoch_runner(data_loaders[i], **kwargs)
│ │ │ └ {}
│ └ [<torch.utils.data.dataloader.DataLoader object at 0x7f644c059940>]
└ <bound method Runner.train of <mmcv.runner.runner.Runner object at 0x7f644c0599b0>>
File "/opt/miniconda3/envs/mlflow-185b9c781ffde5765f6a22a12949cf7cfa347af5/lib/python3.6/site-packages/mmcv/runner/runner.py", line 283, in train
self.model, data_batch, train_mode=True, **kwargs)
│ │ │ └ {}
│ │ └ {'img_metas': DataContainer([[{'filename': '/media/VA/databases/DRONE_VS_BIRD/crops/DroneVsBird_train_SAMPLED_0.1_CROPS_720/i...
└ <mmcv.runner.runner.Runner object at 0x7f644c0599b0>
File "/opt/miniconda3/envs/mlflow-185b9c781ffde5765f6a22a12949cf7cfa347af5/lib/python3.6/site-packages/mmdet/apis/train.py", line 74, in batch_processor
losses = model(**data)
│ └ {'img_metas': DataContainer([[{'filename': '/media/VA/databases/DRONE_VS_BIRD/crops/DroneVsBird_train_SAMPLED_0.1_CROPS_720/i...
File "/opt/miniconda3/envs/mlflow-185b9c781ffde5765f6a22a12949cf7cfa347af5/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
│ │ │ └ {'img_metas': DataContainer([[{'filename': '/media/VA/databases/DRONE_VS_BIRD/crops/DroneVsBird_train_SAMPLED_0.1_CROPS_720/i...
(backbone): ResNeXt(
(conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2,...
File "/opt/miniconda3/envs/mlflow-185b9c781ffde5765f6a22a12949cf7cfa347af5/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
return self.module(*inputs[0], **kwargs[0])
│ │ └ ({'img_metas': [{'filename': '/media/VA/databases/DRONE_VS_BIRD/crops/DroneVsBird_train_SAMPLED_0.1_CROPS_720/images/00_09_30...
(conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2,...
File "/opt/miniconda3/envs/mlflow-185b9c781ffde5765f6a22a12949cf7cfa347af5/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
│ │ │ └ {'img_metas': [{'filename': '/media/VA/databases/DRONE_VS_BIRD/crops/DroneVsBird_train_SAMPLED_0.1_CROPS_720/images/00_09_30_...
File "/opt/miniconda3/envs/mlflow-185b9c781ffde5765f6a22a12949cf7cfa347af5/lib/python3.6/site-packages/mmdet/core/fp16/decorators.py", line 49, in new_func
return old_func(*args, **kwargs)
│ │ └ {'img_metas': [{'filename': '/media/VA/databases/DRONE_VS_BIRD/crops/DroneVsBird_train_SAMPLED_0.1_CROPS_720/images/00_09_30_...
│ └ (CascadeRCNN(
│ (backbone): ResNeXt(
│ (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False...
File "/opt/miniconda3/envs/mlflow-185b9c781ffde5765f6a22a12949cf7cfa347af5/lib/python3.6/site-packages/mmdet/models/detectors/base.py", line 148, in forward
│ │ └ tensor([[[[ 0.7591, 0.7591, 0.7419, ..., -2.1179, -2.1179, -2.1179],
│ │ [ 0.7591, 0.7591, 0.7419, ..., -2.1179, ...
│ └ <function TwoStageDetector.forward_train at 0x7f640086bea0>
x = self.extract_feat(img)
│ │ └ tensor([[[[ 0.7591, 0.7591, 0.7419, ..., -2.1179, -2.1179, -2.1179],
│ │ [ 0.7591, 0.7591, 0.7419, ..., -2.1179, ...
│ └ <function TwoStageDetector.extract_feat at 0x7f640086bd90>
└ CascadeRCNN(
│ └ tensor([[[[ 0.7591, 0.7591, 0.7419, ..., -2.1179, -2.1179, -2.1179],
│ [ 0.7591, 0.7591, 0.7419, ..., -2.1179, ...
└ CascadeRCNN(
File "/opt/miniconda3/envs/mlflow-185b9c781ffde5765f6a22a12949cf7cfa347af5/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
│ │ │ └ {}
│ │ └ (tensor([[[[ 0.7591, 0.7591, 0.7419, ..., -2.1179, -2.1179, -2.1179],
│ │ [ 0.7591, 0.7591, 0.7419, ..., -2.1179,...
File "/opt/miniconda3/envs/mlflow-185b9c781ffde5765f6a22a12949cf7cfa347af5/lib/python3.6/site-packages/mmdet/models/backbones/resnet.py", line 601, in forward
x = res_layer(x)
│ └ tensor([[[[0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 0.0000e+00,
│ 0.0000e+00, 0.0000e+00],
│ [0.0000e+00, 0...
└ ResLayer(
(bn1): BatchN...
File "/opt/miniconda3/envs/mlflow-185b9c781ffde5765f6a22a12949cf7cfa347af5/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
│ │ 0.0000e+00, 0.0000e+00],
│ │ [0.0000e+00, ...
│ └ <function Sequential.forward at 0x7f648a31fb70>
└ ResLayer(
(0): Bottleneck(
(conv1): Conv2d(1024, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchN...
│ └ tensor([[[[0.9162, 1.3605, 1.4720, ..., 0.9338, 1.0213, 1.5810],
│ [0.4824, 0.0918, 0.1627, ..., 0.0000, 0.0462, 0....
└ Bottleneck(
(conv1): Conv2d(2048, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(1024, eps=1e-05...
File "/opt/miniconda3/envs/mlflow-185b9c781ffde5765f6a22a12949cf7cfa347af5/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
│ │ │ └ {}
│ │ └ (tensor([[[[0.9162, 1.3605, 1.4720, ..., 0.9338, 1.0213, 1.5810],
│ │ [0.4824, 0.0918, 0.1627, ..., 0.0000, 0.0462, 0...
│ └ <function Bottleneck.forward at 0x7f6400d616a8>
└ Bottleneck(
(conv1): Conv2d(2048, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(1024, eps=1e-05...
File "/opt/miniconda3/envs/mlflow-185b9c781ffde5765f6a22a12949cf7cfa347af5/lib/python3.6/site-packages/mmdet/models/backbones/resnet.py", line 280, in forward
out = _inner_forward(x)
│ └ tensor([[[[0.9162, 1.3605, 1.4720, ..., 0.9338, 1.0213, 1.5810],
│ [0.4824, 0.0918, 0.1627, ..., 0.0000, 0.0462, 0....
└ <function Bottleneck.forward.<locals>._inner_forward at 0x7f6453a82598>
File "/opt/miniconda3/envs/mlflow-185b9c781ffde5765f6a22a12949cf7cfa347af5/lib/python3.6/site-packages/mmdet/models/backbones/resnet.py", line 258, in _inner_forward
out = self.norm2(out)
│ │ └ tensor([[[[-7.8127e-03, -8.7999e-03, -8.6583e-03, ..., -7.1499e-03,
│ │ -7.6180e-03, -4.3470e-03],
│ │ [-7.3230...
│ └ <property object at 0x7f6400d55db8>
└ Bottleneck(
(conv1): Conv2d(2048, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(1024, eps=1e-05...
File "/opt/miniconda3/envs/mlflow-185b9c781ffde5765f6a22a12949cf7cfa347af5/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
│ │ │ └ {}
│ │ └ (tensor([[[[-7.8127e-03, -8.7999e-03, -8.6583e-03, ..., -7.1499e-03,
│ │ -7.6180e-03, -4.3470e-03],
│ │ [-7.323...
│ └ <function _BatchNorm.forward at 0x7f648a2ed620>
└ BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
File "/opt/miniconda3/envs/mlflow-185b9c781ffde5765f6a22a12949cf7cfa347af5/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 107, in forward
exponential_average_factor, self.eps)
│ │ └ 1e-05
│ └ BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
└ 0.1
File "/opt/miniconda3/envs/mlflow-185b9c781ffde5765f6a22a12949cf7cfa347af5/lib/python3.6/site-packages/torch/nn/functional.py", line 1670, in batch_norm
training, momentum, eps, torch.backends.cudnn.enabled
│ │ │ │ │ │ └ <torch.backends.ContextProp object at 0x7f648a2369e8>
│ │ │ │ │ └ <module 'torch.backends.cudnn' from '/opt/miniconda3/envs/mlflow-185b9c781ffde5765f6a22a12949cf7cfa347af5/lib/python3.6/site-...
│ │ │ │ └ <module 'torch.backends' from '/opt/miniconda3/envs/mlflow-185b9c781ffde5765f6a22a12949cf7cfa347af5/lib/python3.6/site-packag...
│ │ │ └ <module 'torch' from '/opt/miniconda3/envs/mlflow-185b9c781ffde5765f6a22a12949cf7cfa347af5/lib/python3.6/site-packages/torch/...
│ │ └ 1e-05
│ └ 0.1
└ False
RuntimeError: CUDA out of memory. Tried to allocate 18.00 MiB (GPU 0; 10.76 GiB total capacity; 9.81 GiB already allocated; 19.12 MiB free; 9.96 GiB reserved in total by PyTorch)
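For reference, the figures quoted in that RuntimeError can be cross-checked against PyTorch's memory queries; a hedged sketch (PyTorch >= 1.4 naming):

```python
# Hedged sketch: map the numbers in the OOM message onto PyTorch's
# memory-stat queries for the device the process actually uses.
import torch

dev = torch.device("cuda:0")
props = torch.cuda.get_device_properties(dev)
print(props.total_memory)                # corresponds to "total capacity"
print(torch.cuda.memory_allocated(dev))  # corresponds to "already allocated"
print(torch.cuda.memory_reserved(dev))   # "reserved in total by PyTorch"
```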
Bug fix
If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!
I haven't identified the reason, but if you can give me some hints about what the problem could be, I would be glad to help and send a PR if I can fix it.
Sorry, I think this was my fault. Although the message says GPU 0 is out of memory, it is actually referring to the correct GPU: once CUDA_VISIBLE_DEVICES is set, PyTorch renumbers the visible devices starting from 0, so the selected GPU is reported as GPU 0 inside the process. With a lower batch size the error goes away. I am closing this issue, since it is not a bug.
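For anyone hitting the same confusion, a minimal demonstration of the renumbering (physical GPU index 1 is illustrative):

```python
# After CUDA_VISIBLE_DEVICES is set, PyTorch only sees the visible devices
# and renumbers them from 0.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # hide every GPU except physical GPU 1

import torch

# The selected GPU is logical device 0 inside this process, which is why
# the OOM message above says "GPU 0" even though physical GPU 0 is untouched.
print(torch.cuda.current_device())    # -> 0
print(torch.cuda.get_device_name(0))  # name of physical GPU 1
```

Lowering the batch size here means reducing imgs_per_gpu in the data config (mmdetection 1.x naming; 2.x calls it samples_per_gpu).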