Hi, @Benybrahim
You can try with_cp=True; please check here for more details.
By the way, SETR needs a lot of computational capacity. I suggest switching to a non-transformer model that needs less GPU memory, because even with batch_size=1 a GTX 1080 Ti cannot ensure normal model training.
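For reference, a minimal sketch of what that override might look like in a config file (assuming the backbone you use actually exposes a with_cp argument, as the ResNet-based backbones in MMSegmentation do; the base config path is a placeholder):

```python
_base_ = './your_base_config.py'  # placeholder: the config you are training from

model = dict(
    backbone=dict(
        # Trade compute for memory: re-run the forward pass of each block
        # during backward instead of caching all intermediate activations.
        with_cp=True))
```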
Best,
Thank you @MengzhangLI.
A better GPU will instantly solve the problem.
Using with_cp=True will not work in my case, since the VIT_MLA model that I'm using is customized and doesn't have a with_cp parameter.
I posted the question here too: https://github.com/LARC-CMU-SMU/FoodSeg103-Benchmark-v1
Thank you again.
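(Side note for anyone in the same situation: even when a custom backbone has no with_cp flag, the same memory saving can usually be added by hand with torch.utils.checkpoint. The sketch below is a toy example, not the VIT_MLA code; it assumes a backbone whose forward simply iterates over a list of blocks.)

```python
import torch
import torch.utils.checkpoint as cp


class CheckpointedBackbone(torch.nn.Module):
    """Toy backbone that recomputes each block during backward to save memory."""

    def __init__(self, layers, use_checkpoint=True):
        super().__init__()
        self.layers = torch.nn.ModuleList(layers)
        self.use_checkpoint = use_checkpoint

    def forward(self, x):
        for layer in self.layers:
            if self.use_checkpoint and x.requires_grad:
                # Activations of `layer` are not stored; they are recomputed
                # in the backward pass, trading speed for GPU memory.
                x = cp.checkpoint(layer, x)
            else:
                x = layer(x)
        return x
```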
OK, very happy to hear you won't be bothered by the GPU memory error anymore.
Do you think it would be meaningful to integrate this FoodSeg103 dataset into MMSegmentation, and what problems might we meet if we plan to integrate it? We are interested in supporting more datasets.
Best,
I think you can ask @XiongweiWu, since he is the owner of the repo, but I guess it should be meaningful, since it is the only dataset with food ingredient segmentation.
@Benybrahim I am very glad to see you have addressed the problem, and thanks for letting me know.
@MengzhangLI Hi, first of all, thanks for your suggestion! I think the dataset is meaningful since it's the only dataset for fine-grained food ingredient segmentation, and we are also glad to have more researchers involved in this task. I need to discuss it carefully with our project leader to avoid any license issues, and I will update you when we finish.
Hi, @XiongweiWu
Thanks for your nice reply. We do hope we can support this great dataset for the community, and it would absolutely get more researchers involved in it.
Feel free to contact us anytime.
Best,
@MengzhangLI Hi Mengzhang, I am sorry for replying late; I was involved in a COVID-19 positive case recently (luckily I tested negative in the end). I have just seen your email, and our project leader also agreed to merge the dataset into the official mmsegmentation repo, but we hope other researchers can still download the dataset via the application form (so that we can trace the download records) and cite our paper if they use the dataset.
Wow, that is great news! Thanks for your kind and generous support!
Also very happy to hear you are negative and healthy.
You can see from our previous dataset preparations that we strictly follow the rules of data usage: users of MMSegmentation must go to the original website of each dataset to accept the license and finish registration. So we can absolutely meet your leader's requirements. ;)
Let us keep in touch, and we hope to support benchmark models as soon as possible!
@MengzhangLI Hi, sorry for the basic question.
I get the same error regarding GPU memory when trying to train PSPNet on the Cityscapes dataset with a single GPU.
I have already changed SyncBN to BN in the config file, set batch_size=1, and used with_cp=True, but I still get a GPU memory error.
The GPU is an NVIDIA GTX 1060.
Thanks in advance
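(For concreteness, the kinds of overrides described above would typically look something like this in an MMSegmentation 0.x config; the base config path is a placeholder.)

```python
_base_ = './pspnet_r50-d8_512x1024_40k_cityscapes.py'  # placeholder base config

# Plain BN instead of SyncBN, since training runs on a single GPU.
norm_cfg = dict(type='BN', requires_grad=True)

model = dict(
    backbone=dict(norm_cfg=norm_cfg, with_cp=True),
    decode_head=dict(norm_cfg=norm_cfg),
    auxiliary_head=dict(norm_cfg=norm_cfg))

# batch_size = samples_per_gpu * number of GPUs; keep it at 1 per GPU here.
data = dict(samples_per_gpu=1, workers_per_gpu=1)
```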
Here is the log:
and this is what appears after the log in the terminal:
2021-09-07 18:20:27,626 - mmseg - INFO - workflow: [('train', 1)], max: 40000 iters
Traceback (most recent call last):
  File "tools/train.py", line 167, in <module>
    main()
  File "tools/train.py", line 156, in main
    train_segmentor(
  File "/home/babak/virtualenvs/env3/mmsegmentation/mmseg/apis/train.py", line 120, in train_segmentor
    runner.run(data_loaders, cfg.workflow)
  File "/home/babak/virtualenvs/env3/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 133, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/home/babak/virtualenvs/env3/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 66, in train
    self.call_hook('after_train_iter')
  File "/home/babak/virtualenvs/env3/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/babak/virtualenvs/env3/lib/python3.8/site-packages/mmcv/runner/hooks/optimizer.py", line 35, in after_train_iter
    runner.outputs['loss'].backward()
  File "/home/babak/virtualenvs/env3/lib/python3.8/site-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/babak/virtualenvs/env3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 125, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA out of memory. Tried to allocate 936.00 MiB (GPU 0; 5.93 GiB total capacity; 2.59 GiB already allocated; 1022.25 MiB free; 2.78 GiB reserved in total by PyTorch)
Exception raised from malloc at /pytorch/c10/cuda/CUDACachingAllocator.cpp:272 (most recent call first):
[C++ backtrace, frames #0-#28: c10::cuda::CUDACachingAllocator, at::native::cudnn_convolution_backward_weight, at::cudnn_convolution_backward, torch::autograd::generated::CudnnConvolutionBackward::apply, torch::autograd::Engine in libc10, libc10_cuda, libtorch_cuda, libtorch_cpu, libtorch_python, libstdc++, libpthread, libc]
I think it is caused by small GPU memory. Maybe you can try some tiny models with a ResNet-18 backbone.
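As a rough illustration (not an official recipe), switching a PSPNet config to a ResNet-18 backbone usually means overriding the backbone depth and the head input channels, roughly like this; the exact channel numbers should be taken from the r18 configs shipped with MMSegmentation:

```python
_base_ = './pspnet_r50-d8_512x1024_40k_cityscapes.py'  # assumed base config

model = dict(
    # ImageNet-pretrained ResNet-18 weights from the open-mmlab model zoo.
    pretrained='open-mmlab://resnet18_v1c',
    backbone=dict(depth=18),
    # ResNet-18 ends with 512 channels (vs. 2048 for ResNet-50),
    # so the decode/auxiliary heads need matching input widths.
    decode_head=dict(in_channels=512, channels=128),
    auxiliary_head=dict(in_channels=256, channels=64))
```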
Looking forward to your feedback.
@MengzhangLI Thank you so much for the quick response and help. I tried PSPNet, DeepLabv3, and FCN, all of them with a ResNet backbone of depth 18. According to the config tables, all three of them are supposed to use less than 2 GB of GPU memory. Strangely, I got the same GPU memory error with PSPNet and DeepLabv3, but FCN is working now. Although they are different models, with the same backbone and depth the allocated memory should be on the same level, right? Unless the listed memory usage in the config tables is per GPU (you used 4 in your trainings, if I'm correct), but that should have been compensated by the batch_size. Is there a way that I can reduce memory usage and trade it for speed? The card that I have now has 6 GB of memory and is still problematic. I would appreciate it very much if you could give me your insights on this.
Maybe you could try FP16; see this config for an example: https://github.com/open-mmlab/mmsegmentation/blob/master/configs/bisenetv2/bisenetv2_fcn_fp16_4x4_1024x1024_160k_cityscapes.py
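If it helps, the FP16 configs in MMSegmentation 0.x essentially just swap in mmcv's Fp16OptimizerHook on top of an existing config. A minimal sketch (the base config path and loss_scale value here are illustrative):

```python
_base_ = './pspnet_r18-d8_512x1024_80k_cityscapes.py'  # assumed base config

# Run forward/backward in half precision; the loss is scaled up before
# backward and gradients are unscaled afterwards to avoid FP16 underflow.
optimizer_config = dict(type='Fp16OptimizerHook', loss_scale=512.)
fp16 = dict()  # placeholder that marks the model for FP16 wrapping
```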
Thanks for your error report and we appreciate it a lot.
Checklist
Describe the bug
I'm trying to finetune a food segmentation model, found here, on a new dataset.
When trying to train the model, I got this error. The batch_size is set to 1. Thank you in advance for any insights you can give.
Reproduction
Command
Configuration file
I used this Japanese dataset for food segmentation: https://mm.cs.uec.ac.jp/uecfoodpix/. I got the model from the FoodSeg repo and tried to finetune it on the Japanese data.
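(A hedged sketch of what that kind of finetuning setup usually looks like in an MMSegmentation config; the base config path, checkpoint path, and class count below are placeholders, not values from the FoodSeg repo.)

```python
_base_ = './your_foodseg_base_config.py'  # placeholder: config the checkpoint was trained with

num_classes = 102  # placeholder: number of classes in the new dataset

model = dict(
    decode_head=dict(num_classes=num_classes),
    # Only needed if the base model actually defines an auxiliary head.
    auxiliary_head=dict(num_classes=num_classes))

# Start from the pretrained FoodSeg checkpoint instead of random initialization.
load_from = 'checkpoints/foodseg_pretrained.pth'  # placeholder path
```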
Environment
Error traceback