Hi @rassabin
For GPU memory usage, you may set cudnn_benchmark=False for a more precise profile.
As for the illegal memory access, you may try to use PyTorch 1.5 to see if it works.
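For reference, the flag mentioned above is a top-level setting in the training config (a minimal sketch; the commented line is the plain-PyTorch equivalent):

cudnn_benchmark = False
# outside of mmseg configs: torch.backends.cudnn.benchmark = False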
Hi @rassabin @xvjiarui Thanks to MMLab for providing such a good semantic segmentation framework. When I used a custom dataset, I encountered the same problem. Because mmsegmentation is so heavily encapsulated, I spent at least 3 days and still failed to locate the problem.
2020-08-02 17:45:33,618 - mmseg - INFO - Environment info:
------------------------------------------------------------
sys.platform: linux
Python: 3.7.7 (default, May 7 2020, 21:25:33) [GCC 7.3.0]
CUDA available: True
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.1, V10.1.243
GPU 0: GeForce RTX 2060 SUPER
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.5.0
PyTorch compiling details: PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) Math Kernel Library Version 2020.0.1 Product Build 20200208 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 10.1
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
- CuDNN 7.6.3
- Magma 2.5.2
- Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_INTERNAL_THREADPOOL_IMPL -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,
TorchVision: 0.6.0a0+82fd1c8
OpenCV: 4.3.0
MMCV: 1.0.4
MMSegmentation: 0.5.0+2b801de
MMCV Compiler: GCC 7.5
MMCV CUDA Compiler: 10.1
2020-08-02 17:37:38,478 - mmseg - INFO - Distributed training: True
2020-08-02 17:37:38,682 - mmseg - INFO - Config:
norm_cfg = dict(type='SyncBN', requires_grad=True)
model = dict(
type='EncoderDecoder',
pretrained='open-mmlab://resnet50_v1c',
backbone=dict(
type='ResNetV1c',
depth=50,
num_stages=4,
out_indices=(0, 1, 2, 3),
dilations=(1, 1, 2, 4),
strides=(1, 2, 1, 1),
norm_cfg=dict(type='SyncBN', requires_grad=True),
norm_eval=False,
style='pytorch',
contract_dilation=True),
decode_head=dict(
type='FCNHead',
in_channels=2048,
in_index=3,
channels=512,
num_convs=2,
concat_input=True,
dropout_ratio=0.1,
num_classes=17,
norm_cfg=dict(type='SyncBN', requires_grad=True),
align_corners=False,
loss_decode=dict(
type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0)),
auxiliary_head=dict(
type='FCNHead',
in_channels=1024,
in_index=2,
channels=256,
num_convs=1,
concat_input=False,
dropout_ratio=0.1,
num_classes=17,
norm_cfg=dict(type='SyncBN', requires_grad=True),
align_corners=False,
loss_decode=dict(
type='CrossEntropyLoss', use_sigmoid=False, loss_weight=0.4)))
train_cfg = dict()
test_cfg = dict(mode='whole')
dataset_type = 'SkmtDataset'
data_root = 'data/VOCdevkit/Seg/skmt5'
img_norm_cfg = dict(
mean=[34.73, 34.81, 34.45], std=[13.96, 13.93, 14.05], to_rgb=True)
crop_size = (512, 512)
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='LoadAnnotations'),
dict(type='Resize', img_scale=(2048, 512), ratio_range=(0.5, 2.0)),
dict(type='RandomCrop', crop_size=(512, 512), cat_max_ratio=0.75),
dict(type='RandomFlip', flip_ratio=0.5),
dict(type='PhotoMetricDistortion'),
dict(
type='Normalize',
mean=[34.73, 34.81, 34.45],
std=[13.96, 13.93, 14.05],
to_rgb=True),
dict(type='Pad', size=(512, 512), pad_val=0, seg_pad_val=255),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img', 'gt_semantic_seg'])
]
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(2048, 512),
flip=False,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='RandomFlip'),
dict(
type='Normalize',
mean=[34.73, 34.81, 34.45],
std=[13.96, 13.93, 14.05],
to_rgb=True),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
])
]
data = dict(
samples_per_gpu=4,
workers_per_gpu=4,
train=dict(
type='SkmtDataset',
data_root='data/VOCdevkit/Seg/skmt5',
img_dir='JPEGImages',
ann_dir='SegmentationClass',
split='ImageSets/Segmentation/train.txt',
pipeline=[
dict(type='LoadImageFromFile'),
dict(type='LoadAnnotations'),
dict(type='Resize', img_scale=(2048, 512), ratio_range=(0.5, 2.0)),
dict(type='RandomCrop', crop_size=(512, 512), cat_max_ratio=0.75),
dict(type='RandomFlip', flip_ratio=0.5),
dict(type='PhotoMetricDistortion'),
dict(
type='Normalize',
mean=[34.73, 34.81, 34.45],
std=[13.96, 13.93, 14.05],
to_rgb=True),
dict(type='Pad', size=(512, 512), pad_val=0, seg_pad_val=255),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img', 'gt_semantic_seg'])
]),
val=dict(
type='SkmtDataset',
data_root='data/VOCdevkit/Seg/skmt5',
img_dir='JPEGImages',
ann_dir='SegmentationClass',
split='ImageSets/Segmentation/val.txt',
pipeline=[
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(2048, 512),
flip=False,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='RandomFlip'),
dict(
type='Normalize',
mean=[34.73, 34.81, 34.45],
std=[13.96, 13.93, 14.05],
to_rgb=True),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
])
]),
test=dict(
type='SkmtDataset',
data_root='data/VOCdevkit/Seg/skmt5',
img_dir='JPEGImages',
ann_dir='SegmentationClass',
split='ImageSets/Segmentation/val.txt',
pipeline=[
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(2048, 512),
flip=False,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='RandomFlip'),
dict(
type='Normalize',
mean=[34.73, 34.81, 34.45],
std=[13.96, 13.93, 14.05],
to_rgb=True),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
])
]))
log_config = dict(
interval=50, hooks=[dict(type='TextLoggerHook', by_epoch=False)])
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
cudnn_benchmark = True
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0005)
optimizer_config = dict()
lr_config = dict(policy='poly', power=0.9, min_lr=0.0001, by_epoch=False)
total_iters = 20000
checkpoint_config = dict(by_epoch=False, interval=2000)
evaluation = dict(interval=2000, metric='mIoU')
work_dir = './work_dirs/fcn_r50-d8_512x512_20k_voc12aug'
gpu_ids = range(0, 1)
2020-08-02 17:45:34,250 - mmseg - INFO - Loaded 107 images
2020-08-02 17:45:35,864 - mmseg - INFO - Loaded 107 images
2020-08-02 17:45:35,864 - mmseg - INFO - Start running, host: liuxin@liuxin, work_dir: /media/Program/CV/Project/SKMT/mmsegmentation/work_dirs/fcn_r50-d8_512x512_20k_voc12aug
2020-08-02 17:45:35,864 - mmseg - INFO - workflow: [('train', 1)], max: 20000 iters
Traceback (most recent call last):
File "tools/train.py", line 159, in <module>
main()
File "tools/train.py", line 155, in main
meta=meta)
File "/media/Program/CV/Project/SKMT/mmsegmentation/mmseg/apis/train.py", line 105, in train_segmentor
runner.run(data_loaders, cfg.workflow, cfg.total_iters)
File "/media/Program/CV/Project/mmcv/mmcv/runner/iter_based_runner.py", line 119, in run
iter_runner(iter_loaders[i], **kwargs)
File "/media/Program/CV/Project/mmcv/mmcv/runner/iter_based_runner.py", line 55, in train
outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
File "/media/Program/CV/Project/mmcv/mmcv/parallel/distributed.py", line 36, in train_step
output = self.module.train_step(*inputs[0], **kwargs[0])
File "/media/Program/CV/Project/SKMT/mmsegmentation/mmseg/models/segmentors/base.py", line 153, in train_step
loss, log_vars = self._parse_losses(losses)
File "/media/Program/CV/Project/SKMT/mmsegmentation/mmseg/models/segmentors/base.py", line 204, in _parse_losses
log_vars[loss_name] = loss_value.item()
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered (insert_events at /opt/conda/conda-bld/pytorch_1587428398394/work/c10/cuda/CUDACachingAllocator.cpp:771)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x4e (0x7f7201a47b5e in /home/liuxin/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x6d0 (0x7f7201802e30 in /home/liuxin/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f7201a356ed in /home/liuxin/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x51ee0a (0x7f722ea62e0a in /home/liuxin/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x1a311e (0x557099a8f11e in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #5: <unknown function> + 0xfdfc8 (0x5570999e9fc8 in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #6: <unknown function> + 0x10f147 (0x5570999fb147 in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #7: <unknown function> + 0x10f15d (0x5570999fb15d in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #8: <unknown function> + 0x10f15d (0x5570999fb15d in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #9: <unknown function> + 0x10f15d (0x5570999fb15d in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #10: <unknown function> + 0x10f15d (0x5570999fb15d in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #11: <unknown function> + 0x10f15d (0x5570999fb15d in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #12: <unknown function> + 0x10f15d (0x5570999fb15d in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #13: <unknown function> + 0x10f15d (0x5570999fb15d in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #14: PyDict_SetItem + 0x502 (0x557099a50172 in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #15: PyDict_SetItemString + 0x4f (0x557099a50c4f in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #16: PyImport_Cleanup + 0xa0 (0x557099a95760 in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #17: Py_FinalizeEx + 0x67 (0x557099b10817 in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #18: <unknown function> + 0x2373d3 (0x557099b233d3 in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #19: _Py_UnixMain + 0x3c (0x557099b236fc in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #20: __libc_start_main + 0xe7 (0x7f7250c9bb97 in /lib/x86_64-linux-gnu/libc.so.6)
frame #21: <unknown function> + 0x1dc3c0 (0x557099ac83c0 in /home/liuxin/anaconda3/envs/mmlab/bin/python)
Traceback (most recent call last):
File "/home/liuxin/anaconda3/envs/mmlab/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/liuxin/anaconda3/envs/mmlab/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/liuxin/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
main()
File "/home/liuxin/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/home/liuxin/anaconda3/envs/mmlab/bin/python', '-u', 'tools/train.py', '--local_rank=0', 'configs/fcn/fcn_r50-d8_512x512_20k_voc12aug.py', '--launcher', 'pytorch']' died with <Signals.SIGABRT: 6>.
I use my custom dataset (class_num is 17, including background; the original images are RGB; the gt_seg_map is in PIL's 'P' mode).
I solved this problem by converting the label pictures to single-channel maps whose pixel values are the class IDs.
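For reference, a minimal sketch of such a conversion, assuming the masks were saved as RGB color maps; the file names and the PALETTE color table are hypothetical and must be adapted to your dataset:

import numpy as np
from PIL import Image

# Hypothetical color table: list index == class ID; extend to all 17 classes.
PALETTE = [(0, 0, 0), (128, 0, 0), (0, 128, 0)]

rgb = np.array(Image.open('label_rgb.png').convert('RGB'))
label = np.full(rgb.shape[:2], 255, dtype=np.uint8)  # 255 = ignore index
for class_id, color in enumerate(PALETTE):
    label[(rgb == color).all(axis=-1)] = class_id
Image.fromarray(label).save('label.png')  # single-channel map of class IDs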
Hi @UESTC-Liuxin Thanks for the information. I will add some docs about the label format.
Hi, did you add it to the docs? I encountered the same problem.
At the beginning, I ran with 1 sample per GPU and it worked fine.
data = dict(
    samples_per_gpu=1,
    workers_per_gpu=4,
    train=dict(
        type=dataset_type,
        data_root=data_root,
        img_dir='images/training',
        ann_dir='annotations/training',
        pipeline=train_pipeline),
After I changed the setting to 2 samples per GPU, it raised the error.
data = dict(
    samples_per_gpu=2,
    workers_per_gpu=4,
    train=dict(
        type=dataset_type,
        data_root=data_root,
        img_dir='images/training',
        ann_dir='annotations/training',
        pipeline=train_pipeline),
The error:
File "/home/ubuntu/anaconda3/envs/exp/lib/python3.7/site-packages/mmcv/parallel/distributed.py", line 36, in train_step output = self.module.train_step(*inputs[0], *kwargs[0]) File "/home/ubuntu/work/models/model.py", line 153, in train_step loss, log_vars = self._parse_losses(losses) File "/home/ubuntu/work/models/model.py", line 193, in _parse_losses log_vars[loss_name] = loss_value.item() RuntimeError: CUDA error: an illegal memory access was encountered terminate called after throwing an instance of 'c10::Error' what(): CUDA error: an illegal memory access was encountered (insert_events at /opt/conda/conda-bld/pytorch_1591914855613/work/c10/cuda/CUDACachingAllocator.cpp:771) frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x4e (0x7fd854c0fb5e in /home/ubuntu/anaconda3/envs/exp/lib/python3.7/site-packages/torch/lib/libc10.so) frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void) + 0x6d0 (0x7fd854e54e30 in /home/ubuntu/anaconda3/envs/exp/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
Any ideas about that? Thank you very much!
Hi @fangruizhu I created a PR to add docs. Are you using distributed training or not? Which dataset are you using?
Thanks for the quick reply! @xvjiarui Yes, I use distributed training on ADE20K data. I used my own model, while the code is based on 'https://github.com/open-mmlab/mmsegmentation/blob/master/mmseg/models/segmentors/encoder_decoder.py'. I'm quite confused where the error might be, since running with the config 'https://github.com/open-mmlab/mmsegmentation/blob/master/configs/fcn/fcn_r50-d8_512x512_80k_ade20k.py' is okay for me.
Hi @fangruizhu You may check the num_classes of your own model.
Thanks! @xvjiarui It seems quite weird. The num_classes is 150, the same as in the ADE20K general setting. And it works fine with 1 image/GPU under distributed training.
When I trained on a single GPU with multiple samples on it, I got the same error. The error occurs at https://github.com/open-mmlab/mmsegmentation/blob/3e49d0ad7174c062cb4a3d72eb8e71b94d5ba0fd/mmseg/models/segmentors/base.py#L204. The submodules of my model are derived directly from nn.Module and are not registered with register_module. Would that cause problems?
Hi @fangruizhu
You may try whether the original configs provided in the repo work with 2 images/GPU, e.g. pspnet_r50-d8_512x512_160k_ade20k.py.
If so, may I have a look at your config file?
On the other hand, when CrossEntropy gets a mismatched input and label, there can be a CUDA error.
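A quick way to reproduce that failure mode in isolation (a sketch; depending on the PyTorch version it surfaces as a device-side assert or as the illegal-memory-access error above):

import torch
import torch.nn.functional as F

num_classes = 19
logits = torch.randn(2, num_classes, 8, 8, device='cuda')
# 200 is outside [0, num_classes); the CUDA kernel asserts asynchronously, so the
# failure often only shows up later, e.g. at loss.item() in _parse_losses.
target = torch.full((2, 8, 8), 200, dtype=torch.long, device='cuda')
loss = F.cross_entropy(logits, target)
print(loss.item())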
Thanks! @xvjiarui I fixed the error by re-installing the lite version of mmcv, which disables the CUDA compiler in mmcv. I think the error was probably caused by the CUDA environment.
Because we used labelme, and some data annotators wrote a wrong new label, the masks contained 11 classes rather than 10.
Check your labels and the dataset configs if you encounter this error.
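A small check along these lines (a sketch, assuming single-channel PNG masks; adjust the directory and the expected number of classes):

import glob
import numpy as np
from PIL import Image

NUM_CLASSES = 10                          # expected number of classes
valid = set(range(NUM_CLASSES)) | {255}   # 255 = ignore index

for path in glob.glob('SegmentationClass/*.png'):
    ids = set(np.unique(np.array(Image.open(path))).tolist())
    bad = ids - valid
    if bad:
        print(f'{path}: unexpected label ids {sorted(bad)}')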
In my case, I used my custom dataset with the config
python tools/train.py configs/deeplabv3plus/deeplabv3plus_r101-d8_512x512_160k_ade20k.py
The trap was that we have to change num_classes in
model = dict(
    decode_head=dict(num_classes=150),
    auxiliary_head=dict(num_classes=150))
in configs/deeplabv3plus/deeplabv3plus_r50-d8_512x512_160k_ade20k.py, as well as num_classes in configs/_base_/models/deeplabv3plus_r50-d8.py.
The config style differs slightly depending on the dataset type you use. Moreover, the same information (num_classes, in my case) appears in several places across the relevant config files. So we really have to be careful that we have completely modified the configs according to our custom datasets...
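One way to reduce that risk is to keep the override in a single derived config instead of editing the shipped files (a sketch; the file name is hypothetical, and 150 stands in for your dataset's class count):

# my_deeplabv3plus_custom.py (hypothetical file name)
_base_ = './deeplabv3plus_r101-d8_512x512_160k_ade20k.py'

# Override num_classes for both heads in one place; the shipped configs stay untouched.
model = dict(
    decode_head=dict(num_classes=150),
    auxiliary_head=dict(num_classes=150))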
Hello,
I would like to reopen this issue because I have exactly the same problem described above.
I want to run the segmentation training code for Cityscapes as follows with 1 GPU. However, I obtain this error:
(/data/users/agolebiewski/conda-envs/segformer) python tools/train.py local_configs/segformer/B1/segformer.b1.1024x1024.city.160k.py --gpus 1
2023-05-22 11:20:22,493 - mmseg - INFO - Environment info:
------------------------------------------------------------
sys.platform: linux
Python: 3.8.16 | packaged by conda-forge | (default, Feb 1 2023, 16:01:55) [GCC 11.3.0]
CUDA available: True
GPU 0,1,2: Tesla V100-SXM2-32GB
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.7.r11.7/compiler.31294372_0
GCC: gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
PyTorch: 1.6.0
PyTorch compiling details: PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) oneAPI Math Kernel Library Version 2022.1-Product Build 20220311 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v1.5.0 (Git Hash e2ac1fac44c5078ca927cb9b90e1b3066a0b2ed0)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 10.2
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
- CuDNN 7.6.5
- Magma 2.5.2
- Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,
TorchVision: 0.7.0
OpenCV: 4.5.1
MMCV: 1.2.7
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.2
MMSegmentation: 0.11.0+1a8ad51
------------------------------------------------------------
2023-05-22 11:20:22,493 - mmseg - INFO - Distributed training: False
2023-05-22 11:20:22,669 - mmseg - INFO - Config:
norm_cfg = dict(type='BN', requires_grad=True)
find_unused_parameters = True
model = dict(
type='EncoderDecoder',
pretrained='pretrained/mit_b1.pth',
backbone=dict(type='mit_b1', style='pytorch'),
decode_head=dict(
type='SegFormerHead',
in_channels=[64, 128, 320, 512],
in_index=[0, 1, 2, 3],
feature_strides=[4, 8, 16, 32],
channels=128,
dropout_ratio=0.1,
num_classes=19,
norm_cfg=dict(type='BN', requires_grad=True),
align_corners=False,
decoder_params=dict(embed_dim=256),
loss_decode=dict(
type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0)),
train_cfg=dict(),
test_cfg=dict(mode='slide', crop_size=(1024, 1024), stride=(768, 768)))
dataset_type = 'CityscapesDataset'
data_root = 'data/cityscapes/'
img_norm_cfg = dict(
mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
crop_size = (1024, 1024)
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='LoadAnnotations'),
dict(type='Resize', img_scale=(2048, 1024), ratio_range=(0.5, 2.0)),
dict(type='RandomCrop', crop_size=(1024, 1024), cat_max_ratio=0.75),
dict(type='RandomFlip', prob=0.5),
dict(type='PhotoMetricDistortion'),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='Pad', size=(1024, 1024), pad_val=0, seg_pad_val=255),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img', 'gt_semantic_seg'])
]
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(2048, 1024),
flip=False,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='RandomFlip'),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
])
]
data = dict(
samples_per_gpu=2,
workers_per_gpu=2,
train=dict(
type='RepeatDataset',
times=500,
dataset=dict(
type='CityscapesDataset',
data_root='data/cityscapes/',
img_dir='leftImg8bit/train',
ann_dir='gtFine/train',
pipeline=[
dict(type='LoadImageFromFile'),
dict(type='LoadAnnotations'),
dict(
type='Resize',
img_scale=(2048, 1024),
ratio_range=(0.5, 2.0)),
dict(
type='RandomCrop',
crop_size=(1024, 1024),
cat_max_ratio=0.75),
dict(type='RandomFlip', prob=0.5),
dict(type='PhotoMetricDistortion'),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(
type='Pad', size=(1024, 1024), pad_val=0, seg_pad_val=255),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img', 'gt_semantic_seg'])
])),
val=dict(
type='CityscapesDataset',
data_root='data/cityscapes/',
img_dir='leftImg8bit/val',
ann_dir='gtFine/val',
pipeline=[
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(2048, 1024),
flip=False,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='RandomFlip'),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
])
]),
test=dict(
type='CityscapesDataset',
data_root='data/cityscapes/',
img_dir='leftImg8bit/val',
ann_dir='gtFine/val',
pipeline=[
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(2048, 1024),
flip=False,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='RandomFlip'),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
])
]))
log_config = dict(
interval=50,
hooks=[
dict(type='TextLoggerHook', by_epoch=False),
dict(type='TensorboardLoggerHook')
])
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
cudnn_benchmark = True
optimizer = dict(
type='AdamW',
lr=6e-05,
betas=(0.9, 0.999),
weight_decay=0.01,
paramwise_cfg=dict(
custom_keys=dict(
pos_block=dict(decay_mult=0.0),
norm=dict(decay_mult=0.0),
head=dict(lr_mult=10.0))))
optimizer_config = dict()
lr_config = dict(
policy='poly',
warmup='linear',
warmup_iters=1500,
warmup_ratio=1e-06,
power=1.0,
min_lr=0.0,
by_epoch=False)
runner = dict(type='IterBasedRunner', max_iters=160000)
checkpoint_config = dict(by_epoch=False, interval=4000)
evaluation = dict(interval=4000, metric='mIoU')
work_dir = './work_dirs/segformer.b1.1024x1024.city.160k'
gpu_ids = range(0, 1)
2023-05-22 11:20:23,032 - mmseg - INFO - Use load_from_local loader
2023-05-22 11:20:23,076 - mmseg - WARNING - The model and loaded state dict do not match exactly
unexpected key in source state_dict: head.weight, head.bias
2023-05-22 11:20:23,078 - mmseg - INFO - EncoderDecoder(
(backbone): mit_b1(
(patch_embed1): OverlapPatchEmbed(
(proj): Conv2d(3, 64, kernel_size=(7, 7), stride=(4, 4), padding=(3, 3))
(norm): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
)
(patch_embed2): OverlapPatchEmbed(
(proj): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
(norm): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
)
(patch_embed3): OverlapPatchEmbed(
(proj): Conv2d(128, 320, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
(norm): LayerNorm((320,), eps=1e-05, elementwise_affine=True)
)
(patch_embed4): OverlapPatchEmbed(
(proj): Conv2d(320, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
(norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(block1): ModuleList(
(0): Block(
(norm1): LayerNorm((64,), eps=1e-06, elementwise_affine=True)
(attn): Attention(
(q): Linear(in_features=64, out_features=64, bias=True)
(kv): Linear(in_features=64, out_features=128, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=64, out_features=64, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
(sr): Conv2d(64, 64, kernel_size=(8, 8), stride=(8, 8))
(norm): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
)
(drop_path): Identity()
(norm2): LayerNorm((64,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=64, out_features=256, bias=True)
(dwconv): DWConv(
(dwconv): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=256)
)
(act): GELU()
(fc2): Linear(in_features=256, out_features=64, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
)
(1): Block(
(norm1): LayerNorm((64,), eps=1e-06, elementwise_affine=True)
(attn): Attention(
(q): Linear(in_features=64, out_features=64, bias=True)
(kv): Linear(in_features=64, out_features=128, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=64, out_features=64, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
(sr): Conv2d(64, 64, kernel_size=(8, 8), stride=(8, 8))
(norm): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
)
(drop_path): DropPath()
(norm2): LayerNorm((64,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=64, out_features=256, bias=True)
(dwconv): DWConv(
(dwconv): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=256)
)
(act): GELU()
(fc2): Linear(in_features=256, out_features=64, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
)
)
(norm1): LayerNorm((64,), eps=1e-06, elementwise_affine=True)
(block2): ModuleList(
(0): Block(
(norm1): LayerNorm((128,), eps=1e-06, elementwise_affine=True)
(attn): Attention(
(q): Linear(in_features=128, out_features=128, bias=True)
(kv): Linear(in_features=128, out_features=256, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=128, out_features=128, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
(sr): Conv2d(128, 128, kernel_size=(4, 4), stride=(4, 4))
(norm): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
)
(drop_path): DropPath()
(norm2): LayerNorm((128,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=128, out_features=512, bias=True)
(dwconv): DWConv(
(dwconv): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=512)
)
(act): GELU()
(fc2): Linear(in_features=512, out_features=128, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
)
(1): Block(
(norm1): LayerNorm((128,), eps=1e-06, elementwise_affine=True)
(attn): Attention(
(q): Linear(in_features=128, out_features=128, bias=True)
(kv): Linear(in_features=128, out_features=256, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=128, out_features=128, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
(sr): Conv2d(128, 128, kernel_size=(4, 4), stride=(4, 4))
(norm): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
)
(drop_path): DropPath()
(norm2): LayerNorm((128,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=128, out_features=512, bias=True)
(dwconv): DWConv(
(dwconv): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=512)
)
(act): GELU()
(fc2): Linear(in_features=512, out_features=128, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
)
)
(norm2): LayerNorm((128,), eps=1e-06, elementwise_affine=True)
(block3): ModuleList(
(0): Block(
(norm1): LayerNorm((320,), eps=1e-06, elementwise_affine=True)
(attn): Attention(
(q): Linear(in_features=320, out_features=320, bias=True)
(kv): Linear(in_features=320, out_features=640, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=320, out_features=320, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
(sr): Conv2d(320, 320, kernel_size=(2, 2), stride=(2, 2))
(norm): LayerNorm((320,), eps=1e-05, elementwise_affine=True)
)
(drop_path): DropPath()
(norm2): LayerNorm((320,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=320, out_features=1280, bias=True)
(dwconv): DWConv(
(dwconv): Conv2d(1280, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=1280)
)
(act): GELU()
(fc2): Linear(in_features=1280, out_features=320, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
)
(1): Block(
(norm1): LayerNorm((320,), eps=1e-06, elementwise_affine=True)
(attn): Attention(
(q): Linear(in_features=320, out_features=320, bias=True)
(kv): Linear(in_features=320, out_features=640, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=320, out_features=320, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
(sr): Conv2d(320, 320, kernel_size=(2, 2), stride=(2, 2))
(norm): LayerNorm((320,), eps=1e-05, elementwise_affine=True)
)
(drop_path): DropPath()
(norm2): LayerNorm((320,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=320, out_features=1280, bias=True)
(dwconv): DWConv(
(dwconv): Conv2d(1280, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=1280)
)
(act): GELU()
(fc2): Linear(in_features=1280, out_features=320, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
)
)
(norm3): LayerNorm((320,), eps=1e-06, elementwise_affine=True)
(block4): ModuleList(
(0): Block(
(norm1): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
(attn): Attention(
(q): Linear(in_features=512, out_features=512, bias=True)
(kv): Linear(in_features=512, out_features=1024, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=512, out_features=512, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
)
(drop_path): DropPath()
(norm2): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(dwconv): DWConv(
(dwconv): Conv2d(2048, 2048, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=2048)
)
(act): GELU()
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
)
(1): Block(
(norm1): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
(attn): Attention(
(q): Linear(in_features=512, out_features=512, bias=True)
(kv): Linear(in_features=512, out_features=1024, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=512, out_features=512, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
)
(drop_path): DropPath()
(norm2): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=512, out_features=2048, bias=True)
(dwconv): DWConv(
(dwconv): Conv2d(2048, 2048, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=2048)
)
(act): GELU()
(fc2): Linear(in_features=2048, out_features=512, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
)
)
(norm4): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
)
(decode_head): SegFormerHead(
input_transform=multiple_select, ignore_index=255, align_corners=False
(loss_decode): CrossEntropyLoss()
(conv_seg): Conv2d(128, 19, kernel_size=(1, 1), stride=(1, 1))
(dropout): Dropout2d(p=0.1, inplace=False)
(linear_c4): MLP(
(proj): Linear(in_features=512, out_features=256, bias=True)
)
(linear_c3): MLP(
(proj): Linear(in_features=320, out_features=256, bias=True)
)
(linear_c2): MLP(
(proj): Linear(in_features=128, out_features=256, bias=True)
)
(linear_c1): MLP(
(proj): Linear(in_features=64, out_features=256, bias=True)
)
(linear_fuse): ConvModule(
(conv): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(activate): ReLU(inplace=True)
)
(linear_pred): Conv2d(256, 19, kernel_size=(1, 1), stride=(1, 1))
)
)
2023-05-22 11:20:23,130 - mmseg - INFO - Loaded 2975 images
2023-05-22 11:20:25,113 - mmseg - INFO - Loaded 500 images
2023-05-22 11:20:25,113 - mmseg - INFO - Start running, host: d624032@rosetta-c4140gpu02, work_dir: /gpfs_new/scratch/users/agolebiewski/SegFormer/work_dirs/segformer.b1.1024x1024.city.160k
2023-05-22 11:20:25,113 - mmseg - INFO - workflow: [('train', 1)], max: 160000 iters
[W TensorIterator.cpp:924] Warning: Mixed memory format inputs detected while calling the operator. The operator will output channels_last tensor even if some of the inputs are not in channels_last format. (function operator())
Traceback (most recent call last):
File "tools/train.py", line 166, in <module>
main()
File "tools/train.py", line 155, in main
train_segmentor(
File "/gpfs_new/scratch/users/agolebiewski/SegFormer/mmseg/apis/train.py", line 115, in train_segmentor
runner.run(data_loaders, cfg.workflow)
File "/data/users/agolebiewski/conda-envs/segformer/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 131, in run
iter_runner(iter_loaders[i], **kwargs)
File "/data/users/agolebiewski/conda-envs/segformer/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 60, in train
outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
File "/data/users/agolebiewski/conda-envs/segformer/lib/python3.8/site-packages/mmcv/parallel/data_parallel.py", line 67, in train_step
return self.module.train_step(*inputs[0], **kwargs[0])
File "/gpfs_new/scratch/users/agolebiewski/SegFormer/mmseg/models/segmentors/base.py", line 153, in train_step
loss, log_vars = self._parse_losses(losses)
File "/gpfs_new/scratch/users/agolebiewski/SegFormer/mmseg/models/segmentors/base.py", line 204, in _parse_losses
log_vars[loss_name] = loss_value.item()
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1595629395347/work/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7feaf8dd377d in /data/users/agolebiewski/conda-envs/segformer/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xb5d (0x7feaf9023d9d in /data/users/agolebiewski/conda-envs/segformer/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7feaf8dbfb1d in /data/users/agolebiewski/conda-envs/segformer/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x53956b (0x7feb3693156b in /data/users/agolebiewski/conda-envs/segformer/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #21: __libc_start_main + 0xf5 (0x7feb63b7a555 in /lib64/libc.so.6)
Aborted
The error occurs at the same location (as in the discussion with @xvjiarui above).
I attempted to follow the comments from the other issues,
but I always obtain this CUDA error: an illegal memory access was encountered
...
From the log_vars[loss_name] = loss_value.item() line, I think this happens because of wrong label indices introduced during the dataset augmentation process.
Check your labels:
For a custom dataset that does not ignore the background, the padding step should use a dedicated index for the padded elements, in case it conflicts with the IDs that are counted in your loss function.
For example, a very common reason for this error with custom datasets is setting the wrong value for the padded elements; it can be solved by changing dict(type='Pad', size=crop_size, pad_val=0, seg_pad_val=255) to dict(type='Pad', size=crop_size, pad_val=0, seg_pad_val=-100).
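For context, both values can work because the loss skips its ignore_index: 255 is the mmseg default, while -100 is the raw PyTorch default. A minimal sketch:

import torch
import torch.nn.functional as F

logits = torch.randn(1, 10, 4, 4)
target = torch.full((1, 4, 4), 255, dtype=torch.long)  # padded pixels
target[:, :2] = 3                                      # some real class pixels

loss = F.cross_entropy(logits, target, ignore_index=255)  # fine: 255 is skipped
# F.cross_entropy(logits, target)  # fails: 255 is treated as a class index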
The error was encountered during the training process with these configs:
The script takes approximately 4-5 GB of the 11 GB of GPU memory available and returns this error:
ERROR
But if I reduce the image size by half with the same number of images per GPU (2), the script takes approximately 2 GB of GPU memory and everything works fine. I also want to add that, using another PyTorch script with my own DataLoader, I am able to fill the GPU completely (11 GB) during training with the same Torch version and the same hardware.