open-mmlab / mmsegmentation

OpenMMLab Semantic Segmentation Toolbox and Benchmark.
https://mmsegmentation.readthedocs.io/en/main/
Apache License 2.0

Training ADE20K using ./dist_train.py won't allow me to evaluate mid-training #2235

AndrewTKent opened this issue 1 year ago

AndrewTKent commented 1 year ago
  1. I have searched related issues but cannot get the expected help. Yes
  2. The bug has not been fixed in the latest version. Yes

Describe the bug

As the title suggests, when training via dist_train.sh I cannot get past the evaluation stage of the training process, although I can when running the model on a single GPU. This error pops up:

FileNotFoundError: [Errno 2] No such file or directory: '.dist_test/tmpm3f7a8xg/part_1.pkl'

Here is the full output:

2022-10-26 20:56:24,265 - mmseg - INFO - Iter [190/160000] lr: 1.807e-07, eta: 2 days, 8:58:32, time: 1.229, data_time: 0.008, memory: 15900, decode.loss_cls: 5.1733, decode.loss_mask: 2.3283, decode.loss_dice: 4.2889, decode.d0.loss_cls: 10.3644, decode.d0.loss_mask: 1.8888, decode.d0.loss_dice: 3.6292, decode.d1.loss_cls: 5.3732, decode.d1.loss_mask: 1.9898, decode.d1.loss_dice: 3.6457, decode.d2.loss_cls: 5.0473, decode.d2.loss_mask: 1.9692, decode.d2.loss_dice: 3.7876, decode.d3.loss_cls: 5.1667, decode.d3.loss_mask: 1.9582, decode.d3.loss_dice: 3.9707, decode.d4.loss_cls: 5.1105, decode.d4.loss_mask: 2.0531, decode.d4.loss_dice: 4.0650, decode.d5.loss_cls: 4.9984, decode.d5.loss_mask: 2.2159, decode.d5.loss_dice: 4.0660, decode.d6.loss_cls: 4.9207, decode.d6.loss_mask: 2.2972, decode.d6.loss_dice: 4.1253, decode.d7.loss_cls: 4.8610, decode.d7.loss_mask: 2.2477, decode.d7.loss_dice: 4.1793, decode.d8.loss_cls: 4.9706, decode.d8.loss_mask: 2.3507, decode.d8.loss_dice: 4.2628, loss: 117.3053

2022-10-26 20:56:37,293 - mmseg - INFO - Iter [195/160000] lr: 1.855e-07, eta: 2 days, 8:54:33, time: 1.227, data_time: 0.008, memory: 15900, decode.loss_cls: 4.7057, decode.loss_mask: 2.3921, decode.loss_dice: 4.3334, decode.d0.loss_cls: 10.3254, decode.d0.loss_mask: 1.9261, decode.d0.loss_dice: 3.6466, decode.d1.loss_cls: 4.9439, decode.d1.loss_mask: 1.9303, decode.d1.loss_dice: 3.7556, decode.d2.loss_cls: 4.6303, decode.d2.loss_mask: 1.9778, decode.d2.loss_dice: 3.7929, decode.d3.loss_cls: 4.7008, decode.d3.loss_mask: 2.0036, decode.d3.loss_dice: 3.8613, decode.d4.loss_cls: 4.7067, decode.d4.loss_mask: 2.1614, decode.d4.loss_dice: 3.9415, decode.d5.loss_cls: 4.5411, decode.d5.loss_mask: 2.4803, decode.d5.loss_dice: 4.0049, decode.d6.loss_cls: 4.5642, decode.d6.loss_mask: 2.5168, decode.d6.loss_dice: 4.0441, decode.d7.loss_cls: 4.5295, decode.d7.loss_mask: 2.4627, decode.d7.loss_dice: 4.0911, decode.d8.loss_cls: 4.6387, decode.d8.loss_mask: 2.3198, decode.d8.loss_dice: 4.2091, loss: 114.1377
2022-10-26 20:56:50,328 - mmseg - INFO - Saving checkpoint at 200 iterations
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 10/10, 0.1 task/s, elapsed: 116s, ETA: 0s
Traceback (most recent call last):
  File "./train.py", line 224, in <module>
    main()
  File "./train.py", line 213, in main
    train_segmentor(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/mmseg/apis/train.py", line 167, in train_segmentor
    runner.run(data_loaders, cfg.workflow)
  File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/iter_based_runner.py", line 134, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/iter_based_runner.py", line 67, in train
    self.call_hook('after_train_iter')
  File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/base_runner.py", line 309, in call_hook
    getattr(hook, fn_name)(self)
  File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/hooks/evaluation.py", line 259, in after_train_iter
    hook.after_train_iter(runner)
  File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/dist_utils.py", line 129, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/progress_tracking/ViT-Adapter/mmseg_custom/core/hook/wandblogger_hook.py", line 206, in after_train_iter
    results = self.test_fn(runner.model, self.eval_hook.dataloader)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/mmseg/apis/test.py", line 232, in multi_gpu_test
    results = collect_results_cpu(results, len(dataset), tmpdir)
  File "/usr/local/lib/python3.8/dist-packages/mmcv/engine/test.py", line 139, in collect_results_cpu
    part_result = mmcv.load(part_file)
  File "/usr/local/lib/python3.8/dist-packages/mmcv/fileio/io.py", line 60, in load
    with BytesIO(file_client.get(file)) as f:
  File "/usr/local/lib/python3.8/dist-packages/mmcv/fileio/file_client.py", line 993, in get
    return self.client.get(filepath)
  File "/usr/local/lib/python3.8/dist-packages/mmcv/fileio/file_client.py", line 518, in get
    with open(filepath, 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: '.dist_test/tmpm3f7a8xg/part_1.pkl'
Traceback (most recent call last):

Reproduction

  1. What command or script did you run?

CUDA_VISIBLE_DEVICES=0,1 PORT=29500 ./dist_train.sh configs/ade20k/mask2former_beit_adapter_large_640_160k_ade20k_ms.py 2

  2. Did you make any modifications on the code or config? Did you understand what you have modified?

No.

Here is the config I'm using:

_base_ = [
    '../_base_/models/mask2former_beit.py',
    '../_base_/datasets/ade20k.py',
    '../_base_/default_runtime.py',
    '../_base_/schedules/schedule_160k.py'
]
crop_size = (640, 640)
pretrained = 'https://conversationhub.blob.core.windows.net/beit-share-public/beit/beit_large_patch16_224_pt22k_ft22k.pth'
model = dict(
    pretrained=pretrained,
    backbone=dict(
        type='BEiTAdapter',
        img_size=640, patch_size=16, embed_dim=1024, depth=24, num_heads=16,
        mlp_ratio=4, qkv_bias=True, use_abs_pos_emb=False, use_rel_pos_bias=True,
        init_values=1e-6, drop_path_rate=0.3, conv_inplane=64, n_points=4,
        deform_num_heads=16, cffn_ratio=0.25, deform_ratio=0.5,
        with_cp=True,  # set with_cp=True to save memory
        interaction_indexes=[[0, 5], [6, 11], [12, 17], [18, 23]],
    ),
    decode_head=dict(
        in_channels=[1024, 1024, 1024, 1024],
        feat_channels=1024,
        out_channels=1024,
        num_queries=100,
        pixel_decoder=dict(
            type='MSDeformAttnPixelDecoder',
            num_outs=3,
            norm_cfg=dict(type='GN', num_groups=32),
            act_cfg=dict(type='ReLU'),
            encoder=dict(
                type='DetrTransformerEncoder',
                num_layers=6,
                transformerlayers=dict(
                    type='BaseTransformerLayer',
                    attn_cfgs=dict(
                        type='MultiScaleDeformableAttention',
                        embed_dims=1024, num_heads=32, num_levels=3,
                        num_points=4, im2col_step=64, dropout=0.0,
                        batch_first=False, norm_cfg=None, init_cfg=None),
                    ffn_cfgs=dict(
                        type='FFN',
                        embed_dims=1024, feedforward_channels=4096, num_fcs=2,
                        ffn_drop=0.0,
                        with_cp=True,  # set with_cp=True to save memory
                        act_cfg=dict(type='ReLU', inplace=True)),
                    operation_order=('self_attn', 'norm', 'ffn', 'norm')),
                init_cfg=None),
            positional_encoding=dict(
                type='SinePositionalEncoding', num_feats=512, normalize=True),
            init_cfg=None),
        positional_encoding=dict(
            type='SinePositionalEncoding', num_feats=512, normalize=True),
        transformer_decoder=dict(
            type='DetrTransformerDecoder',
            return_intermediate=True,
            num_layers=9,
            transformerlayers=dict(
                type='DetrTransformerDecoderLayer',
                attn_cfgs=dict(
                    type='MultiheadAttention',
                    embed_dims=1024, num_heads=32, attn_drop=0.0,
                    proj_drop=0.0, dropout_layer=None, batch_first=False),
                ffn_cfgs=dict(
                    embed_dims=1024, feedforward_channels=4096, num_fcs=2,
                    act_cfg=dict(type='ReLU', inplace=True),
                    ffn_drop=0.0, dropout_layer=None,
                    with_cp=True,  # set with_cp=True to save memory
                    add_identity=True),
                feedforward_channels=4096,
                operation_order=('cross_attn', 'norm', 'self_attn', 'norm',
                                 'ffn', 'norm')),
            init_cfg=None)),
    test_cfg=dict(mode='slide', crop_size=crop_size, stride=(426, 426)))
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', reduce_zero_label=True),
    dict(type='Resize', img_scale=(2048, 640), ratio_range=(0.5, 2.0)),
    dict(type='RandomCrop', crop_size=crop_size, cat_max_ratio=0.75),
    dict(type='RandomFlip', prob=0.5),
    dict(type='PhotoMetricDistortion'),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size=crop_size, pad_val=0, seg_pad_val=255),
    dict(type='ToMask'),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_semantic_seg', 'gt_masks', 'gt_labels'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(2048, 640),
        img_ratios=[0.5, 0.75, 1.0, 1.25, 1.5, 1.75],
        flip=True,
        transforms=[
            dict(type='SETR_Resize', keep_ratio=True,
                 crop_size=crop_size, setr_multi_scale=True),
            dict(type='RandomFlip'),
            dict(type='Normalize', **img_norm_cfg),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]
optimizer = dict(
    _delete_=True, type='AdamW', lr=2e-5, betas=(0.9, 0.999), weight_decay=0.05,
    constructor='LayerDecayOptimizerConstructor',
    paramwise_cfg=dict(num_layers=24, layer_decay_rate=0.90))
lr_config = dict(
    _delete_=True, policy='poly', warmup='linear', warmup_iters=1500,
    warmup_ratio=1e-6, power=1.0, min_lr=0.0, by_epoch=False)

log_config = dict(
    interval=5,
    hooks=[
        dict(type='MMSegWandbHook',
             init_kwargs={
                 'entity': "nexterarobotics",
                 'project': "Progress_Tracking_V1",
                 'name': "mask2former_beit_adapter_large_896_80k_ade20k_ss_V0.1"},
             by_epoch=False,
             num_eval_images=2),
        dict(type='TextLoggerHook', by_epoch=False),
    ])

data = dict(
    samples_per_gpu=1,
    train=dict(pipeline=train_pipeline),
    val=dict(pipeline=test_pipeline),
    test=dict(pipeline=test_pipeline))
runner = dict(type='IterBasedRunner')
checkpoint_config = dict(by_epoch=False, interval=100, max_keep_ckpts=1)
evaluation = dict(interval=200, metric='mIoU', save_best='mIoU')

  3. What dataset did you use?

ADE20K

  4. Please run python mmseg/utils/collect_env.py to collect necessary environment information and paste it here.

python3 collect_env.py
sys.platform: linux
Python: 3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0]
CUDA available: True
GPU 0,1,2,3: NVIDIA A10G
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.6, V11.6.124
GCC: x86_64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 1.9.0+cu111
PyTorch compiling details: PyTorch built with:

TorchVision: 0.10.0+cu111
OpenCV: 4.6.0
MMCV: 1.5.0
MMCV Compiler: GCC 9.4
MMCV CUDA Compiler: 11.6
MMSegmentation: 0.20.2+ad38cbe

MeowZheng commented 1 year ago

Based on the error log:

File "/home/ubuntu/.local/lib/python3.8/site-packages/mmseg/apis/test.py", line 232, in multi_gpu_test
results = collect_results_cpu(results, len(dataset), tmpdir)
File "/usr/local/lib/python3.8/dist-packages/mmcv/engine/test.py", line 139, in collect_results_cpu
part_result = mmcv.load(part_file)
File "/usr/local/lib/python3.8/dist-packages/mmcv/fileio/io.py", line 60, in load
with BytesIO(file_client.get(file)) as f:
File "/usr/local/lib/python3.8/dist-packages/mmcv/fileio/file_client.py", line 993, in get
return self.client.get(filepath)
File "/usr/local/lib/python3.8/dist-packages/mmcv/fileio/file_client.py", line 518, in get
with open(filepath, 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: '.dist_test/tmpm3f7a8xg/part_1.pkl'
Traceback (most recent call last):

I think there might be some problem when collecting results from all ranks, and you might try to define a sync folder to collect the results, like:

evaluation = dict(interval=200, metric='mIoU', save_best='mIoU', tmpdir=SYNCFOLDER)

If you still meet this error, please try to collect the results on GPU instead, like:

evaluation = dict(interval=200, metric='mIoU', save_best='mIoU', gpu_collect=True)
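
For context, a simplified, paraphrased sketch of the CPU collection step that appears in the traceback (not the exact mmcv source): every rank dumps its partial results into tmpdir as part_<rank>.pkl, then rank 0 loads all the parts and merges them. The error above means rank 0 could not find the file written by rank 1, for example because the ranks did not agree on (or could not all see) the same temporary folder.

# Simplified sketch of mmcv-style CPU result collection (paraphrased, not the
# exact mmcv implementation); it shows why a tmpdir that is not visible to
# every rank ends in "No such file or directory: ... part_1.pkl".
import os.path as osp
import pickle

import torch.distributed as dist


def collect_results_cpu_sketch(result_part, size, tmpdir):
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # 1. every rank writes its share of the results into the shared tmpdir
    with open(osp.join(tmpdir, f'part_{rank}.pkl'), 'wb') as f:
        pickle.dump(result_part, f)
    dist.barrier()  # wait until every part file has been written

    if rank != 0:
        return None

    # 2. rank 0 reads every part back; this is the point that raises
    #    FileNotFoundError when part_1.pkl is not where rank 0 expects it
    part_list = []
    for i in range(world_size):
        with open(osp.join(tmpdir, f'part_{i}.pkl'), 'rb') as f:
            part_list.append(pickle.load(f))

    # 3. interleave the parts back into dataset order and trim any padding
    ordered_results = [res for group in zip(*part_list) for res in group]
    return ordered_results[:size]
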
AndrewTKent commented 1 year ago

@MeowZheng Okay, gotcha. I have some models training right now on my GPUs, so I'll give this a shot in a day or two. Thank you very much for getting back to me.

AndrewTKent commented 1 year ago

@MeowZheng So I actually found that the issue stemmed from using MMSegWandbHook: when I didn't have it as one of my hooks, the run had no issues evaluating. I'm going to run some more tests and see if maybe I just shouldn't try to log during the same iteration as the evaluation. A minimal sketch of that workaround follows.
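
A minimal sketch of the workaround, reusing the log_config posted earlier in this issue: keep only the text logger and leave the MMSegWandbHook entry out (or commented) while debugging the distributed evaluation.

# Minimal sketch of the workaround: the same log_config as above, but with
# MMSegWandbHook removed so only the regular EvalHook evaluation runs.
log_config = dict(
    interval=5,
    hooks=[
        dict(type='TextLoggerHook', by_epoch=False),
        # dict(type='MMSegWandbHook', init_kwargs={...}, by_epoch=False,
        #      num_eval_images=2),  # disabled while debugging the eval crash
    ])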

MeowZheng commented 1 year ago

When using MMSegWandbHook, the evaluation will run twice; the second pass is for wandb to log the evaluation results.
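
A paraphrased sketch of what that second pass looks like, based on the wandblogger_hook.py frame in the traceback above; the constructor arguments and the _log_predictions helper are illustrative placeholders, not the real hook's API.

# Illustrative stand-in for the custom wandb logger hook in the traceback;
# names other than after_train_iter, test_fn and eval_hook are placeholders.
class WandbLoggerHookSketch:

    def __init__(self, test_fn, eval_hook, interval):
        self.test_fn = test_fn        # e.g. mmseg.apis.multi_gpu_test
        self.eval_hook = eval_hook    # the regular (Dist)EvalHook
        self.interval = interval      # how often to log eval results to wandb

    def after_train_iter(self, runner):
        if (runner.iter + 1) % self.interval == 0:
            # a second inference pass over the val dataloader, on top of the
            # EvalHook's own evaluation, so predictions can be logged to wandb
            results = self.test_fn(runner.model, self.eval_hook.dataloader)
            self._log_predictions(runner, results)

    def _log_predictions(self, runner, results):
        # placeholder: the real hook turns results into wandb tables/images
        pass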

I suggest we discuss all problems with wandb hook usage in #2236, as I have invited an engineer from wandb to look into it.