When training a VID model on multiple GPUs, validation stalls on the last few images (no update after waiting an hour). #381

Closed FarranYang closed 2 years ago

FarranYang commented 2 years ago


2021-12-26 16:04:19,060 - mmtrack - INFO - Environment info:

sys.platform: linux
Python: 3.8.12 (default, Oct 12 2021, 13:49:34) [GCC 7.5.0]
CUDA available: True
GPU 0,1,2: GeForce RTX 2080 Ti
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.0, V10.0.130
GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
PyTorch: 1.5.0
PyTorch compiling details: PyTorch built with:

TorchVision: 0.6.0a0+82fd1c8
OpenCV: 4.5.4
MMCV: 1.4.1
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.1
MMTracking: 0.8.0+

2021-12-26 16:04:19,061 - mmtrack - INFO - Distributed training: True 2021-12-26 16:04:19,761 - mmtrack - INFO - Config: model = dict( detector=dict( type='FasterRCNN', backbone=dict( type='ResNet', depth=50, num_stages=4, out_indices=(3, ), strides=(1, 2, 2, 1), dilations=(1, 1, 1, 2), frozen_stages=1, norm_cfg=dict(type='BN', requires_grad=True), norm_eval=True, style='pytorch'), neck=dict( type='ChannelMapper', in_channels=[2048], out_channels=512, kernel_size=3), rpn_head=dict( type='RPNHead', in_channels=512, feat_channels=512, anchor_generator=dict( type='AnchorGenerator', scales=[4, 8, 16, 32], ratios=[0.5, 1.0, 2.0], strides=[16]), bbox_coder=dict( type='DeltaXYWHBBoxCoder', target_means=[0.0, 0.0, 0.0, 0.0], target_stds=[1.0, 1.0, 1.0, 1.0]), loss_cls=dict( type='CrossEntropyLoss', use_sigmoid=True, loss_weight=1.0), loss_bbox=dict( type='SmoothL1Loss', beta=0.1111111111111111, loss_weight=1.0)), roi_head=dict( type='SelsaRoIHead', bbox_roi_extractor=dict( type='TemporalRoIAlign', roi_layer=dict( type='RoIAlign', output_size=7, sampling_ratio=2), out_channels=512, featmap_strides=[16], num_most_similar_points=2, num_temporal_attention_blocks=4), bbox_head=dict( type='SelsaBBoxHead', in_channels=512, fc_out_channels=1024, roi_feat_size=7, num_classes=30, bbox_coder=dict( type='DeltaXYWHBBoxCoder', target_means=[0.0, 0.0, 0.0, 0.0], target_stds=[0.2, 0.2, 0.2, 0.2]), reg_class_agnostic=False, loss_cls=dict( type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0), loss_bbox=dict(type='SmoothL1Loss', beta=1.0, loss_weight=1.0), num_shared_fcs=3, aggregator=dict( type='SelsaAggregator', in_channels=1024, num_attention_blocks=16))), train_cfg=dict( rpn=dict( assigner=dict( type='MaxIoUAssigner', pos_iou_thr=0.7, neg_iou_thr=0.3, min_pos_iou=0.3, ignore_iof_thr=-1), sampler=dict( type='RandomSampler', num=256, pos_fraction=0.5, neg_pos_ub=-1, add_gt_as_proposals=False), allowed_border=0, pos_weight=-1, debug=False), rpn_proposal=dict( nms_pre=6000, 
max_per_img=600, nms=dict(type='nms', iou_threshold=0.7), min_bbox_size=0), rcnn=dict( assigner=dict( type='MaxIoUAssigner', pos_iou_thr=0.5, neg_iou_thr=0.5, min_pos_iou=0.5, ignore_iof_thr=-1), sampler=dict( type='RandomSampler', num=256, pos_fraction=0.25, neg_pos_ub=-1, add_gt_as_proposals=True), pos_weight=-1, debug=False)), test_cfg=dict( rpn=dict( nms_pre=6000, max_per_img=300, nms=dict(type='nms', iou_threshold=0.7), min_bbox_size=0), rcnn=dict( score_thr=0.0001, nms=dict(type='nms', iou_threshold=0.5), max_per_img=100))), type='SELSA') dataset_type = 'ImagenetVIDDataset' data_root = 'data/FALD_VID/' img_norm_cfg = dict( mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True) train_pipeline = [ dict(type='LoadMultiImagesFromFile'), dict(type='SeqLoadAnnotations', with_bbox=True, with_track=True), dict(type='SeqResize', img_scale=(1000, 600), keep_ratio=True), dict(type='SeqRandomFlip', share_params=True, flip_ratio=0.5), dict( type='SeqNormalize', mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True), dict(type='SeqPad', size_divisor=16), dict( type='VideoCollect', keys=['img', 'gt_bboxes', 'gt_labels', 'gt_instance_ids']), dict(type='ConcatVideoReferences'), dict(type='SeqDefaultFormatBundle', ref_prefix='ref') ] test_pipeline = [ dict(type='LoadMultiImagesFromFile'), dict(type='SeqResize', img_scale=(1000, 600), keep_ratio=True), dict(type='SeqRandomFlip', share_params=True, flip_ratio=0.0), dict( type='SeqNormalize', mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True), dict(type='SeqPad', size_divisor=16), dict( type='VideoCollect', keys=['img'], meta_keys=('num_left_ref_imgs', 'frame_stride')), dict(type='ConcatVideoReferences'), dict(type='MultiImagesToTensor', ref_prefix='ref'), dict(type='ToList') ] data = dict( samples_per_gpu=1, workers_per_gpu=2, train=dict( type='ImagenetVIDDataset', ann_file= 'data/FALD_VID/COCOVIDannotations/imagenet_vid_train_every10frames.json', 
img_prefix='data/FALD_VID/Data/VID', ref_img_sampler=dict( num_ref_imgs=2, frame_range=9, filter_key_img=False, method='bilateral_uniform'), pipeline=[ dict(type='LoadMultiImagesFromFile'), dict(type='SeqLoadAnnotations', with_bbox=True, with_track=True), dict(type='SeqResize', img_scale=(1000, 600), keep_ratio=True), dict(type='SeqRandomFlip', share_params=True, flip_ratio=0.5), dict( type='SeqNormalize', mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True), dict(type='SeqPad', size_divisor=16), dict( type='VideoCollect', keys=['img', 'gt_bboxes', 'gt_labels', 'gt_instance_ids']), dict(type='ConcatVideoReferences'), dict(type='SeqDefaultFormatBundle', ref_prefix='ref') ]), val=dict( type='ImagenetVIDDataset', ann_file='data/FALD_VID/annotations/imagenet_vid_val.json', img_prefix='data/FALD_VID/Data/VID', ref_img_sampler=dict( num_ref_imgs=14, frame_range=[-7, 7], method='test_with_adaptive_stride'), pipeline=[ dict(type='LoadMultiImagesFromFile'), dict(type='SeqResize', img_scale=(1000, 600), keep_ratio=True), dict(type='SeqRandomFlip', share_params=True, flip_ratio=0.0), dict( type='SeqNormalize', mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True), dict(type='SeqPad', size_divisor=16), dict( type='VideoCollect', keys=['img'], meta_keys=('num_left_ref_imgs', 'frame_stride')), dict(type='ConcatVideoReferences'), dict(type='MultiImagesToTensor', ref_prefix='ref'), dict(type='ToList') ], test_mode=True), test=dict( type='ImagenetVIDDataset', ann_file='data/FALD_VID/annotations/imagenet_vid_val.json', img_prefix='data/FALD_VID/Data/VID', ref_img_sampler=dict( num_ref_imgs=14, frame_range=[-7, 7], method='test_with_adaptive_stride'), pipeline=[ dict(type='LoadMultiImagesFromFile'), dict(type='SeqResize', img_scale=(1000, 600), keep_ratio=True), dict(type='SeqRandomFlip', share_params=True, flip_ratio=0.0), dict( type='SeqNormalize', mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True), dict(type='SeqPad', 
size_divisor=16), dict( type='VideoCollect', keys=['img'], meta_keys=('num_left_ref_imgs', 'frame_stride')), dict(type='ConcatVideoReferences'), dict(type='MultiImagesToTensor', ref_prefix='ref'), dict(type='ToList') ], test_mode=True)) optimizer = dict(type='SGD', lr=0.005, momentum=0.9, weight_decay=0.0001) optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2)) checkpoint_config = dict(interval=1) log_config = dict(interval=50, hooks=[dict(type='TextLoggerHook')]) dist_params = dict(backend='nccl') log_level = 'INFO' load_from = None resume_from = None workflow = [('train', 1)] lr_config = dict( policy='step', warmup='linear', warmup_iters=500, warmup_ratio=0.3333333333333333, step=[2, 5]) total_epochs = 4 evaluation = dict(metric=['bbox'], interval=4) work_dir = './work_dirs/20211226_001_try3/' gpu_ids = range(0, 1)

2021-12-26 16:04:24,438 - mmtrack - INFO - Set random seed to 2034425034, deterministic: False
2021-12-26 16:04:25,201 - mmtrack - INFO - initialize ResNet with init_cfg [{'type': 'Kaiming', 'layer': 'Conv2d'}, {'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm', 'GroupNorm']}]
2021-12-26 16:04:25,466 - mmtrack - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-12-26 16:04:25,467 - mmtrack - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-12-26 16:04:25,468 - mmtrack - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-12-26 16:04:25,470 - mmtrack - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-12-26 16:04:25,471 - mmtrack - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-12-26 16:04:25,472 - mmtrack - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-12-26 16:04:25,473 - mmtrack - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-12-26 16:04:25,475 - mmtrack - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-12-26 16:04:25,477 - mmtrack - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-12-26 16:04:25,479 - mmtrack - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-12-26 16:04:25,481 - mmtrack - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-12-26 16:04:25,482 - mmtrack - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-12-26 16:04:25,484 - mmtrack - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-12-26 16:04:25,490 - mmtrack - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-12-26 16:04:25,496 - mmtrack - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-12-26 16:04:25,500 - mmtrack - INFO - initialize Bottleneck with init_cfg {'type': 'Constant', 'val': 0, 'override': {'name': 'norm3'}}
2021-12-26 16:04:25,523 - mmtrack - INFO - initialize ChannelMapper with init_cfg {'type': 'Xavier', 'layer': 'Conv2d', 'distribution': 'uniform'}
2021-12-26 16:04:25,583 - mmtrack - INFO - initialize RPNHead with init_cfg {'type': 'Normal', 'layer': 'Conv2d', 'std': 0.01}
2021-12-26 16:04:25,637 - mmtrack - INFO - initialize SelsaBBoxHead with init_cfg [{'type': 'Normal', 'std': 0.01, 'override': {'name': 'fc_cls'}}, {'type': 'Normal', 'std': 0.001, 'override': {'name': 'fc_reg'}}, {'type': 'Xavier', 'distribution': 'uniform', 'override': [{'name': 'shared_fcs'}, {'name': 'cls_fcs'}, {'name': 'reg_fcs'}]}]
Name of parameter - Initialization information

2021-12-26 16:04:28,460 - mmtrack - INFO - Start running, host: user-lbyjh@admin.cluster.local, work_dir: /data/yangjiahui/VIDProject/mmtracking/work_dirs/20211226_001_try3
2021-12-26 16:04:28,461 - mmtrack - INFO - Hooks will be executed in the following order:
before_run: (VERY_HIGH ) StepLrUpdaterHook
(NORMAL ) CheckpointHook
(NORMAL ) DistEvalHook
(VERY_LOW ) TextLoggerHook

before_train_epoch: (VERY_HIGH ) StepLrUpdaterHook
(NORMAL ) DistSamplerSeedHook
(NORMAL ) DistEvalHook
(LOW ) IterTimerHook
(VERY_LOW ) TextLoggerHook

before_train_iter: (VERY_HIGH ) StepLrUpdaterHook
(NORMAL ) DistEvalHook
(LOW ) IterTimerHook

after_train_iter: (ABOVE_NORMAL) OptimizerHook
(NORMAL ) CheckpointHook
(NORMAL ) DistEvalHook
(LOW ) IterTimerHook
(VERY_LOW ) TextLoggerHook

after_train_epoch: (NORMAL ) CheckpointHook
(NORMAL ) DistEvalHook
(VERY_LOW ) TextLoggerHook

before_val_epoch: (NORMAL ) DistSamplerSeedHook
(LOW ) IterTimerHook
(VERY_LOW ) TextLoggerHook

before_val_iter: (LOW ) IterTimerHook

after_val_iter: (LOW ) IterTimerHook

after_val_epoch: (VERY_LOW ) TextLoggerHook

after_run: (VERY_LOW ) TextLoggerHook

2021-12-26 16:04:28,461 - mmtrack - INFO - workflow: [('train', 1)], max: 4 epochs
2021-12-26 16:04:28,461 - mmtrack - INFO - Checkpoints will be saved to /data/yangjiahui/VIDProject/mmtracking/work_dirs/20211226_001_try3 by HardDiskBackend.
2021-12-26 16:05:00,501 - mmtrack - INFO - Saving checkpoint at 1 epochs
2021-12-26 16:05:32,658 - mmtrack - INFO - Saving checkpoint at 2 epochs
2021-12-26 16:06:04,769 - mmtrack - INFO - Saving checkpoint at 3 epochs
2021-12-26 16:06:37,068 - mmtrack - INFO - Saving checkpoint at 4 epochs

FarranYang commented 2 years ago

[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ] 5936/6186, 6.0 task/s, elapsed: 983s, ETA: 41s^C
Traceback (most recent call last):
  File "/data/yangjiahui/envs/torch15/lib/python3.8/", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/data/yangjiahui/envs/torch15/lib/python3.8/", line 87, in _run_code
    exec(code, run_globals)
  File "/data/yangjiahui/envs/torch15/lib/python3.8/site-packages/torch/distributed/", line 263, in <module>
    main()
  File "/data/yangjiahui/envs/torch15/lib/python3.8/site-packages/torch/distributed/", line 256, in main
    process.wait()
  File "/data/yangjiahui/envs/torch15/lib/python3.8/", line 1083, in wait
    return self._wait(timeout=timeout)
  File "/data/yangjiahui/envs/torch15/lib/python3.8/", line 1806, in _wait
    (pid, sts) = self._try_wait(0)
  File "/data/yangjiahui/envs/torch15/lib/python3.8/", line 1764, in _try_wait
    (pid, sts) = os.waitpid(, wait_flags)
KeyboardInterrupt

GT9505 commented 2 years ago

The progress bar can appear stuck because it only counts the images processed on rank 0, while different ranks process different numbers of images in MMTracking. When the progress bar is stuck, please use nvidia-smi to check whether any GPU is still processing images. If so, please wait. If your dataset contains a very long video, it is reasonable for the bar to stay stuck for a long while.
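To illustrate the explanation above, here is a minimal sketch that simulates how whole videos might be divided among ranks and how much work the slowest rank still has once rank 0 is done. The contiguous split, the helper names `split_videos_across_ranks` and `frames_per_rank`, and the video lengths are all assumptions made up for illustration; this is not MMTracking's actual sampler.

```python
from typing import List


def split_videos_across_ranks(video_lengths: List[int], world_size: int) -> List[List[int]]:
    """Assign whole videos to ranks in contiguous chunks (hypothetical scheme)."""
    chunks: List[List[int]] = [[] for _ in range(world_size)]
    per_rank = -(-len(video_lengths) // world_size)  # ceiling division
    for i, length in enumerate(video_lengths):
        chunks[i // per_rank].append(length)
    return chunks


def frames_per_rank(video_lengths: List[int], world_size: int) -> List[int]:
    """Total number of frames each rank has to process."""
    return [sum(chunk) for chunk in split_videos_across_ranks(video_lengths, world_size)]


# Made-up video lengths in frames; the last video is much longer than the rest.
video_lengths = [120, 90, 150, 100, 80, 2000]
loads = frames_per_rank(video_lengths, world_size=3)

# If the bar only follows rank 0, it stops moving once rank 0 is done,
# even though the slowest rank still has this many frames left
# (assuming all ranks run at the same speed).
remaining_after_rank0 = max(loads) - loads[0]

for rank, load in enumerate(loads):
    print(f"rank {rank}: {load} frames")
print(f"frames left on the slowest rank after rank 0 finishes: {remaining_after_rank0}")
```

With a split like this, a stalled bar is expected for as long as the rank holding the long video needs to finish, which is why checking GPU activity is more informative than watching the bar.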

FarranYang commented 2 years ago

Thank you for your reply! When it got stuck, I checked the processes with nvidia-smi and they were still running. I will wait longer.

FarranYang commented 2 years ago

Hi, I ran the training again, but it still got stuck during evaluation. This time I waited 21 hours and the processes never exited; no errors were reported and no new log output was produced.

This is the log from the stalled run:
  File "/data/yangjiahui/envs/torch15/lib/python3.8/site-packages/torch/nn/modules/", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/yangjiahui/envs/torch15/lib/python3.8/site-packages/mmcv/runner/", line 186, in new_func
    return old_func(*args, **kwargs)
  File "/data/yangjiahui/VIDProject/mmtracking/mmtrack/models/roi_heads/roi_extractors/", line 195, in forward
    ref_roi_feats = self.most_similar_roi_align(
  File "/data/yangjiahui/VIDProject/mmtracking/mmtrack/models/roi_heads/roi_extractors/", line 175, in most_similar_roi_align
    ref_roi_feats =, one_ref_roi_feats),
RuntimeError: CUDA out of memory. Tried to allocate 402.00 MiB (GPU 3; 10.76 GiB total capacity; 3.32 GiB already allocated; 152.56 MiB free; 3.95 GiB reserved in total by PyTorch)
[>>>>>>>>>>>>>>>>>>>>>>> ] 5592/6186, 10.5 task/s, elapsed: 534s, ETA: 57s
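The traceback above contains a CUDA out-of-memory error on GPU 3, so one possible reading (an assumption, not something confirmed in this thread) is that one rank crashed while the others kept waiting for it. A small sketch for scanning a saved log for such messages follows; the `find_oom_events` helper and the regular expression are hypothetical and only match the message format shown in the log above.

```python
import re
from typing import List, Tuple

# Matches the PyTorch message format seen in the log above (assumed stable for this version).
OOM_PATTERN = re.compile(
    r"CUDA out of memory\. Tried to allocate ([\d.]+) (MiB|GiB) \(GPU (\d+);"
)


def find_oom_events(log_text: str) -> List[Tuple[int, float]]:
    """Return (gpu_index, requested_MiB) for every OOM message found in the text."""
    events = []
    for match in OOM_PATTERN.finditer(log_text):
        size = float(match.group(1))
        if match.group(2) == "GiB":
            size *= 1024  # normalise to MiB
        events.append((int(match.group(3)), size))
    return events


# The error line copied from the log above.
sample = (
    "RuntimeError: CUDA out of memory. Tried to allocate 402.00 MiB "
    "(GPU 3; 10.76 GiB total capacity; 3.32 GiB already allocated; "
    "152.56 MiB free; 3.95 GiB reserved in total by PyTorch)"
)
events = find_oom_events(sample)
for gpu, mib in events:
    print(f"GPU {gpu} failed to allocate {mib:.2f} MiB")
```

If a scan like this finds an event, the stall is more likely a crashed rank than a long video, and waiting longer will not help.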

GT9505 commented 2 years ago

If your dataset has different categories from ImageNet VID, you need to modify CLASSES in ImagenetVIDDataset.
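A quick sanity check after editing CLASSES is to confirm that its length agrees with `num_classes` in the model config. The sketch below does this on plain Python data; the `classes_match` helper and the example class names are hypothetical, and the nested key path simply mirrors the config printed earlier in this issue (`model.detector.roi_head.bbox_head.num_classes`).

```python
from typing import Sequence


def classes_match(classes: Sequence[str], model_cfg: dict) -> bool:
    """Check that the number of dataset classes equals bbox_head.num_classes."""
    bbox_head = model_cfg["detector"]["roi_head"]["bbox_head"]
    return len(classes) == bbox_head["num_classes"]


# Hypothetical custom classes; replace with the categories of your own dataset.
my_classes = ("cat", "dog", "bird")

# Minimal stand-in for the relevant part of the model config.
model = dict(
    detector=dict(
        roi_head=dict(
            bbox_head=dict(type="SelsaBBoxHead", num_classes=3),
        ),
    ),
)

if classes_match(my_classes, model):
    print("CLASSES and num_classes agree")
else:
    print("Mismatch: update CLASSES or num_classes")
```

The config in the first comment still has `num_classes=30`, the ImageNet VID default, so a custom dataset with a different number of categories would fail this check until both places are updated.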

FarranYang commented 2 years ago

As you suggested, I changed CLASSES in ImagenetVIDDataset and num_classes in base/models/, but it still gets stuck. o(╥﹏╥)o

FarranYang commented 2 years ago

Thank you for your reply. I rebuilt my environment and it works well now!