luisfra19 opened 2 years ago
Hello, after installing all the recommended packages and successfully running inference on the demo images with the original scripts, I hit a blocker as soon as I try to start training.
This looks similar to an older issue: https://github.com/microsoft/SoftTeacher/issues/87#issuecomment-965242334
I train with:

```bash
bash tools/dist_train_partially.sh baseline 1 1 1
```
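For reference, the shell trace at the top of the log below shows what the wrapper script expands this to; I am quoting it here verbatim from the output, nothing in it is guessed:

```bash
# Expanded command as printed by the script's own shell trace (1 GPU, fold 1, 1% labels)
python -m torch.distributed.launch --nproc_per_node=1 --master_port=29500 \
    tools/train.py configs/baseline/faster_rcnn_r50_caffe_fpn_coco_partial_180k.py \
    --launcher pytorch --cfg-options fold=1 percent=1
```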
```
+ TYPE=baseline
+ FOLD=1
+ PERCENT=1
+ GPUS=1
+ PORT=29500
++ dirname tools/dist_train_partially.sh
+ PYTHONPATH=tools/..:
+ [[ baseline == \b\a\s\e\l\i\n\e ]]
++ dirname tools/dist_train_partially.sh
+ python -m torch.distributed.launch --nproc_per_node=1 --master_port=29500 tools/train.py configs/baseline/faster_rcnn_r50_caffe_fpn_coco_partial_180k.py --launcher pytorch --cfg-options fold=1 percent=1
/home/lfgp/anaconda3/envs/soft/lib/python3.6/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects `--local_rank` argument to be set, please change it to read from `os.environ['LOCAL_RANK']` instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
  FutureWarning,
/mnt/c/Users/Francisco Pereira/Desktop/IST/10º semestre/algoritmos/SoftTeacher/thirdparty/mmdetection/mmdet/datasets/pipelines/formating.py:7: UserWarning: DeprecationWarning: mmdet.datasets.pipelines.formating will be deprecated, please replace it with mmdet.datasets.pipelines.formatting.
  warnings.warn('DeprecationWarning: mmdet.datasets.pipelines.formating will be '
2022-04-20 15:52:00,103 - mmdet.ssod - INFO - [<StreamHandler <stderr> (INFO)>, <FileHandler /mnt/c/Users/Francisco Pereira/Desktop/IST/10º semestre/algoritmos/SoftTeacher/work_dirs/faster_rcnn_r50_caffe_fpn_coco_partial_180k/1/1/20220420_155159.log (INFO)>]
2022-04-20 15:52:00,104 - mmdet.ssod - INFO - Environment info:
------------------------------------------------------------
sys.platform: linux
Python: 3.6.13 |Anaconda, Inc.| (default, Jun 4 2021, 14:25:59) [GCC 7.5.0]
CUDA available: True
GPU 0: NVIDIA GeForce GTX 1650 with Max-Q Design
CUDA_HOME: None
GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
PyTorch: 1.10.2
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 10.2
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
  - CuDNN 7.6.5
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,
TorchVision: 0.11.3
OpenCV: 4.5.4-dev
MMCV: 1.4.7
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.2
MMDetection: 2.23.0+bef9a25
------------------------------------------------------------
2022-04-20 15:52:00,864 - mmdet.ssod - INFO - Distributed training: True
2022-04-20 15:52:01,603 - mmdet.ssod - INFO - Config:
model = dict(
    type='FasterRCNN',
    backbone=dict( type='ResNet', depth=50, num_stages=4, out_indices=(0, 1, 2, 3), frozen_stages=1, norm_cfg=dict(type='BN', requires_grad=False), norm_eval=True, style='caffe', init_cfg=dict( type='Pretrained', checkpoint='open-mmlab://detectron2/resnet50_caffe')),
    neck=dict( type='FPN', in_channels=[256, 512, 1024, 2048], out_channels=256, num_outs=5),
    rpn_head=dict( type='RPNHead', in_channels=256, feat_channels=256, anchor_generator=dict( type='AnchorGenerator', scales=[8], ratios=[0.5, 1.0, 2.0], strides=[4, 8, 16, 32, 64]), bbox_coder=dict( type='DeltaXYWHBBoxCoder', target_means=[0.0, 0.0, 0.0, 0.0], target_stds=[1.0, 1.0, 1.0, 1.0]), loss_cls=dict( type='CrossEntropyLoss', use_sigmoid=True, loss_weight=1.0), loss_bbox=dict(type='L1Loss', loss_weight=1.0)),
    roi_head=dict( type='StandardRoIHead', bbox_roi_extractor=dict( type='SingleRoIExtractor', roi_layer=dict(type='RoIAlign', output_size=7, sampling_ratio=0), out_channels=256, featmap_strides=[4, 8, 16, 32]), bbox_head=dict( type='Shared2FCBBoxHead', in_channels=256, fc_out_channels=1024, roi_feat_size=7, num_classes=80, bbox_coder=dict( type='DeltaXYWHBBoxCoder', target_means=[0.0, 0.0, 0.0, 0.0], target_stds=[0.1, 0.1, 0.2, 0.2]), reg_class_agnostic=False, loss_cls=dict( type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0), loss_bbox=dict(type='L1Loss', loss_weight=1.0))),
    train_cfg=dict(
        rpn=dict( assigner=dict( type='MaxIoUAssigner', pos_iou_thr=0.7, neg_iou_thr=0.3, min_pos_iou=0.3, match_low_quality=True,
            ignore_iof_thr=-1), sampler=dict( type='RandomSampler', num=256, pos_fraction=0.5, neg_pos_ub=-1, add_gt_as_proposals=False), allowed_border=-1, pos_weight=-1, debug=False),
        rpn_proposal=dict( nms_pre=2000, max_per_img=1000, nms=dict(type='nms', iou_threshold=0.7), min_bbox_size=0),
        rcnn=dict( assigner=dict( type='MaxIoUAssigner', pos_iou_thr=0.5, neg_iou_thr=0.5, min_pos_iou=0.5, match_low_quality=False, ignore_iof_thr=-1), sampler=dict( type='RandomSampler', num=512, pos_fraction=0.25, neg_pos_ub=-1, add_gt_as_proposals=True), pos_weight=-1, debug=False)),
    test_cfg=dict(
        rpn=dict( nms_pre=1000, max_per_img=1000, nms=dict(type='nms', iou_threshold=0.7), min_bbox_size=0),
        rcnn=dict( score_thr=0.05, nms=dict(type='nms', iou_threshold=0.5), max_per_img=100)))
dataset_type = 'CocoDataset'
data_root = 'data/coco/'
img_norm_cfg = dict( mean=[103.53, 116.28, 123.675], std=[1.0, 1.0, 1.0], to_rgb=False)
train_pipeline = [ dict(type='LoadImageFromFile'), dict(type='LoadAnnotations', with_bbox=True), dict( type='Sequential', transforms=[ dict( type='RandResize', img_scale=[(1333, 400), (1333, 1200)], multiscale_mode='range', keep_ratio=True), dict(type='RandFlip', flip_ratio=0.5), dict( type='OneOf', transforms=[ dict(type='Identity'), dict(type='AutoContrast'), dict(type='RandEqualize'), dict(type='RandSolarize'), dict(type='RandColor'), dict(type='RandContrast'), dict(type='RandBrightness'), dict(type='RandSharpness'), dict(type='RandPosterize') ]) ]), dict(type='Pad', size_divisor=32), dict( type='Normalize', mean=[103.53, 116.28, 123.675], std=[1.0, 1.0, 1.0], to_rgb=False), dict(type='ExtraAttrs', tag='sup'), dict(type='DefaultFormatBundle'), dict( type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'], meta_keys=('filename', 'ori_shape', 'img_shape', 'img_norm_cfg', 'pad_shape', 'scale_factor', 'tag')) ]
test_pipeline = [ dict(type='LoadImageFromFile'), dict( type='MultiScaleFlipAug', img_scale=(1333, 800), flip=False, transforms=[ dict(type='Resize', keep_ratio=True), dict(type='RandomFlip'), dict( type='Normalize', mean=[103.53, 116.28, 123.675], std=[1.0, 1.0, 1.0], to_rgb=False), dict(type='Pad', size_divisor=32), dict(type='ImageToTensor', keys=['img']), dict(type='Collect', keys=['img']) ]) ]
data = dict(
    samples_per_gpu=1,
    workers_per_gpu=1,
    train=dict( type='CocoDataset', ann_file= 'data/coco/annotations/semi_supervised/instances_train2017.1@1.json', img_prefix='data/coco/train2017/', pipeline=[ dict(type='LoadImageFromFile'), dict(type='LoadAnnotations', with_bbox=True), dict( type='Sequential', transforms=[ dict( type='RandResize', img_scale=[(1333, 400), (1333, 1200)], multiscale_mode='range', keep_ratio=True), dict(type='RandFlip', flip_ratio=0.5), dict( type='OneOf', transforms=[ dict(type='Identity'), dict(type='AutoContrast'), dict(type='RandEqualize'), dict(type='RandSolarize'), dict(type='RandColor'), dict(type='RandContrast'), dict(type='RandBrightness'), dict(type='RandSharpness'), dict(type='RandPosterize') ]) ]), dict(type='Pad', size_divisor=32), dict( type='Normalize', mean=[103.53, 116.28, 123.675], std=[1.0, 1.0, 1.0], to_rgb=False), dict(type='ExtraAttrs', tag='sup'), dict(type='DefaultFormatBundle'), dict( type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'], meta_keys=('filename', 'ori_shape', 'img_shape', 'img_norm_cfg', 'pad_shape', 'scale_factor', 'tag')) ]),
    val=dict( type='CocoDataset', ann_file='data/coco/annotations/instances_val2017.json', img_prefix='data/coco/val2017/', pipeline=[ dict(type='LoadImageFromFile'), dict( type='MultiScaleFlipAug', img_scale=(1333, 800), flip=False, transforms=[ dict(type='Resize', keep_ratio=True), dict(type='RandomFlip'), dict( type='Normalize', mean=[103.53, 116.28, 123.675], std=[1.0, 1.0, 1.0], to_rgb=False), dict(type='Pad', size_divisor=32), dict(type='ImageToTensor', keys=['img']), dict(type='Collect', keys=['img']) ]) ]),
    test=dict( type='CocoDataset', ann_file='data/coco/annotations/instances_val2017.json', img_prefix='data/coco/val2017/', pipeline=[ dict(type='LoadImageFromFile'), dict( type='MultiScaleFlipAug', img_scale=(1333, 800), flip=False, transforms=[ dict(type='Resize', keep_ratio=True), dict(type='RandomFlip'), dict( type='Normalize', mean=[103.53, 116.28, 123.675], std=[1.0, 1.0, 1.0], to_rgb=False), dict(type='Pad', size_divisor=32), dict(type='ImageToTensor', keys=['img']), dict(type='Collect', keys=['img']) ]) ]))
evaluation = dict(interval=4000, metric='bbox')
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)
lr_config = dict( policy='step', warmup='linear', warmup_iters=500, warmup_ratio=0.001, step=[120000, 160000])
runner = dict(type='IterBasedRunner', max_iters=180000)
checkpoint_config = dict(interval=4000, by_epoch=False, max_keep_ckpts=10)
log_config = dict( interval=50, hooks=[ dict(type='TextLoggerHook'), dict( type='WandbLoggerHook', init_kwargs=dict( project='pre_release', name='faster_rcnn_r50_caffe_fpn_coco_partial_180k', config=dict( fold=1, percent=1, work_dirs='work_dirs/${cfg_name}/${percent}/${fold}', total_step=180000)), by_epoch=False) ])
custom_hooks = [dict(type='NumClassCheckHook')]
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
opencv_num_threads = 0
mp_start_method = 'fork'
mmdet_base = '../../thirdparty/mmdetection/configs/_base_'
fp16 = dict(loss_scale='dynamic')
fold = 1
percent = 1
work_dir = 'work_dirs/faster_rcnn_r50_caffe_fpn_coco_partial_180k/1/1'
cfg_name = 'faster_rcnn_r50_caffe_fpn_coco_partial_180k'
gpu_ids = range(0, 1)
2022-04-20 15:52:02,392 - mmdet.ssod - INFO - initialize ResNet with init_cfg {'type': 'Pretrained', 'checkpoint': 'open-mmlab://detectron2/resnet50_caffe'}
2022-04-20 15:52:02,394 - mmcv - INFO - load model from: open-mmlab://detectron2/resnet50_caffe
2022-04-20 15:52:02,395 - mmcv - INFO - load checkpoint from openmmlab path: open-mmlab://detectron2/resnet50_caffe
2022-04-20 15:52:03,208 - mmcv - WARNING - The model and loaded state dict do not match exactly
unexpected key in source state_dict: conv1.bias
2022-04-20 15:52:03,236 - mmdet.ssod - INFO - initialize FPN with init_cfg {'type': 'Xavier', 'layer': 'Conv2d', 'distribution': 'uniform'}
2022-04-20 15:52:03,275 - mmdet.ssod - INFO - initialize RPNHead with init_cfg {'type': 'Normal', 'layer': 'Conv2d', 'std': 0.01}
2022-04-20 15:52:03,284 - mmdet.ssod - INFO - initialize Shared2FCBBoxHead with init_cfg [{'type': 'Normal', 'std': 0.01, 'override': {'name': 'fc_cls'}}, {'type': 'Normal', 'std': 0.001, 'override': {'name': 'fc_reg'}}, {'type': 'Xavier', 'distribution': 'uniform', 'override': [{'name': 'shared_fcs'}, {'name': 'cls_fcs'}, {'name': 'reg_fcs'}]}]
loading annotations into memory...
Done (t=0.02s)
creating index...
index created!
loading annotations into memory...
Done (t=0.01s)
creating index...
index created!
2022-04-20 15:52:10,631 - mmdet.ssod - INFO - Start running, host: lfgp@LAPTOP-VI0T98FT, work_dir: /mnt/c/Users/Francisco Pereira/Desktop/IST/10º semestre/algoritmos/SoftTeacher/work_dirs/faster_rcnn_r50_caffe_fpn_coco_partial_180k/1/1
2022-04-20 15:52:10,633 - mmdet.ssod - INFO - Hooks will be executed in the following order:
before_run:
(VERY_HIGH ) StepLrUpdaterHook
(ABOVE_NORMAL) Fp16OptimizerHook
(NORMAL ) CheckpointHook
(80 ) DistEvalHook
(VERY_LOW ) TextLoggerHook
(VERY_LOW ) WandbLoggerHook
--------------------
before_train_epoch:
(VERY_HIGH ) StepLrUpdaterHook
(NORMAL ) NumClassCheckHook
(LOW ) IterTimerHook
(80 ) DistEvalHook
(VERY_LOW ) TextLoggerHook
(VERY_LOW ) WandbLoggerHook
--------------------
before_train_iter:
(VERY_HIGH ) StepLrUpdaterHook
(LOW ) IterTimerHook
(80 ) DistEvalHook
--------------------
after_train_iter:
(ABOVE_NORMAL) Fp16OptimizerHook
(NORMAL ) CheckpointHook
(LOW ) IterTimerHook
(80 ) DistEvalHook
(VERY_LOW ) TextLoggerHook
(VERY_LOW ) WandbLoggerHook
--------------------
after_train_epoch:
(NORMAL ) CheckpointHook
(80 ) DistEvalHook
(VERY_LOW ) TextLoggerHook
(VERY_LOW ) WandbLoggerHook
--------------------
before_val_epoch:
(NORMAL ) NumClassCheckHook
(LOW ) IterTimerHook
(VERY_LOW ) TextLoggerHook
(VERY_LOW ) WandbLoggerHook
--------------------
before_val_iter:
(LOW ) IterTimerHook
--------------------
after_val_iter:
(LOW ) IterTimerHook
--------------------
after_val_epoch:
(VERY_LOW ) TextLoggerHook
(VERY_LOW ) WandbLoggerHook
--------------------
after_run:
(VERY_LOW ) TextLoggerHook
(VERY_LOW ) WandbLoggerHook
--------------------
2022-04-20 15:52:10,635 - mmdet.ssod - INFO - workflow: [('train', 1)], max: 180000 iters
2022-04-20 15:52:10,638 - mmdet.ssod - INFO - Checkpoints will be saved to /mnt/c/Users/Francisco Pereira/Desktop/IST/10º semestre/algoritmos/SoftTeacher/work_dirs/faster_rcnn_r50_caffe_fpn_coco_partial_180k/1/1 by HardDiskBackend.
wandb: Currently logged in as: lfgp (use `wandb login --relogin` to force relogin)
wandb: wandb version 0.12.14 is available! To upgrade, please run:
wandb: $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.10.31
wandb: Syncing run faster_rcnn_r50_caffe_fpn_coco_partial_180k
wandb: ⭐️ View project at https://wandb.ai/lfgp/pre_release
wandb: 🚀 View run at https://wandb.ai/lfgp/pre_release/runs/2c7vaz2t
wandb: Run data is saved locally in /mnt/c/Users/Francisco Pereira/Desktop/IST/10º semestre/algoritmos/SoftTeacher/wandb/run-20220420_155211-2c7vaz2t
wandb: Run `wandb offline` to turn off syncing.
Traceback (most recent call last):
  File "/home/lfgp/anaconda3/envs/soft/lib/python3.6/site-packages/mmcv/runner/iter_based_runner.py", line 32, in __next__
    data = next(self.iter_loader)
  File "/home/lfgp/anaconda3/envs/soft/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/lfgp/anaconda3/envs/soft/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1176, in _next_data
    raise StopIteration
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tools/train.py", line 198, in <module>
    main()
  File "tools/train.py", line 193, in main
    meta=meta,
  File "/mnt/c/Users/Francisco Pereira/Desktop/IST/10º semestre/algoritmos/SoftTeacher/ssod/apis/train.py", line 206, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/lfgp/anaconda3/envs/soft/lib/python3.6/site-packages/mmcv/runner/iter_based_runner.py", line 134, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/home/lfgp/anaconda3/envs/soft/lib/python3.6/site-packages/mmcv/runner/iter_based_runner.py", line 59, in train
    data_batch = next(data_loader)
  File "/home/lfgp/anaconda3/envs/soft/lib/python3.6/site-packages/mmcv/runner/iter_based_runner.py", line 39, in __next__
    data = next(self.iter_loader)
  File "/home/lfgp/anaconda3/envs/soft/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/lfgp/anaconda3/envs/soft/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1176, in _next_data
    raise StopIteration
StopIteration
wandb: Waiting for W&B process to finish, PID 1648
wandb: Program failed with code 1. Press ctrl-c to abort syncing.
wandb:
wandb: Find user logs for this run at: /mnt/c/Users/Francisco Pereira/Desktop/IST/10º semestre/algoritmos/SoftTeacher/wandb/run-20220420_155211-2c7vaz2t/logs/debug.log
wandb: Find internal logs for this run at: /mnt/c/Users/Francisco Pereira/Desktop/IST/10º semestre/algoritmos/SoftTeacher/wandb/run-20220420_155211-2c7vaz2t/logs/debug-internal.log
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb:
wandb: Synced faster_rcnn_r50_caffe_fpn_coco_partial_180k: https://wandb.ai/lfgp/pre_release/runs/2c7vaz2t
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1480) of binary: /home/lfgp/anaconda3/envs/soft/bin/python
Traceback (most recent call last):
  File "/home/lfgp/anaconda3/envs/soft/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/lfgp/anaconda3/envs/soft/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/lfgp/anaconda3/envs/soft/lib/python3.6/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/lfgp/anaconda3/envs/soft/lib/python3.6/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/lfgp/anaconda3/envs/soft/lib/python3.6/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/lfgp/anaconda3/envs/soft/lib/python3.6/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/home/lfgp/anaconda3/envs/soft/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/lfgp/anaconda3/envs/soft/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
tools/train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time : 2022-04-20_15:52:43
  host : LAPTOP-VI0T98FT.
  rank : 0 (local_rank: 0)
  exitcode : 1 (pid: 1480)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```
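In case it helps with triage: the `StopIteration` comes straight out of the DataLoader on the very first training iteration, and as far as I understand mmcv's `IterLoader` (it catches the first `StopIteration`, rebuilds the iterator, and only lets the second one propagate, which matches lines 32 and 39 in the traceback), that is what happens when the train dataloader produces no batches at all. I am not certain that is the cause here, but a quick way to rule out an empty or missing 1% split is to inspect the annotation file the config points to. The path below is copied from the config dump above; the check itself is just my own hedged diagnostic, not something from the repo:

```bash
# Hedged diagnostic (my own, not from the repo): confirm the 1% split referenced
# by the config exists and actually contains images and annotations.
ANN=data/coco/annotations/semi_supervised/instances_train2017.1@1.json
ls -lh "$ANN"
python -c "import json; d = json.load(open('$ANN')); print(len(d['images']), 'images,', len(d['annotations']), 'annotations')"
```

If the file is missing or both counts come back as zero, regenerating the data split before training again would be my first try.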
Any updates? I'm having the same problem using mmseg.