supperted825 / FairMOT-X

FairMOT for Multi-Class MOT using YOLOX as Detector
MIT License

RuntimeError in multi-gpu train #19

Closed: LX0912R closed this issue 1 year ago

LX0912R commented 1 year ago

Hello, thanks for sharing the code. Training works well on a single GPU, but it fails with a runtime error when using multiple GPUs, as shown below.

(fairmot) vot@rog:~/work/FairMOT-X$ cd /home/vot/work/FairMOT-X ; /usr/bin/env /home/vot/anaconda3/envs/fairmot/bin/python /home/vot/.vscode/extensions/ms-python.python-2023.2.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher 41967 -- /home/vot/work/FairMOT-X/src/train.py
Using tensorboardX
Training Chunk Sizes: [1, 1]
The output will be saved to /home/vot/work/FairMOT-X/src/lib/../../exp/mot/default
Setting up data...
Dataset Root: /home/vot/work/dataset
Loading cached ID Dict...
Using Model Scale for YOLOX-L
Heads are Predefined in YOLOX!
opt: Namespace(K=200, arch='yolox', augment=False, batch_size=2, cat_spec_wh=False, chunk_sizes=[1, 1], conf_thre=0.4, data_cfg='src/lib/cfg/mot17.json', data_dir='/hpctmp/e0425991/datasets/bdd100k/bdd100k/MOT', dataset='jde', debug_dir='/home/vot/work/FairMOT-X/src/lib/../../exp/mot/default/debug', dense_wh=False, det_thre=0.4, detection_only=False, down_ratio=8, exp_dir='/home/vot/work/FairMOT-X/src/lib/../../exp/mot', exp_id='default', fix_res=True, freeze_backbone=False, gen_scale=True, gpus=[0, 1], gpus_str='0, 1', head_conv=256, hide_data_time=False, hm_weight=1, id_loss='ce', id_weight=1, input_h=-1, input_img='/users/duanyou/c5/all_pretrain/test.txt', input_mode='video', input_res=-1, input_video='', input_w=-1, input_wh=(1024, 576), is_debug=False, keep_res=False, kitti_test=False, l1_loss=False, load_model='', lr=7e-05, lr_step=[20, 35, 40, 50, 60, 75, 80], master_batch_size=1, metric='loss', min_box_area=20, mosaic=False, multi_scale=False, nID_dict=defaultdict(<class 'int'>, {}), nms_thre=0.4, norm_wh=False, not_cuda_benchmark=False, not_prefetch_test=False, not_reg_offset=False, num_classes=8, num_epochs=20, num_iters=-1, num_stacks=1, num_workers=4, off_weight=1, output_format='video', output_root='../results', pad=31, post_conv_layers=0, print_iter=0, reg_loss='l1', reg_offset=True, reid_cls_ids='0,1,2,3,4,5,6,7', reid_dim=128, reid_only=False, resume=False, root_dir='/home/vot/work/FairMOT-X/src/lib/../..', save_all=False, save_dir='/home/vot/work/FairMOT-X/src/lib/../../exp/mot/default', seed=tensor(253), start_epoch=1, task='mot', test=False, test_det=False, test_emb=False, test_mot15=False, test_mot16=False, test_mot17=False, test_mot20=False, track_buffer=30, trainval=False, uncertainty_loss=True, val=False, val_intervals=10, val_mot15=False, val_mot16=False, val_mot17=False, val_mot20=False, vis_thresh=0.5, wh_weight=0.1, yolo='l', yolo_depth=1.0, yolo_width=1.0)
Creating model...
Starting training...
/home/vot/anaconda3/envs/fairmot/lib/python3.7/site-packages/torch/functional.py:568: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2228.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Traceback (most recent call last):
  File "/home/vot/anaconda3/envs/fairmot/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/vot/anaconda3/envs/fairmot/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/vot/.vscode/extensions/ms-python.python-2023.2.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 39, in <module>
    cli.main()
  File "/home/vot/.vscode/extensions/ms-python.python-2023.2.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 430, in main
    run()
  File "/home/vot/.vscode/extensions/ms-python.python-2023.2.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 284, in run_file
    runpy.run_path(target, run_name="__main__")
  File "/home/vot/.vscode/extensions/ms-python.python-2023.2.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 322, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/home/vot/.vscode/extensions/ms-python.python-2023.2.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 136, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/home/vot/.vscode/extensions/ms-python.python-2023.2.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 124, in _run_code
    exec(code, run_globals)
  File "/home/vot/work/FairMOT-X/src/train.py", line 132, in <module>
    run(opt)
  File "/home/vot/work/FairMOT-X/src/train.py", line 101, in run
    log_dict_train, _ = trainer.train(epoch, train_loader)
  File "/home/vot/work/FairMOT-X/src/lib/trains/yolotrainer.py", line 141, in train
    return self.run_epoch('train', epoch, data_loader)
  File "/home/vot/work/FairMOT-X/src/lib/trains/yolotrainer.py", line 88, in run_epoch
    loss, loss_stats = model.forward(imgs, (det_labels, track_ids))
  File "/home/vot/anaconda3/envs/fairmot/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/vot/anaconda3/envs/fairmot/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/vot/anaconda3/envs/fairmot/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/vot/anaconda3/envs/fairmot/lib/python3.7/site-packages/torch/_utils.py", line 457, in reraise
    raise exception
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/home/vot/anaconda3/envs/fairmot/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/vot/anaconda3/envs/fairmot/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/vot/work/FairMOT-X/src/lib/models/networks/yoloX.py", line 35, in forward
    loss, iou_loss, conf_loss, cls_loss, l1_loss, reid_loss, num_fg = self.head(fpn_outs, targets, x)
  File "/home/vot/anaconda3/envs/fairmot/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/vot/work/FairMOT-X/src/lib/models/networks/yolox/yolo_head.py", line 255, in forward
    output, grid = self.get_output_and_grid(output, k, stride_this_level, xin[0].type())
  File "/home/vot/work/FairMOT-X/src/lib/models/networks/yolox/yolo_head.py", line 333, in get_output_and_grid
    output[..., :2] = (output[..., :2] + grid) * stride
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!
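
The failing line combines the per-replica `output` tensor with a `grid` tensor that is still on the other GPU. The snippet below is a minimal, hypothetical illustration of that pattern (assumed shapes, needs a machine with at least two CUDA devices, and is not the repository's code), followed by one way to make the arithmetic device-safe:

```python
import torch

# Hypothetical shapes; only the device mismatch matters, as in the traceback.
assert torch.cuda.device_count() >= 2, "illustration needs at least two GPUs"

stride = 8
grid = torch.zeros(1, 1, 72, 128, 2, device="cuda:0")   # grid cached on GPU 0
output = torch.rand(1, 1, 72, 128, 6, device="cuda:1")  # output from replica on GPU 1

try:
    # Reproduces the error: operands live on cuda:1 and cuda:0.
    output[..., :2] = (output[..., :2] + grid) * stride
except RuntimeError as err:
    print(err)  # "Expected all tensors to be on the same device, ..."

# Aligning devices before the arithmetic avoids the error.
output[..., :2] = (output[..., :2] + grid.to(output.device)) * stride
```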

I also found the same issue reported in the YOLOX project (https://github.com/Megvii-BaseDetection/YOLOX/issues/286#issue-957183543). Any suggestions would be appreciated.
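
If the cause here matches that upstream report, one plausible explanation is that the grid cached in `get_output_and_grid` is built on one replica's GPU and then reused by the other replica under `nn.DataParallel`. Below is a hedged sketch of a possible workaround; the method body and attribute names (`self.grids`, the reshape) are assumptions modelled on upstream YOLOX, not the exact code of this repository. The other common route is to avoid `nn.DataParallel` and train with `DistributedDataParallel` instead.

```python
import torch

class DeviceSafeHeadMixin:
    """Hypothetical sketch of a device-safe get_output_and_grid; names and
    shapes are assumptions based on upstream YOLOX, not this repository."""

    def get_output_and_grid(self, output, k, stride, dtype):
        hsize, wsize = output.shape[-2:]          # output assumed (B, C, H, W)
        grid = self.grids[k]
        if grid.shape[2:4] != (hsize, wsize):
            yv, xv = torch.meshgrid(torch.arange(hsize), torch.arange(wsize))
            grid = torch.stack((xv, yv), 2).view(1, 1, hsize, wsize, 2).type(dtype)
            self.grids[k] = grid

        # Key change: a grid cached by another DataParallel replica may live on
        # that replica's GPU; pin it to this output's device before using it.
        grid = grid.to(output.device)

        output = output.flatten(start_dim=2).permute(0, 2, 1)  # (B, H*W, C)
        grid = grid.view(1, -1, 2)                             # (1, H*W, 2)
        output[..., :2] = (output[..., :2] + grid) * stride
        return output, grid
```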