ifzhang / ByteTrack

[ECCV 2022] ByteTrack: Multi-Object Tracking by Associating Every Detection Box
MIT License

Training time is too long #48

Open Four1996 opened 2 years ago

Four1996 commented 2 years ago

I used python3 tools/train.py -f exps/example/mot/yolox_x_ablation.py -d 3 -b 8 --fp16 -o -c pretrained/yolox_x.pth to train the ablation model (MOT17 half train and CrowdHuman), and the training time is far too long. My device: RTX 2080 Ti x3, Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz x4. Here is my log file:

root@ai:/ai/data/ByteTrack-main# python3 tools/train.py -f exps/example/mot/yolox_x_ablation.py -d 3 -b 8 --fp16 -o -c pretrained/yolox_x.pth
2021-11-02 21:20:19.566 | INFO     | yolox.core.launch:launch_by_subprocess:145 - 
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
2021-11-02 21:20:22.017 | INFO     | yolox.core.launch:_distributed_worker:184 - Rank 1 initialization finished.
2021-11-02 21:20:22.022 | INFO     | yolox.core.launch:_distributed_worker:184 - Rank 0 initialization finished.
2021-11-02 21:20:22.027 | INFO     | yolox.core.launch:_distributed_worker:184 - Rank 2 initialization finished.
[W ProcessGroupNCCL.cpp:1569] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 2 using best-guess GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 1 using best-guess GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
2021-11-02 21:20:31 | INFO     | yolox.core.trainer:124 - args: Namespace(batch_size=8, ckpt='pretrained/yolox_x.pth', devices=3, dist_backend='nccl', dist_url=None, exp_file='exps/example/mot/yolox_x_ablation.py', experiment_name='yolox_x_ablation', fp16=True, local_rank=0, machine_rank=0, name=None, num_machines=1, occupy=True, opts=[], resume=False, start_epoch=None)
2021-11-02 21:20:31 | INFO     | yolox.core.trainer:125 - exp value:
╒══════════════════╤════════════════════╕
│ keys             │ values             │
╞══════════════════╪════════════════════╡
│ seed             │ None               │
├──────────────────┼────────────────────┤
│ output_dir       │ './YOLOX_outputs'  │
├──────────────────┼────────────────────┤
│ print_interval   │ 20                 │
├──────────────────┼────────────────────┤
│ eval_interval    │ 5                  │
├──────────────────┼────────────────────┤
│ num_classes      │ 1                  │
├──────────────────┼────────────────────┤
│ depth            │ 1.33               │
├──────────────────┼────────────────────┤
│ width            │ 1.25               │
├──────────────────┼────────────────────┤
│ data_num_workers │ 0                  │
├──────────────────┼────────────────────┤
│ input_size       │ (800, 1440)        │
├──────────────────┼────────────────────┤
│ random_size      │ (18, 32)           │
├──────────────────┼────────────────────┤
│ train_ann        │ 'train.json'       │
├──────────────────┼────────────────────┤
│ val_ann          │ 'val_half.json'    │
├──────────────────┼────────────────────┤
│ degrees          │ 10.0               │
├──────────────────┼────────────────────┤
│ translate        │ 0.1                │
├──────────────────┼────────────────────┤
│ scale            │ (0.1, 2)           │
├──────────────────┼────────────────────┤
│ mscale           │ (0.8, 1.6)         │
├──────────────────┼────────────────────┤
│ shear            │ 2.0                │
├──────────────────┼────────────────────┤
│ perspective      │ 0.0                │
├──────────────────┼────────────────────┤
│ enable_mixup     │ True               │
├──────────────────┼────────────────────┤
│ warmup_epochs    │ 1                  │
├──────────────────┼────────────────────┤
│ max_epoch        │ 80                 │
├──────────────────┼────────────────────┤
│ warmup_lr        │ 0                  │
├──────────────────┼────────────────────┤
│ basic_lr_per_img │ 1.5625e-05         │
├──────────────────┼────────────────────┤
│ scheduler        │ 'yoloxwarmcos'     │
├──────────────────┼────────────────────┤
│ no_aug_epochs    │ 10                 │
├──────────────────┼────────────────────┤
│ min_lr_ratio     │ 0.05               │
├──────────────────┼────────────────────┤
│ ema              │ True               │
├──────────────────┼────────────────────┤
│ weight_decay     │ 0.0005             │
├──────────────────┼────────────────────┤
│ momentum         │ 0.9                │
├──────────────────┼────────────────────┤
│ exp_name         │ 'yolox_x_ablation' │
├──────────────────┼────────────────────┤
│ test_size        │ (800, 1440)        │
├──────────────────┼────────────────────┤
│ test_conf        │ 0.1                │
├──────────────────┼────────────────────┤
│ nmsthre          │ 0.7                │
╘══════════════════╧════════════════════╛
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
2021-11-02 21:20:33 | INFO     | yolox.core.trainer:131 - Model Summary: Params: 99.00M, Gflops: 791.73
2021-11-02 21:20:33 | INFO     | yolox.core.trainer:289 - loading checkpoint for fine tuning
2021-11-02 21:20:37 | WARNING  | yolox.utils.checkpoint:27 - Shape of head.cls_preds.0.weight in checkpoint is torch.Size([80, 320, 1, 1]), while shape of head.cls_preds.0.weight in model is torch.Size([1, 320, 1, 1]).
2021-11-02 21:20:37 | WARNING  | yolox.utils.checkpoint:27 - Shape of head.cls_preds.0.bias in checkpoint is torch.Size([80]), while shape of head.cls_preds.0.bias in model is torch.Size([1]).
2021-11-02 21:20:37 | WARNING  | yolox.utils.checkpoint:27 - Shape of head.cls_preds.1.weight in checkpoint is torch.Size([80, 320, 1, 1]), while shape of head.cls_preds.1.weight in model is torch.Size([1, 320, 1, 1]).
2021-11-02 21:20:37 | WARNING  | yolox.utils.checkpoint:27 - Shape of head.cls_preds.1.bias in checkpoint is torch.Size([80]), while shape of head.cls_preds.1.bias in model is torch.Size([1]).
2021-11-02 21:20:37 | WARNING  | yolox.utils.checkpoint:27 - Shape of head.cls_preds.2.weight in checkpoint is torch.Size([80, 320, 1, 1]), while shape of head.cls_preds.2.weight in model is torch.Size([1, 320, 1, 1]).
2021-11-02 21:20:37 | WARNING  | yolox.utils.checkpoint:27 - Shape of head.cls_preds.2.bias in checkpoint is torch.Size([80]), while shape of head.cls_preds.2.bias in model is torch.Size([1]).
2021-11-02 21:20:37 | INFO     | yolox.data.datasets.mot:39 - loading annotations into memory...
2021-11-02 21:20:41 | INFO     | yolox.data.datasets.mot:39 - Done (t=4.29s)
2021-11-02 21:20:41 | INFO     | pycocotools.coco:88 - creating index...
2021-11-02 21:20:41 | INFO     | pycocotools.coco:88 - index created!
2021-11-02 21:20:44 | INFO     | yolox.core.trainer:148 - init prefetcher, this might take one minute or less...
2021-11-02 21:20:52 | INFO     | yolox.data.datasets.mot:39 - loading annotations into memory...
2021-11-02 21:20:53 | INFO     | yolox.data.datasets.mot:39 - Done (t=0.26s)
2021-11-02 21:20:53 | INFO     | pycocotools.coco:88 - creating index...
2021-11-02 21:20:53 | INFO     | pycocotools.coco:88 - index created!
2021-11-02 21:20:53 | INFO     | yolox.core.trainer:176 - Training start...
2021-11-02 21:20:53 | INFO     | yolox.core.trainer:187 - ---> start train epoch1
2021-11-02 21:21:25 | INFO     | yolox.core.trainer:250 - epoch: 1/80, iter: 20/3672, mem: 9013Mb, iter_time: 1.614s, data_time: 0.850s, total_loss: 9.063, iou_loss: 3.039, l1_loss: 0.000, conf_loss: 3.430, cls_loss: 2.594, lr: 3.708e-09, size: 640, ETA: 5 days, 11:39:28
2021-11-02 21:22:12 | INFO     | yolox.core.trainer:250 - epoch: 1/80, iter: 40/3672, mem: 9600Mb, iter_time: 2.322s, data_time: 1.259s, total_loss: 7.511, iou_loss: 2.473, l1_loss: 0.000, conf_loss: 2.175, cls_loss: 2.863, lr: 1.483e-08, size: 1024, ETA: 6 days, 16:31:56
2021-11-02 21:22:49 | INFO     | yolox.core.trainer:250 - epoch: 1/80, iter: 60/3672, mem: 9600Mb, iter_time: 1.886s, data_time: 0.978s, total_loss: 8.964, iou_loss: 2.844, l1_loss: 0.000, conf_loss: 3.534, cls_loss: 2.586, lr: 3.337e-08, size: 672, ETA: 6 days, 14:18:41
2021-11-02 21:23:23 | INFO     | yolox.core.trainer:250 - epoch: 1/80, iter: 80/3672, mem: 9600Mb, iter_time: 1.657s, data_time: 0.903s, total_loss: 9.546, iou_loss: 3.083, l1_loss: 0.000, conf_loss: 3.739, cls_loss: 2.723, lr: 5.933e-08, size: 960, ETA: 6 days, 8:30:58
2021-11-02 21:24:01 | INFO     | yolox.core.trainer:250 - epoch: 1/80, iter: 100/3672, mem: 9600Mb, iter_time: 1.923s, data_time: 1.024s, total_loss: 7.776, iou_loss: 3.002, l1_loss: 0.000, conf_loss: 2.293, cls_loss: 2.481, lr: 9.271e-08, size: 960, ETA: 6 days, 9:23:06
2021-11-02 21:24:40 | INFO     | yolox.core.trainer:250 - epoch: 1/80, iter: 120/3672, mem: 9600Mb, iter_time: 1.922s, data_time: 1.029s, total_loss: 9.406, iou_loss: 3.129, l1_loss: 0.000, conf_loss: 4.003, cls_loss: 2.274, lr: 1.335e-07, size: 960, ETA: 6 days, 9:56:22
2021-11-02 21:25:17 | INFO     | yolox.core.trainer:250 - epoch: 1/80, iter: 140/3672, mem: 9600Mb, iter_time: 1.860s, data_time: 1.005s, total_loss: 7.926, iou_loss: 3.104, l1_loss: 0.000, conf_loss: 2.596, cls_loss: 2.226, lr: 1.817e-07, size: 736, ETA: 6 days, 9:36:51
2021-11-02 21:25:53 | INFO     | yolox.core.trainer:250 - epoch: 1/80, iter: 160/3672, mem: 9600Mb, iter_time: 1.828s, data_time: 1.028s, total_loss: 8.205, iou_loss: 2.912, l1_loss: 0.000, conf_loss: 2.858, cls_loss: 2.434, lr: 2.373e-07, size: 832, ETA: 6 days, 9:02:10
2021-11-02 21:26:34 | INFO     | yolox.core.trainer:250 - epoch: 1/80, iter: 180/3672, mem: 9600Mb, iter_time: 2.031s, data_time: 1.078s, total_loss: 9.311, iou_loss: 3.174, l1_loss: 0.000, conf_loss: 4.104, cls_loss: 2.033, lr: 3.004e-07, size: 1024, ETA: 6 days, 10:25:29
2021-11-02 21:27:06 | INFO     | yolox.core.trainer:250 - epoch: 1/80, iter: 200/3672, mem: 9600Mb, iter_time: 1.585s, data_time: 0.873s, total_loss: 7.813, iou_loss: 2.607, l1_loss: 0.000, conf_loss: 2.924, cls_loss: 2.281, lr: 3.708e-07, size: 768, ETA: 6 days, 7:53:54
2021-11-02 21:27:40 | INFO     | yolox.core.trainer:250 - epoch: 1/80, iter: 220/3672, mem: 9600Mb, iter_time: 1.686s, data_time: 1.048s, total_loss: 8.038, iou_loss: 3.370, l1_loss: 0.000, conf_loss: 2.905, cls_loss: 1.762, lr: 4.487e-07, size: 800, ETA: 6 days, 6:34:33
2021-11-02 21:28:15 | INFO     | yolox.core.trainer:250 - epoch: 1/80, iter: 240/3672, mem: 9600Mb, iter_time: 1.755s, data_time: 1.005s, total_loss: 7.694, iou_loss: 2.848, l1_loss: 0.000, conf_loss: 3.109, cls_loss: 1.737, lr: 5.340e-07, size: 736, ETA: 6 days, 5:56:45
2021-11-02 21:28:48 | INFO     | yolox.core.trainer:250 - epoch: 1/80, iter: 260/3672, mem: 9600Mb, iter_time: 1.657s, data_time: 0.970s, total_loss: 7.517, iou_loss: 2.833, l1_loss: 0.000, conf_loss: 3.048, cls_loss: 1.636, lr: 6.267e-07, size: 768, ETA: 6 days, 4:47:29
2021-11-02 21:29:16 | INFO     | yolox.core.trainer:250 - epoch: 1/80, iter: 280/3672, mem: 9600Mb, iter_time: 1.421s, data_time: 0.888s, total_loss: 7.343, iou_loss: 2.724, l1_loss: 0.000, conf_loss: 2.876, cls_loss: 1.743, lr: 7.268e-07, size: 768, ETA: 6 days, 2:25:49
2021-11-02 21:29:54 | INFO     | yolox.core.trainer:250 - epoch: 1/80, iter: 300/3672, mem: 9600Mb, iter_time: 1.881s, data_time: 1.031s, total_loss: 7.612, iou_loss: 3.020, l1_loss: 0.000, conf_loss: 3.245, cls_loss: 1.348, lr: 8.343e-07, size: 736, ETA: 6 days, 2:52:48
2021-11-02 21:30:34 | INFO     | yolox.core.trainer:250 - epoch: 1/80, iter: 320/3672, mem: 9600Mb, iter_time: 1.975s, data_time: 1.074s, total_loss: 6.557, iou_loss: 2.583, l1_loss: 0.000, conf_loss: 2.789, cls_loss: 1.186, lr: 9.493e-07, size: 1024, ETA: 6 days, 3:45:15
2021-11-02 21:31:11 | INFO     | yolox.core.trainer:250 - epoch: 1/80, iter: 340/3672, mem: 9600Mb, iter_time: 1.876s, data_time: 1.092s, total_loss: 6.434, iou_loss: 2.880, l1_loss: 0.000, conf_loss: 2.601, cls_loss: 0.952, lr: 1.072e-06, size: 992, ETA: 6 days, 4:02:43
2021-11-02 21:31:47 | INFO     | yolox.core.trainer:250 - epoch: 1/80, iter: 360/3672, mem: 9600Mb, iter_time: 1.785s, data_time: 1.044s, total_loss: 7.719, iou_loss: 3.052, l1_loss: 0.000, conf_loss: 3.379, cls_loss: 1.288, lr: 1.201e-06, size: 960, ETA: 6 days, 3:53:39
2021-11-02 21:32:22 | INFO     | yolox.core.trainer:250 - epoch: 1/80, iter: 380/3672, mem: 9600Mb, iter_time: 1.768s, data_time: 0.969s, total_loss: 8.004, iou_loss: 3.042, l1_loss: 0.000, conf_loss: 4.052, cls_loss: 0.910, lr: 1.339e-06, size: 576, ETA: 6 days, 3:41:08
2021-11-02 21:32:58 | INFO     | yolox.core.trainer:250 - epoch: 1/80, iter: 400/3672, mem: 9600Mb, iter_time: 1.778s, data_time: 1.103s, total_loss: 6.296, iou_loss: 2.513, l1_loss: 0.000, conf_loss: 2.812, cls_loss: 0.971, lr: 1.483e-06, size: 736, ETA: 6 days, 3:32:10
2021-11-02 21:33:34 | INFO     | yolox.core.trainer:250 - epoch: 1/80, iter: 420/3672, mem: 9600Mb, iter_time: 1.829s, data_time: 1.039s, total_loss: 7.636, iou_loss: 3.079, l1_loss: 0.000, conf_loss: 3.628, cls_loss: 0.929, lr: 1.635e-06, size: 960, ETA: 6 days, 3:35:56

Four1996 commented 2 years ago

I have now solved it by using a smaller input size.
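
The comment above does not say which input size was used, so the following is only a rough sketch of why a smaller size helps. It compares the pixel count of the (800, 1440) input_size from the log with a hypothetical smaller size of (608, 1088), which is an assumption chosen for illustration. It also lists the multi-scale heights implied by random_size (18, 32), assuming the range is multiplied by a stride of 32, which is consistent with the 576 to 1024 values in the "size:" field of the log. For a convolutional detector the per-image compute scales roughly with the pixel count, so the ratio is only a first-order guide.

```python
# Baseline comes from the exp table in the log; the candidate is a hypothetical choice.
BASELINE = (800, 1440)    # input_size (height, width) from the log
CANDIDATE = (608, 1088)   # assumed smaller size, both sides divisible by 32


def pixel_ratio(candidate, baseline):
    """Return the candidate pixel count as a fraction of the baseline pixel count."""
    return (candidate[0] * candidate[1]) / (baseline[0] * baseline[1])


def multiscale_heights(random_size, stride=32):
    """Heights implied by a (min, max) random_size range.

    Assumes each integer in the range is multiplied by the stride, which
    matches the 'size:' values printed in the log.
    """
    low, high = random_size
    return [stride * k for k in range(low, high + 1)]


ratio = pixel_ratio(CANDIDATE, BASELINE)
heights = multiscale_heights((18, 32))
print(f"pixels per image: {ratio:.2f}x of baseline")
print("multi-scale heights for random_size (18, 32):", heights)
```

In the experiment file, the size would be changed alongside test_size and random_size so that the multi-scale range stays centered on the new input size; the exact values to use remain a judgment call.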

ret-1 commented 2 years ago

Hello @Four1996, I'm training on a custom dataset, and it looks like it will take nearly a week to complete 80 epochs. I only modified the annotation file path in yolox_x_mix_det.py for my experiment, and the training data consists of 42,750 images at 1920x1080. How did you set the input size? The training time is just too long for me.

My device: Quadro RTX 8000 x8

ret-1 commented 2 years ago

In my log the data_time can be as high as 30s, and I guess that may be the reason. I'm trying to solve it.
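
One way to check this kind of suspicion is to compare data_time with iter_time directly. The sketch below parses two abbreviated lines copied from the training log earlier in this thread and reports the share of each iteration spent on data, assuming data_time is the part of iter_time spent waiting for the next batch. It also reproduces the roughly six-day ETA from the iteration count, using 1.8 s per iteration as an approximate average read off that log. The function names are illustrative and not part of YOLOX or ByteTrack.

```python
import re

# Abbreviated copies of two lines from the training log earlier in this thread.
LOG_LINES = [
    "epoch: 1/80, iter: 20/3672, mem: 9013Mb, iter_time: 1.614s, data_time: 0.850s",
    "epoch: 1/80, iter: 40/3672, mem: 9600Mb, iter_time: 2.322s, data_time: 1.259s",
]

PATTERN = re.compile(r"iter_time: ([\d.]+)s, data_time: ([\d.]+)s")


def data_time_fraction(lines):
    """Return total data_time divided by total iter_time over matching lines."""
    iter_total = 0.0
    data_total = 0.0
    for line in lines:
        match = PATTERN.search(line)
        if match:
            iter_total += float(match.group(1))
            data_total += float(match.group(2))
    if iter_total == 0:
        raise ValueError("no iter_time/data_time fields found")
    return data_total / iter_total


def estimate_total_days(iters_per_epoch, epochs, seconds_per_iter):
    """Rough wall-clock estimate that ignores evaluation and checkpointing."""
    return iters_per_epoch * epochs * seconds_per_iter / 86400


fraction = data_time_fraction(LOG_LINES)
total_days = estimate_total_days(3672, 80, 1.8)
print(f"share of iteration time spent on data: {fraction:.0%}")
print(f"estimated run length: {total_days:.1f} days")
```

A share above roughly one half, as in that log, suggests the GPUs are often idle while batches are prepared, so storage speed and the number of data-loading workers are worth checking before the model itself.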

ret-1 commented 2 years ago

After I moved the data to an SSD, the problem was solved.
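
To compare two storage locations before moving a whole dataset, a quick read test can help. The sketch below is a minimal, hypothetical helper (not part of ByteTrack) that reads a list of files once and reports throughput in MB/s. It runs on throwaway temporary files here; in practice the list would point at a sample of real training images on each disk. Results are only indicative, because files that were read recently may be served from the operating system's cache rather than from the disk.

```python
import os
import tempfile
import time


def read_throughput_mb_s(paths, chunk_size=1 << 20):
    """Read every file once and return the observed throughput in MB/s."""
    total_bytes = 0
    start = time.perf_counter()
    for path in paths:
        with open(path, "rb") as handle:
            while True:
                chunk = handle.read(chunk_size)
                if not chunk:
                    break
                total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    # Guard against a zero elapsed time on very small inputs.
    return total_bytes / (1024 * 1024) / max(elapsed, 1e-9)


# Demo on throwaway files; replace `paths` with real image paths to test a disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    paths = []
    for index in range(4):
        path = os.path.join(tmp_dir, f"sample_{index}.bin")
        with open(path, "wb") as handle:
            handle.write(os.urandom(256 * 1024))
        paths.append(path)
    speed = read_throughput_mb_s(paths)
    print(f"read throughput: {speed:.1f} MB/s")
```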