tinyvision / DAMO-YOLO

DAMO-YOLO: a fast and accurate object detection method with some new techs, including NAS backbones, efficient RepGFPN, ZeroHead, AlignedOTA, and distillation enhancement.
Apache License 2.0

[Bug]: training on custom data does not start #107

Closed KotovNikitaStudent closed 1 year ago

KotovNikitaStudent commented 1 year ago

Before Reporting

Search before reporting

OS

Ubuntu

Device

NVIDIA RTX A5000

CUDA version

11.6

TensorRT version

No response

Python version

3.8

PyTorch version

1.11.0a0+bfe5ad2

torchvision version

latest

Describe the bug

2023-05-17 19:26:19.228 | INFO | damo.apis.detector_trainer:__init__:114 - args info: Namespace(config_file='configs/damoyolo_tinynasL20_T.py', local_rank=0, opts=[], tea_ckpt=None, tea_config=None)
2023-05-17 19:26:19.233 | INFO | damo.apis.detector_trainer:__init__:115 - cfg value:
╒═════════╤══════════════════════════════════════════════════════════════
│ keys    │ values
╞═════════╪══════════════════════════════════════════════════════════════
│ model   │ {'backbone': {'act': 'relu',
│         │               'name': 'TinyNAS_res',
│         │               'net_structure_str': "[ {'class': 'ConvKXBNRELU', 'in': 3, 'k': "
│         │                                    "3, 'nbitsA': 8, 'nbitsW': 8, 'out': 24, "
│         │                                    "'s': 1},\n"
│         │                                    " { 'L': 2,\n"
│         │                                    " 'btn': 24,\n"
│         │                                    " 'class': 'SuperResConvK1KX',\n"
│         │                                    " 'in': 24,\n"
│         │                                    " 'inner_class': 'ResConvK1KX',\n"
│         │                                    " 'k': 3,\n"
│         │                                    " 'nbitsA': [8, 8, 8, 8],\n"
│         │                                    " 'nbitsW': [8, 8, 8, 8],\n"
│         │                                    " 'out': 64,\n"
│         │                                    " 's': 2},\n"
│         │                                    " { 'L': 2,\n"
│         │                                    " 'btn': 64,\n"
│         │                                    " 'class': 'SuperResConvK1KX',\n"
│         │                                    " 'in': 64,\n"
│         │                                    " 'inner_class': 'ResConvK1KX',\n"
│         │                                    " 'k': 3,\n"
│         │                                    " 'nbitsA': [8, 8, 8, 8],\n"
│         │                                    " 'nbitsW': [8, 8, 8, 8],\n"
│         │                                    " 'out': 96,\n"
│         │                                    " 's': 2},\n"
│         │                                    " { 'L': 2,\n"
│         │                                    " 'btn': 96,\n"
│         │                                    " 'class': 'SuperResConvK1KX',\n"
│         │                                    " 'in': 96,\n"
│         │                                    " 'inner_class': 'ResConvK1KX',\n"
│         │                                    " 'k': 3,\n"
│         │                                    " 'nbitsA': [8, 8, 8, 8],\n"
│         │                                    " 'nbitsW': [8, 8, 8, 8],\n"
│         │                                    " 'out': 192,\n"
│         │                                    " 's': 2},\n"
│         │                                    " { 'L': 2,\n"
│         │                                    " 'btn': 152,\n"
│         │                                    " 'class': 'SuperResConvK1KX',\n"
│         │                                    " 'in': 192,\n"
│         │                                    " 'inner_class': 'ResConvK1KX',\n"
│         │                                    " 'k': 3,\n"
│         │                                    " 'nbitsA': [8, 8, 8, 8],\n"
│         │                                    " 'nbitsW': [8, 8, 8, 8],\n"
│         │                                    " 'out': 192,\n"
│         │                                    " 's': 1},\n"
│         │                                    " { 'L': 1,\n"
│         │                                    " 'btn': 192,\n"
│         │                                    " 'class': 'SuperResConvK1KX',\n"
│         │                                    " 'in': 192,\n"
│         │                                    " 'inner_class': 'ResConvK1KX',\n"
│         │                                    " 'k': 3,\n"
│         │                                    " 'nbitsA': [8, 8],\n"
│         │                                    " 'nbitsW': [8, 8],\n"
│         │                                    " 'out': 384,\n"
│         │                                    " 's': 2}]\n",
│         │               'out_indices': [2, 4, 5],
│         │               'reparam': True,
│         │               'use_focus': True,
│         │               'with_spp': True},
│         │  'head': {'act': 'silu',
│         │           'in_channels': [64, 128, 256],
│         │           'legacy': False,
│         │           'name': 'ZeroHead',
│         │           'nms_conf_thre': 0.05,
│         │           'nms_iou_thre': 0.7,
│         │           'num_classes': 1,
│         │           'reg_max': 16,
│         │           'stacked_convs': 0},
│         │  'neck': {'act': 'relu',
│         │           'block_name': 'BasicBlock_3x3_Reverse',
│         │           'depth': 1.0,
│         │           'hidden_ratio': 1.0,
│         │           'in_channels': [96, 192, 384],
│         │           'name': 'GiraffeNeckV2',
│         │           'out_channels': [64, 128, 256],
│         │           'spp': False}}
├─────────┼──────────────────────────────────────────────────────────────
│ train   │ {'augment': {'mosaic_mixup': {'degrees': 10.0,
│         │                               'keep_ratio': False,
│         │                               'mixup_prob': 0.15,
│         │                               'mixup_scale': [0.5, 1.5],
│         │                               'mosaic_prob': 1.0,
│         │                               'mosaic_scale': [0.1, 2.0],
│         │                               'mosaic_size': [640, 640],
│         │                               'shear': 0.2,
│         │                               'translate': 0.2},
│         │              'transform': {'autoaug_dict': {'autoaug_params': [6, 9, 5, 3, 3, 4,
│         │                                                                2, 4, 4, 4, 5, 2,
│         │                                                                4, 1, 4, 2, 6, 4,
│         │                                                                2, 2, 2, 6, 2, 2,
│         │                                                                2, 0, 5, 1, 3, 0,
│         │                                                                8, 5, 2, 8, 7, 5,
│         │                                                                1, 3, 3, 3],
│         │                                             'box_prob': 0.3,
│         │                                             'num_subpolicies': 5,
│         │                                             'scale_splits': [2048, 10240,
│         │                                                              51200]},
│         │                            'flip_prob': 0.5,
│         │                            'image_max_range': [640, 640],
│         │                            'image_mean': [0.0, 0.0, 0.0],
│         │                            'image_std': [1.0, 1.0, 1.0],
│         │                            'keep_ratio': False}},
│         │  'base_lr_per_img': 0.00015625,
│         │  'batch_size': 16,
│         │  'ema': True,
│         │  'ema_momentum': 0.9998,
│         │  'finetune_path': '/home/n.kotov1/colonoscopy/DAMO-YOLO/weights/damoyolo_tinynasL20_T_420.pth',
│         │  'min_lr_ratio': 0.05,
│         │  'momentum': 0.9,
│         │  'no_aug_epochs': 16,
│         │  'optimizer': {'lr': 0.04,
│         │                'momentum': 0.9,
│         │                'name': 'SGD',
│         │                'nesterov': True,
│         │                'weight_decay': 0.0005},
│         │  'resume_path': None,
│         │  'total_epochs': 200,
│         │  'warmup_epochs': 5,
│         │  'warmup_start_lr': 0,
│         │  'weight_decay': 0.0005}
├─────────┼──────────────────────────────────────────────────────────────
│ test    │ {'augment': {'transform': {'flip_prob': 0.0,
│         │                            'image_max_range': [640, 640],
│         │                            'image_mean': [0.0, 0.0, 0.0],
│         │                            'image_std': [1.0, 1.0, 1.0],
│         │                            'keep_ratio': False}},
│         │  'batch_size': 128}
├─────────┼──────────────────────────────────────────────────────────────
│ dataset │ {'aspect_ratio_grouping': False,
│         │  'class_names': ['polyp'],
│         │  'data_dir': None,
│         │  'paths_catalog': '/home/n.kotov1/colonoscopy/DAMO-YOLO/damo/config/paths_catalog.py',
│         │  'train_ann': ['sample_train_coco'],
│         │  'val_ann': ['sample_val_coco']}
├─────────┼──────────────────────────────────────────────────────────────
│ miscs   │ {'ckpt_interval_epochs': 10,
│         │  'eval_interval_epochs': 10,
│         │  'exp_name': 'damoyolo_tinynasL20_T',
│         │  'num_workers': 4,
│         │  'output_dir': './workdirs',
│         │  'print_interval_iters': 10,
│         │  'seed': 1234}
╘═════════╧══════════════════════════════════════════════════════════════
2023-05-17 19:26:21.009 | INFO | damo.apis.detector_trainer:__init__:120 - model:
2023-05-17 19:26:21.055 | INFO | damo.detectors.detector:load_pretrain_detector:42 - Finetune from /home/n.kotov1/colonoscopy/DAMO-YOLO/weights/damoyolo_tinynasL20_T_420.pth................
2023-05-17 19:26:21.349 | INFO | damo.apis.detector_trainer:__init__:158 - Enable ema model! Ema model will be evaluated and saved.
2023-05-17 19:26:21.637 | INFO | importlib._bootstrap:_call_with_frames_removed:219 - NOTE! Installing ujson may make loading annotations faster.
2023-05-17 19:26:22.083 | INFO | torchvision.datasets.coco:__init__:36 - loading annotations into memory...
2023-05-17 19:26:22.086 | INFO | torchvision.datasets.coco:__init__:36 - Done (t=0.00s)
2023-05-17 19:26:22.086 | INFO | pycocotools.coco:__init__:93 - creating index...
2023-05-17 19:26:22.086 | INFO | pycocotools.coco:__init__:93 - index created!
2023-05-17 19:26:22.087 | INFO | torchvision.datasets.coco:__init__:36 - loading annotations into memory...
2023-05-17 19:26:22.087 | INFO | torchvision.datasets.coco:__init__:36 - Done (t=0.00s)
2023-05-17 19:26:22.087 | INFO | pycocotools.coco:__init__:93 - creating index...
2023-05-17 19:26:22.087 | INFO | pycocotools.coco:__init__:93 - index created!
2023-05-17 19:29:21.082 | INFO | damo.apis.detector_trainer:train:266 - Model Summary:
backbone's params(M): 2.36, flops(G): 8.74, latency(ms): 82.518
neck's params(M): 5.92, flops(G): 8.57, latency(ms): 178.938
head's params(M): 0.28, flops(G): 0.89, latency(ms): 38.145
total latency(ms): 345.674, total flops(G): 18.20, total params(M): 8.56

2023-05-17 19:29:21.425 | ERROR | __main__:<module>:66 - An error has been caught in function '<module>', process 'MainProcess' (183951), thread 'MainThread' (139947174573888):
Traceback (most recent call last):

  File "tools/train.py", line 66, in <module>
    main()
    └ <function main at 0x7f4739bb8280>

  File "tools/train.py", line 62, in main
    trainer.train(local_rank=args.local_rank)
    │       │                │    └ 0
    │       │                └ Namespace(config_file='configs/damoyolo_tinynasL20_T.py', local_rank=0, opts=[], tea_ckpt=None, tea_config=None)
    │       └ <function Trainer.train at 0x7f472bffe670>
    └ <damo.apis.detector_trainer.Trainer object at 0x7f472bfb3850>

  File "/home/n.kotov1/colonoscopy/DAMO-YOLO/damo/apis/detector_trainer.py", line 270, in train
    self.model = build_ddp_model(self.model, local_rank)
    │    │       │               │    │      └ 0
    │    │       │               │    └ Detector(
    │    │       │               │        (backbone): TinyNAS(
    │    │       │               │          (block_list): ModuleList(
    │    │       │               │            (0): Focus(
    │    │       │               │              (conv): ConvBNAct(
    │    │       │               │                (conv):...
    │    │       │               └ <damo.apis.detector_trainer.Trainer object at 0x7f472bfb3850>
    │    │       └ <function build_ddp_model at 0x7f472bffe040>
    │    └ Detector(
    │        (backbone): TinyNAS(
    │          (block_list): ModuleList(
    │            (0): Focus(
    │              (conv): ConvBNAct(
    │                (conv):...
    └ <damo.apis.detector_trainer.Trainer object at 0x7f472bfb3850>

  File "/home/n.kotov1/colonoscopy/DAMO-YOLO/damo/detectors/detector.py", line 81, in build_ddp_model
    model = DDP(model,
            │   └ Detector(
            │       (backbone): TinyNAS(
            │         (block_list): ModuleList(
            │           (0): Focus(
            │             (conv): ConvBNAct(
            │               (conv):...
            └ <class 'torch.nn.parallel.distributed.DistributedDataParallel'>

  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 612, in __init__
    dist._verify_params_across_processes(self.process_group, parameters)
    │    │                               │    │              └ [Parameter containing:
    │    │                               │    │                tensor([[[[ 3.7181e-03, 1.0254e-02, -1.5696e-03],
    │    │                               │    │                [-4.1219e-03, 3.4061e-02, 1.1072e-02],
    │    │                               │    │                ...
    │    │                               │    └ <torch._C._distributed_c10d.ProcessGroupNCCL object at 0x7f472bfbf630>
    │    │                               └ DistributedDataParallel(
    │    │                                   (module): Detector(
    │    │                                     (backbone): TinyNAS(
    │    │                                       (block_list): ModuleList(
    │    │                                         (0): Focus(
    │    │                                 ...
    │    └ <built-in method _verify_params_across_processes of PyCapsule object at 0x7f473deb7540>
    └ <module 'torch.distributed' from '/opt/conda/lib/python3.8/site-packages/torch/distributed/__init__.py'>

RuntimeError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1227, invalid usage, NCCL version 21.1.4
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

I'm trying to train a model on custom data, but I get the error above as soon as I start training. The device name is passed through correctly. I start training with the command "CUDA_LAUNCH_BLOCKING=1 python -m torch.distributed.launch --nproc_per_node=8 tools/train.py -f configs/damoyolo_tinynasL20_T.py --local_rank 0".

To Reproduce

  1. Customize the config for a single-class model (any single class will do).
  2. Assign 1 device for training.
  3. Start training with the command "CUDA_LAUNCH_BLOCKING=1 python -m torch.distributed.launch --nproc_per_node=8 tools/train.py -f configs/damoyolo_tinynasL20_T.py --local_rank 0".
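
One possible reading of why the command in step 3 misbehaves, sketched in Python under stated assumptions. The launcher is assumed to pass each worker its own --local_rank=N ahead of the user's script arguments, and the parser below is only a hypothetical stand-in for the one in tools/train.py. With argparse, the last occurrence of a repeated option wins, so a trailing "--local_rank 0" would make every worker believe it is rank 0:

```python
import argparse


def parse_local_rank(argv):
    """Hypothetical stand-in for the training script's argument parser."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-f", "--config_file")
    parser.add_argument("--local_rank", type=int, default=0)
    return parser.parse_args(argv).local_rank


def simulated_ranks(nproc_per_node, user_args):
    """Return the local rank each simulated worker ends up with.

    Assumption: the launcher prepends its own --local_rank=N to the
    arguments the user typed after the script name.
    """
    return [
        parse_local_rank([f"--local_rank={n}"] + user_args)
        for n in range(nproc_per_node)
    ]


# User args as in the report: a manual --local_rank 0 at the end.
overridden = simulated_ranks(8, ["-f", "cfg.py", "--local_rank", "0"])
# Same launch without the manual flag.
clean = simulated_ranks(8, ["-f", "cfg.py"])

print(overridden)  # every worker parses the trailing value
print(clean)       # each worker keeps the rank assigned by the launcher
```

If this reading is right, all eight workers would try to use the same GPU, which is one of the situations NCCL reports as invalid usage. This is a sketch of the argument-parsing behaviour only and does not run any distributed code.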

Hyper-parameters/Configs

No response

Logs

No response

Screenshots

No response

Additional

No response

aminakmim commented 1 year ago

@KotovNikitaStudent How did you solve it?

KotovNikitaStudent commented 1 year ago

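The launch command had to be written differently. This one works: "CUDA_LAUNCH_BLOCKING=1 python -m torch.distributed.launch --nproc_per_node=1 --node_rank=0 tools/train.py -f configs/damoyolo_tinynasL20_T.py". Compared with the original command, it starts one process per node instead of eight, which matches the single assigned device, adds --node_rank=0, and no longer passes --local_rank by hand.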
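
A minimal sketch of a guard that would catch this mismatch before launching. The helper name build_launch_command and its parameters are hypothetical and not part of DAMO-YOLO; in practice the GPU count could be taken from torch.cuda.device_count(), but it is passed in explicitly here so the example runs without PyTorch:

```python
def build_launch_command(config_file, nproc_per_node, visible_gpus):
    """Build the launch argv, refusing more processes than visible GPUs.

    Hypothetical helper for illustration only.
    """
    if nproc_per_node < 1:
        raise ValueError("nproc_per_node must be at least 1")
    if nproc_per_node > visible_gpus:
        raise ValueError(
            f"nproc_per_node={nproc_per_node} exceeds the "
            f"{visible_gpus} visible GPU(s)"
        )
    # No manual --local_rank here: the launcher assigns it per process.
    return [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={nproc_per_node}",
        "--node_rank=0",
        "tools/train.py",
        "-f", config_file,
    ]


cmd = build_launch_command(
    "configs/damoyolo_tinynasL20_T.py", nproc_per_node=1, visible_gpus=1
)
print(" ".join(cmd))

try:
    build_launch_command(
        "configs/damoyolo_tinynasL20_T.py", nproc_per_node=8, visible_gpus=1
    )
except ValueError as err:
    print("rejected:", err)
```

The first call reproduces the working command from this comment (minus the CUDA_LAUNCH_BLOCKING=1 environment variable); the second shows the original eight-process request being rejected on a one-GPU machine.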