huawei-noah / Efficient-Computing

Efficient computing methods developed by Huawei Noah's Ark Lab

Does Gold-YOLO support single GPU during training phase? #68

Closed jacky22043 closed 4 months ago

lose4578 commented 10 months ago

Yes, the current code supports it. Change SyncBN to BN in the model config, then start single-GPU training with the following command:

python tools/train.py --batch 32 --conf configs/gold_yolo-s.py --data data/dataset.yaml --fuse_ab --device 0

jacky22043 commented 10 months ago

I followed your steps: I changed the config and ran the same command as you, but I still get an error in the Trainer.

Training start...
Epoch  iou_loss  dfl_loss  cls_loss
  0%|          | 0/41 [00:00<?, ?it/s]
ERROR in training steps: Default process group has not been initialized, please make sure to call init_process_group.
  0%|          | 0/41 [00:01<?, ?it/s]
ERROR in training steps.
ERROR in training loop or eval/save model.
Traceback (most recent call last):
  File "tools/train.py", line 129, in <module>
    main(args)
  File "tools/train.py", line 119, in main
    trainer.train()
  File "D:\model\Efficient-Computing-master\Detection\Gold-YOLO\yolov6\core\engine.py", line 109, in train
    self.train_in_loop(self.epoch)
  File "D:\model\Efficient-Computing-master\Detection\Gold-YOLO\yolov6\core\engine.py", line 127, in train_in_loop
    self.print_details()
  File "D:\model\Efficient-Computing-master\Detection\Gold-YOLO\yolov6\core\engine.py", line 339, in print_details
    self.mean_loss = (self.mean_loss * self.step + self.loss_items) / (self.step + 1)
AttributeError: 'Trainer' object has no attribute 'loss_items'
lose4578 commented 10 months ago

Something may still be misconfigured. Remove --use_syncbn from the training command, and change your model config like this:

# GoldYOLO-s model

use_checkpoint = False

model = dict(
        type='GoldYOLO-s',
        pretrained=None,
        depth_multiple=0.33,
        width_multiple=0.50,
        backbone=dict(
                type='EfficientRep',
                num_repeats=[1, 6, 12, 18, 6],
                out_channels=[64, 128, 256, 512, 1024],
                fuse_P2=True,
                cspsppf=True
        ),
        neck=dict(
                type='RepGDNeck',
                num_repeats=[12, 12, 12, 12],
                out_channels=[256, 128, 128, 256, 256, 512],
                extra_cfg=dict(
                        norm_cfg=dict(type='BN', requires_grad=True),   # Here, you should change the SyncBN to BN
                        depths=2,
                        fusion_in=960,
                        ppa_in=704,
                        fusion_act=dict(type='ReLU6'),
                        fuse_block_num=3,
                        embed_dim_p=128,
                        embed_dim_n=704,
                        key_dim=8,
                        num_heads=4,
                        mlp_ratios=1,
                        attn_ratios=2,
                        c2t_stride=2,
                        drop_path_rate=0.1,
                        trans_channels=[128, 64, 128, 256],
                        pool_mode='torch'
                )
        ),
        head=dict(
                type='EffiDeHead',
                in_channels=[128, 256, 512],
                num_layers=3,
                begin_indices=24,
                anchors=3,
                anchors_init=[[10, 13, 19, 19, 33, 23],
                              [30, 61, 59, 59, 59, 119],
                              [116, 90, 185, 185, 373, 326]],
                out_indices=[17, 20, 23],
                strides=[8, 16, 32],
                atss_warmup_epoch=0,
                iou_type='giou',
                use_dfl=True,  # set to True if you want to further train with distillation
                reg_max=16,  # set to 16 if you want to further train with distillation
                distill_weight={
                    'class': 1.0,
                    'dfl'  : 1.0,
                },
        )
)

solver = dict(
        optim='SGD',
        lr_scheduler='Cosine',
        lr0=0.01,
        lrf=0.01,
        momentum=0.937,
        weight_decay=0.0005,
        warmup_epochs=3.0,
        warmup_momentum=0.8,
        warmup_bias_lr=0.1
)

data_aug = dict(
        hsv_h=0.015,
        hsv_s=0.7,
        hsv_v=0.4,
        degrees=0.0,
        translate=0.1,
        scale=0.5,
        shear=0.0,
        flipud=0.0,
        fliplr=0.5,
        mosaic=1.0,
        mixup=0.0,
)
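If a config carries the norm type in more than one place, the SyncBN to BN change can also be applied programmatically. This is only a minimal sketch, assuming the config has already been loaded as a plain nested dict; `replace_norm_type` is a hypothetical helper and not part of the Gold-YOLO code base.

```python
def replace_norm_type(cfg, old="SyncBN", new="BN"):
    """Return a copy of a nested config in which every dict whose
    'type' equals `old` has it renamed to `new`."""
    if isinstance(cfg, dict):
        out = {key: replace_norm_type(value, old, new) for key, value in cfg.items()}
        if out.get("type") == old:
            out["type"] = new
        return out
    if isinstance(cfg, (list, tuple)):
        # Rebuild the same container type with patched elements.
        return type(cfg)(replace_norm_type(value, old, new) for value in cfg)
    return cfg


# Trimmed-down neck config in the same shape as the one above.
neck = dict(
    type="RepGDNeck",
    extra_cfg=dict(
        norm_cfg=dict(type="SyncBN", requires_grad=True),
        depths=2,
    ),
)

patched = replace_norm_type(neck)
print(patched["extra_cfg"]["norm_cfg"])
```

The helper returns a new dict and leaves the input untouched, so the original config can still be used for multi-GPU runs.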
IronmanVsThanos commented 9 months ago

When I use gold-yolo-s, it runs normally, but when I use gold-yolo-n, the error still appears:

Traceback (most recent call last):
  File "G:\Code\Gold-yolo\tools\train.py", line 132, in <module>
    main(args)
  File "G:\Code\Gold-yolo\tools\train.py", line 122, in main
    trainer.train()
  File "G:\Code\Gold-yolo\yolov6\core\engine.py", line 111, in train
    self.train_in_loop(self.epoch)
  File "G:\Code\Gold-yolo\yolov6\core\engine.py", line 129, in train_in_loop
    self.print_details()
  File "G:\Code\Gold-yolo\yolov6\core\engine.py", line 341, in print_details
    self.mean_loss = (self.mean_loss * self.step + self.loss_items) / (self.step + 1)
AttributeError: 'Trainer' object has no attribute 'loss_items'

Does this mean I need to modify the gold-yolo-n config as well? If so, how should I modify it?

lose4578 commented 9 months ago

Yes, you should change SyncBN to BN in the gold-yolo-n config, just as in the gold-yolo-s config.

miaonaipeng commented 9 months ago

This is my training command. I selected GPU 2, but the run reports out of memory on GPU 0.

python tools/train.py --batch 16 --conf configs/gold_yolo-s.py --data data/infrared_guanggang_yolo.yaml --fuse_ab --device 2

Training start...

 Epoch  iou_loss  dfl_loss  cls_loss

  0%| | 0/25 [00:00<?, ?it/s]
ERROR in training steps: CUDA out of memory. Tried to allocate 50.00 MiB. GPU 0 has a total capacty of 47.54 GiB of which 19.75 MiB is free. Process 2153272 has 46.27 GiB memory in use. Including non-PyTorch memory, this process has 1.23 GiB memory in use. Of the allocated memory 931.80 MiB is allocated by PyTorch, and 14.20 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
  0%| | 0/25 [00:00<?, ?it/s]
ERROR in training steps.
ERROR in training loop or eval/save model.
Traceback (most recent call last):
  File "tools/train.py", line 129, in <module>
    main(args)
  File "tools/train.py", line 119, in main
    trainer.train()
  File "/home/mnp21/project/Efficient-Computing/Detection/Gold-YOLO/yolov6/core/engine.py", line 109, in train
    self.train_in_loop(self.epoch)
  File "/home/mnp21/project/Efficient-Computing/Detection/Gold-YOLO/yolov6/core/engine.py", line 127, in train_in_loop
    self.print_details()
  File "/home/mnp21/project/Efficient-Computing/Detection/Gold-YOLO/yolov6/core/engine.py", line 339, in print_details
    self.mean_loss = (self.mean_loss * self.step + self.loss_items) / (self.step + 1)
AttributeError: 'Trainer' object has no attribute 'loss_items'

3232731490 commented 9 months ago

Could insufficient GPU memory be causing this problem?

lose4578 commented 9 months ago

You can reduce the batch size for training.
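Note that the log above reports the failure on GPU 0, where another process (2153272) already holds 46.27 GiB, even though --device 2 was passed, so a smaller batch may not help if the run keeps landing on GPU 0. One general way to keep a process off a busy card is the CUDA_VISIBLE_DEVICES environment variable. The following is only a sketch, assuming the variable is set before the training process starts; `env_with_single_gpu` and `run_on_gpu` are hypothetical helpers, and how tools/train.py itself interprets --device has not been checked here, so the training arguments are left to the caller.

```python
import os
import subprocess
import sys


def env_with_single_gpu(physical_gpu, base_env=None):
    """Copy an environment and expose only one physical GPU to CUDA."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(physical_gpu)
    return env


def run_on_gpu(physical_gpu, train_args):
    """Run tools/train.py in a child process that can only see one GPU.

    `train_args` is the usual argument list for tools/train.py; nothing
    about its --device handling is assumed here.
    """
    cmd = [sys.executable, "tools/train.py", *train_args]
    return subprocess.run(cmd, env=env_with_single_gpu(physical_gpu), check=True)


# Only builds the environment; no training is launched here.
print(env_with_single_gpu(2, base_env={})["CUDA_VISIBLE_DEVICES"])
```

Inside the child process the selected card is the only visible device, so PyTorch addresses it as cuda:0.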

lose4578 commented 9 months ago

Not always. Check your error log for any errors related to insufficient GPU memory.
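In the logs posted in this thread, the final AttributeError about loss_items is preceded by an earlier line starting with "ERROR in training steps:", and that line carries the underlying message (an uninitialized process group in one case, CUDA out of memory in another). A minimal sketch of checking a saved log for that line; `first_training_error` and `looks_like_oom` are hypothetical helpers, and the marker string is taken from the logs above rather than from the Gold-YOLO source.

```python
MARKER = "ERROR in training steps:"


def first_training_error(log_text):
    """Return the message after the first 'ERROR in training steps:' marker,
    or None if the marker does not appear in the log."""
    for line in log_text.splitlines():
        idx = line.find(MARKER)
        if idx != -1:
            return line[idx + len(MARKER):].strip()
    return None


def looks_like_oom(message):
    """True if the message mentions a CUDA out-of-memory condition."""
    return message is not None and "out of memory" in message.lower()


# Shortened copy of the first log in this thread.
sample = """Training start...
  0%|          | 0/41 [00:00<?, ?it/s] ERROR in training steps: Default process group has not been initialized, please make sure to call init_process_group.
ERROR in training steps.
AttributeError: 'Trainer' object has no attribute 'loss_items'
"""

message = first_training_error(sample)
print(message)
print("out of memory:", looks_like_oom(message))
```

For this sample the underlying message concerns the process group, so memory is not the cause; a log whose first error line contains "CUDA out of memory" would point to memory instead.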