mindspore-lab / mindyolo

A toolbox of yolo models and algorithms based on MindSpore
Apache License 2.0

Training yolov8-seg on GPU with MindSpore 2.0: loss does not converge. Why? Is GPU training not supported? #294

Closed wcpcp closed 1 month ago

wcpcp commented 4 months ago

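I am training yolov8-seg on GPU with MindSpore 2.0, and the loss does not converge. I have already checked that the inputs and labels are normal, but the loss still does not converge. 93382fe7c1fd490f3a9e6df10601a6d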

wcpcp commented 4 months ago

2024-05-22 03:30:39,391 [INFO] parse_args:
2024-05-22 03:30:39,391 [INFO] task segment
2024-05-22 03:30:39,391 [INFO] device_target GPU
2024-05-22 03:30:39,391 [INFO] save_dir ./runs/2024.05.22-03.30.39
2024-05-22 03:30:39,391 [INFO] log_level INFO
2024-05-22 03:30:39,391 [INFO] is_parallel False
2024-05-22 03:30:39,391 [INFO] ms_mode 0
2024-05-22 03:30:39,391 [INFO] ms_amp_level O0
2024-05-22 03:30:39,391 [INFO] keep_loss_fp32 True
2024-05-22 03:30:39,391 [INFO] ms_loss_scaler static
2024-05-22 03:30:39,391 [INFO] ms_loss_scaler_value 1024.0
2024-05-22 03:30:39,391 [INFO] ms_jit True
2024-05-22 03:30:39,391 [INFO] ms_enable_graph_kernel False
2024-05-22 03:30:39,391 [INFO] ms_datasink False
2024-05-22 03:30:39,391 [INFO] overflow_still_update True
2024-05-22 03:30:39,391 [INFO] clip_grad True
2024-05-22 03:30:39,391 [INFO] clip_grad_value 10.0
2024-05-22 03:30:39,391 [INFO] ema True
2024-05-22 03:30:39,391 [INFO] weight
2024-05-22 03:30:39,391 [INFO] ema_weight
2024-05-22 03:30:39,391 [INFO] freeze []
2024-05-22 03:30:39,391 [INFO] epochs 300
2024-05-22 03:30:39,391 [INFO] per_batch_size 16
2024-05-22 03:30:39,391 [INFO] img_size 640
2024-05-22 03:30:39,391 [INFO] nbs 64
2024-05-22 03:30:39,391 [INFO] accumulate 1
2024-05-22 03:30:39,391 [INFO] auto_accumulate False
2024-05-22 03:30:39,391 [INFO] log_interval 100
2024-05-22 03:30:39,391 [INFO] single_cls False
2024-05-22 03:30:39,391 [INFO] sync_bn False
2024-05-22 03:30:39,391 [INFO] keep_checkpoint_max 100
2024-05-22 03:30:39,391 [INFO] run_eval False
2024-05-22 03:30:39,391 [INFO] conf_thres 0.001
2024-05-22 03:30:39,391 [INFO] iou_thres 0.7
2024-05-22 03:30:39,391 [INFO] conf_free True
2024-05-22 03:30:39,391 [INFO] rect False
2024-05-22 03:30:39,391 [INFO] nms_time_limit 20.0
2024-05-22 03:30:39,391 [INFO] recompute True
2024-05-22 03:30:39,391 [INFO] recompute_layers 2
2024-05-22 03:30:39,391 [INFO] seed 2
2024-05-22 03:30:39,391 [INFO] summary True
2024-05-22 03:30:39,391 [INFO] profiler False
2024-05-22 03:30:39,391 [INFO] profiler_step_num 1
2024-05-22 03:30:39,391 [INFO] opencv_threads_num 0
2024-05-22 03:30:39,391 [INFO] strict_load True
2024-05-22 03:30:39,391 [INFO] enable_modelarts False
2024-05-22 03:30:39,391 [INFO] data_url
2024-05-22 03:30:39,391 [INFO] ckpt_url
2024-05-22 03:30:39,391 [INFO] multi_data_url
2024-05-22 03:30:39,391 [INFO] pretrain_url
2024-05-22 03:30:39,391 [INFO] train_url
2024-05-22 03:30:39,391 [INFO] data_dir /split
2024-05-22 03:30:39,391 [INFO] ckpt_dir /cache/pretrain_ckpt/
2024-05-22 03:30:39,391 [INFO] data.dataset_name coco
2024-05-22 03:30:39,391 [INFO] data.train_set /split/images/train
2024-05-22 03:30:39,391 [INFO] data.val_set /split/images/val
2024-05-22 03:30:39,391 [INFO] data.test_set /split/images/test
2024-05-22 03:30:39,391 [INFO] data.nc 1
2024-05-22 03:30:39,391 [INFO] data.names ['dm']
2024-05-22 03:30:39,391 [INFO] train_transforms.stage_epochs [300]
2024-05-22 03:30:39,391 [INFO] train_transforms.trans_list [[{'func_name': 'resample_segments'}, {'func_name': 'letterbox', 'scaleup': True}, {'func_name': 'hsv_augment', 'prob': 1.0, 'hgain': 0.015, 'sgain': 0.7, 'vgain': 0.4}, {'func_name': 'fliplr', 'prob': 0.5}, {'func_name': 'segment_poly2mask', 'mask_overlap': True, 'mask_ratio': 4}, {'func_name': 'labelnorm', 'xyxy2xywh': True}, {'func_name': 'label_pad', 'padding_size': 160, 'padding_value': -1}, {'func_name': 'image_norm', 'scale': 255.0}, {'func_name': 'image_transpose', 'bgr2rgb': True, 'hwc2chw': True}]]
2024-05-22 03:30:39,391 [INFO] data.test_transforms [{'func_name': 'letterbox', 'scaleup': False}, {'func_name': 'image_norm', 'scale': 255.0}, {'func_name': 'image_transpose', 'bgr2rgb': True, 'hwc2chw': True}]
2024-05-22 03:30:39,391 [INFO] data.num_parallel_workers 4
2024-05-22 03:30:39,391 [INFO] network.model_name yolov8
2024-05-22 03:30:39,391 [INFO] network.nc 1
2024-05-22 03:30:39,391 [INFO] network.reg_max 16
2024-05-22 03:30:39,391 [INFO] network.stride [8, 16, 32]
2024-05-22 03:30:39,391 [INFO] network.backbone [[-1, 1, 'ConvNormAct', [64, 3, 2]], [-1, 1, 'ConvNormAct', [128, 3, 2]], [-1, 3, 'C2f', [128, True]], [-1, 1, 'ConvNormAct', [256, 3, 2]], [-1, 6, 'C2f', [256, True]], [-1, 1, 'ConvNormAct', [512, 3, 2]], [-1, 6, 'C2f', [512, True]], [-1, 1, 'ConvNormAct', [1024, 3, 2]], [-1, 3, 'C2f', [1024, True]], [-1, 1, 'SPPF', [1024, 5]]]
2024-05-22 03:30:39,391 [INFO] network.head [[-1, 1, 'Upsample', ['None', 2, 'nearest']], [[-1, 6], 1, 'Concat', [1]], [-1, 3, 'C2f', [512]], [-1, 1, 'Upsample', ['None', 2, 'nearest']], [[-1, 4], 1, 'Concat', [1]], [-1, 3, 'C2f', [256]], [-1, 1, 'ConvNormAct', [256, 3, 2]], [[-1, 12], 1, 'Concat', [1]], [-1, 3, 'C2f', [512]], [-1, 1, 'ConvNormAct', [512, 3, 2]], [[-1, 9], 1, 'Concat', [1]], [-1, 3, 'C2f', [1024]], [[15, 18, 21], 1, 'YOLOv8SegHead', ['nc', 'reg_max', 32, 256, 'stride']]]
2024-05-22 03:30:39,391 [INFO] network.depth_multiple 1.0
2024-05-22 03:30:39,391 [INFO] network.width_multiple 1.25
2024-05-22 03:30:39,391 [INFO] network.max_channels 512
2024-05-22 03:30:39,391 [INFO] optimizer.optimizer momentum
2024-05-22 03:30:39,391 [INFO] optimizer.lr_init 0.01
2024-05-22 03:30:39,391 [INFO] optimizer.momentum 0.937
2024-05-22 03:30:39,391 [INFO] optimizer.nesterov True
2024-05-22 03:30:39,391 [INFO] optimizer.loss_scale 1.0
2024-05-22 03:30:39,391 [INFO] optimizer.warmup_epochs 3
2024-05-22 03:30:39,391 [INFO] optimizer.warmup_momentum 0.8
2024-05-22 03:30:39,391 [INFO] optimizer.warmup_bias_lr 0.1
2024-05-22 03:30:39,391 [INFO] optimizer.min_warmup_step 1000
2024-05-22 03:30:39,391 [INFO] optimizer.group_param yolov8
2024-05-22 03:30:39,391 [INFO] optimizer.gp_weight_decay 0.0010078125
2024-05-22 03:30:39,391 [INFO] optimizer.start_factor 1.0
2024-05-22 03:30:39,391 [INFO] optimizer.end_factor 0.01
2024-05-22 03:30:39,391 [INFO] optimizer.epochs 300
2024-05-22 03:30:39,391 [INFO] optimizer.nbs 64
2024-05-22 03:30:39,391 [INFO] optimizer.accumulate 1
2024-05-22 03:30:39,391 [INFO] optimizer.total_batch_size 16
2024-05-22 03:30:39,391 [INFO] loss.name YOLOv8SegLoss
2024-05-22 03:30:39,391 [INFO] loss.box 7.5
2024-05-22 03:30:39,391 [INFO] loss.cls 0.5
2024-05-22 03:30:39,391 [INFO] loss.dfl 1.5
2024-05-22 03:30:39,391 [INFO] loss.reg_max 16
2024-05-22 03:30:39,391 [INFO] loss.nm 32
2024-05-22 03:30:39,391 [INFO] loss.overlap True
2024-05-22 03:30:39,391 [INFO] loss.max_object_num 600
2024-05-22 03:30:39,391 [INFO] config ./configs/yolov8/seg/yolov8x-seg.yaml
2024-05-22 03:30:39,391 [INFO] rank 0
2024-05-22 03:30:39,391 [INFO] rank_size 1
2024-05-22 03:30:39,391 [INFO] total_batch_size 16
2024-05-22 03:30:39,391 [INFO] callback []
2024-05-22 03:30:39,391 [INFO]
2024-05-22 03:30:39,393 [INFO] Please check the above information for the configurations
2024-05-22 03:30:39,832 [WARNING] Parse Model, args: nearest, keep str type
2024-05-22 03:30:39,893 [WARNING] Parse Model, args: nearest, keep str type
2024-05-22 03:30:40,340 [INFO] number of network params, total: 71.812198M, trainable: 71.751795M
2024-05-22 03:30:40,348 [INFO] Turn on recompute, and the results of the first 2 layers will be recomputed.
2024-05-22 03:30:45,855 [WARNING] Parse Model, args: nearest, keep str type
2024-05-22 03:30:45,946 [WARNING] Parse Model, args: nearest, keep str type
2024-05-22 03:30:46,698 [INFO] number of network params, total: 71.812198M, trainable: 71.751795M
2024-05-22 03:30:46,706 [INFO] Turn on recompute, and the results of the first 2 layers will be recomputed.
2024-05-22 03:30:52,107 [INFO] ema_weight not exist, default pretrain weight is currently used.
2024-05-22 03:30:52,414 [INFO] Dataset Cache file hash/version check success.
2024-05-22 03:30:52,414 [INFO] Load dataset cache from [/labels/train.cache.npy] success.
Scanning '/split/labels/train.cache.npy' images and labels... 1148 found, 0 missing, 1 empty, 0 corrup
2024-05-22 03:30:52,422 [INFO] Dataloader num parallel workers: [4]
2024-05-22 03:30:54,685 [INFO] Registry(name=callback, total=4)
2024-05-22 03:30:54,685 [INFO] (0): YoloxSwitchTrain in mindyolo/utils/callback.py
2024-05-22 03:30:54,685 [INFO] (1): EvalWhileTrain in mindyolo/utils/callback.py
2024-05-22 03:30:54,685 [INFO] (2): SummaryCallback in mindyolo/utils/callback.py
2024-05-22 03:30:54,685 [INFO] (3): ProfilerCallback in mindyolo/utils/callback.py
2024-05-22 03:30:54,685 [INFO]
2024-05-22 03:30:55,212 [INFO] got 1 active callback as follows:
2024-05-22 03:30:55,214 [INFO] SummaryCallback()

WongGawa commented 1 month ago

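Thank you for the feedback. mindyolo has so far only been validated on Ascend; it has not yet been validated on GPU or CPU. Once it has been validated there, we will post an update in this issue.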

Hucley commented 1 month ago

I am using a Huawei 910B. After switching from COCO to my own dataset, I ran into the same problem.

WongGawa commented 1 month ago

@Hucley Thanks for the feedback. If possible, please provide your MindSpore version and your MindYolo version.
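
One possible way to collect the requested version information is sketched below. It uses only the Python standard library and assumes both packages were installed with pip under the distribution names `mindspore` and `mindyolo`; that naming is an assumption, and a source checkout that was never installed will be reported as "not installed". The helper names `package_version` and `collect_versions` are hypothetical and not part of either library.

```python
from importlib import metadata


def package_version(name: str) -> str:
    """Return the installed version of a distribution, or a placeholder string."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        # No installed metadata found under this distribution name.
        return "not installed"


def collect_versions(names=("mindspore", "mindyolo")) -> dict:
    """Map each distribution name (assumed names, see above) to its version."""
    return {name: package_version(name) for name in names}


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version}")
```

Pasting the printed lines into the issue, together with the device type, should cover what is being asked for here.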