mindspore-lab / mindyolo

A toolbox of yolo models and algorithms based on MindSpore
Apache License 2.0

Error when training yolov7 on Ascend910p #168

Closed xunfeng2zkj closed 3 months ago

xunfeng2zkj commented 1 year ago

Environment

Hardware Environment (Ascend/GPU/CPU):

/device ascend

Software Environment:

Describe the current behavior

Describe the expected behavior

Steps to reproduce the issue

Related log / screenshot

Special notes for this issue

python3 train.py -c configs/yolov7/yolov7.yaml 2023-07-11 12:30:32,961 [INFO] parse_args: 2023-07-11 12:30:32,961 [INFO] device_target Ascend 2023-07-11 12:30:32,961 [INFO] save_dir ./runs/2023.07.11-12.30.32 2023-07-11 12:30:32,961 [INFO] device_per_servers 8 2023-07-11 12:30:32,961 [INFO] log_level INFO 2023-07-11 12:30:32,961 [INFO] is_parallel False 2023-07-11 12:30:32,961 [INFO] ms_mode 0 2023-07-11 12:30:32,961 [INFO] ms_amp_level O0 2023-07-11 12:30:32,961 [INFO] keep_loss_fp32 True 2023-07-11 12:30:32,961 [INFO] ms_loss_scaler static 2023-07-11 12:30:32,961 [INFO] ms_loss_scaler_value 1024.0 2023-07-11 12:30:32,961 [INFO] ms_grad_sens 1024.0 2023-07-11 12:30:32,961 [INFO] ms_jit True 2023-07-11 12:30:32,961 [INFO] ms_enable_graph_kernel False 2023-07-11 12:30:32,961 [INFO] ms_datasink False 2023-07-11 12:30:32,961 [INFO] overflow_still_update True 2023-07-11 12:30:32,961 [INFO] ema True 2023-07-11 12:30:32,961 [INFO] weight
2023-07-11 12:30:32,961 [INFO] ema_weight
2023-07-11 12:30:32,961 [INFO] freeze [] 2023-07-11 12:30:32,961 [INFO] epochs 300 2023-07-11 12:30:32,961 [INFO] per_batch_size 16 2023-07-11 12:30:32,961 [INFO] img_size 640 2023-07-11 12:30:32,961 [INFO] nbs 64 2023-07-11 12:30:32,961 [INFO] accumulate 1 2023-07-11 12:30:32,961 [INFO] auto_accumulate False 2023-07-11 12:30:32,961 [INFO] log_interval 100 2023-07-11 12:30:32,961 [INFO] single_cls False 2023-07-11 12:30:32,961 [INFO] sync_bn False 2023-07-11 12:30:32,961 [INFO] keep_checkpoint_max 100 2023-07-11 12:30:32,961 [INFO] run_eval False 2023-07-11 12:30:32,961 [INFO] conf_thres 0.001 2023-07-11 12:30:32,961 [INFO] iou_thres 0.65 2023-07-11 12:30:32,961 [INFO] conf_free False 2023-07-11 12:30:32,961 [INFO] rect False 2023-07-11 12:30:32,961 [INFO] nms_time_limit 20.0 2023-07-11 12:30:32,961 [INFO] recompute True 2023-07-11 12:30:32,961 [INFO] recompute_layers 5 2023-07-11 12:30:32,961 [INFO] seed 2 2023-07-11 12:30:32,961 [INFO] summary True 2023-07-11 12:30:32,961 [INFO] profiler False 2023-07-11 12:30:32,961 [INFO] profiler_step_num 1 2023-07-11 12:30:32,961 [INFO] opencv_threads_num 2 2023-07-11 12:30:32,961 [INFO] enable_modelarts False 2023-07-11 12:30:32,961 [INFO] data_url
2023-07-11 12:30:32,961 [INFO] ckpt_url
2023-07-11 12:30:32,961 [INFO] multi_data_url
2023-07-11 12:30:32,961 [INFO] pretrain_url
2023-07-11 12:30:32,961 [INFO] train_url
2023-07-11 12:30:32,961 [INFO] data_dir /cache/data/ 2023-07-11 12:30:32,961 [INFO] ckpt_dir /cache/pretrain_ckpt/ 2023-07-11 12:30:32,961 [INFO] data.path /home/ma-user/work/ 2023-07-11 12:30:32,961 [INFO] data.train_set /home/ma-user/work/night_car/car_train.txt 2023-07-11 12:30:32,961 [INFO] data.val_set /home/ma-user/work/night_car/car_val.txt 2023-07-11 12:30:32,961 [INFO] data.test_set /home/ma-user/work/night_car/car_val.txt 2023-07-11 12:30:32,961 [INFO] data.nc 1 2023-07-11 12:30:32,961 [INFO] data.names ['car'] 2023-07-11 12:30:32,961 [INFO] data.dataset_name coco 2023-07-11 12:30:32,961 [INFO] data.train_transforms [{'func_name': 'mosaic', 'prob': 1.0, 'mosaic9_prob': 0.2, 'translate': 0.2, 'scale': 0.9}, {'func_name': 'mixup', 'prob': 0.15, 'alpha': 8.0, 'beta': 8.0, 'needed_mosaic': True}, {'func_name': 'hsv_augment', 'prob': 1.0, 'hgain': 0.015, 'sgain': 0.7, 'vgain': 0.4}, {'func_name': 'pastein', 'prob': 0.15, 'num_sample': 30}, {'func_name': 'labelnorm', 'xyxy2xywh': True}, {'func_name': 'fliplr', 'prob': 0.5}, {'func_name': 'label_pad', 'padding_size': 160, 'padding_value': -1}, {'func_name': 'image_norm', 'scale': 255.0}, {'func_name': 'image_transpose', 'bgr2rgb': True, 'hwc2chw': True}] 2023-07-11 12:30:32,961 [INFO] data.test_transforms [{'func_name': 'letterbox', 'scaleup': False}, {'func_name': 'labelnorm', 'xyxy2xywh': True}, {'func_name': 'label_pad', 'padding_size': 160, 'padding_value': -1}, {'func_name': 'image_norm', 'scale': 255.0}, {'func_name': 'image_transpose', 'bgr2rgb': True, 'hwc2chw': True}] 2023-07-11 12:30:32,961 [INFO] data.num_parallel_workers 4 2023-07-11 12:30:32,961 [INFO] optimizer.optimizer momentum 2023-07-11 12:30:32,961 [INFO] optimizer.lr_init 0.01 2023-07-11 12:30:32,961 [INFO] optimizer.momentum 0.937 2023-07-11 12:30:32,961 [INFO] optimizer.nesterov True 2023-07-11 12:30:32,961 [INFO] optimizer.loss_scale 1.0 2023-07-11 12:30:32,961 [INFO] optimizer.warmup_epochs 3 2023-07-11 12:30:32,961 [INFO] 
optimizer.warmup_momentum 0.8 2023-07-11 12:30:32,961 [INFO] optimizer.warmup_bias_lr 0.1 2023-07-11 12:30:32,961 [INFO] optimizer.min_warmup_step 1000 2023-07-11 12:30:32,961 [INFO] optimizer.group_param yolov7 2023-07-11 12:30:32,961 [INFO] optimizer.gp_weight_decay 0.0005 2023-07-11 12:30:32,961 [INFO] optimizer.start_factor 1.0 2023-07-11 12:30:32,961 [INFO] optimizer.end_factor 0.1 2023-07-11 12:30:32,961 [INFO] optimizer.epochs 300 2023-07-11 12:30:32,961 [INFO] optimizer.nbs 64 2023-07-11 12:30:32,961 [INFO] optimizer.accumulate 1 2023-07-11 12:30:32,961 [INFO] optimizer.total_batch_size 16 2023-07-11 12:30:32,961 [INFO] loss.name YOLOv7Loss 2023-07-11 12:30:32,961 [INFO] loss.box 0.05 2023-07-11 12:30:32,961 [INFO] loss.cls 0.3 2023-07-11 12:30:32,961 [INFO] loss.cls_pw 1.0 2023-07-11 12:30:32,961 [INFO] loss.obj 0.7 2023-07-11 12:30:32,961 [INFO] loss.obj_pw 1.0 2023-07-11 12:30:32,961 [INFO] loss.fl_gamma 0.0 2023-07-11 12:30:32,961 [INFO] loss.anchor_t 4.0 2023-07-11 12:30:32,961 [INFO] loss.label_smoothing 0.0 2023-07-11 12:30:32,961 [INFO] network.model_name yolov7 2023-07-11 12:30:32,961 [INFO] network.depth_multiple 1.0 2023-07-11 12:30:32,961 [INFO] network.width_multiple 1.0 2023-07-11 12:30:32,961 [INFO] network.stride [8, 16, 32] 2023-07-11 12:30:32,961 [INFO] network.anchors [[12, 16, 19, 36, 40, 28], [36, 75, 76, 55, 72, 146], [142, 110, 192, 243, 459, 401]] 2023-07-11 12:30:32,961 [INFO] network.backbone [[-1, 1, 'ConvNormAct', [32, 3, 1]], [-1, 1, 'ConvNormAct', [64, 3, 2]], [-1, 1, 'ConvNormAct', [64, 3, 1]], [-1, 1, 'ConvNormAct', [128, 3, 2]], [-1, 1, 'ConvNormAct', [64, 1, 1]], [-2, 1, 'ConvNormAct', [64, 1, 1]], [-1, 1, 'ConvNormAct', [64, 3, 1]], [-1, 1, 'ConvNormAct', [64, 3, 1]], [-1, 1, 'ConvNormAct', [64, 3, 1]], [-1, 1, 'ConvNormAct', [64, 3, 1]], [[-1, -3, -5, -6], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [256, 1, 1]], [-1, 1, 'MP', []], [-1, 1, 'ConvNormAct', [128, 1, 1]], [-3, 1, 'ConvNormAct', [128, 1, 1]], [-1, 1, 
'ConvNormAct', [128, 3, 2]], [[-1, -3], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [128, 1, 1]], [-2, 1, 'ConvNormAct', [128, 1, 1]], [-1, 1, 'ConvNormAct', [128, 3, 1]], [-1, 1, 'ConvNormAct', [128, 3, 1]], [-1, 1, 'ConvNormAct', [128, 3, 1]], [-1, 1, 'ConvNormAct', [128, 3, 1]], [[-1, -3, -5, -6], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [512, 1, 1]], [-1, 1, 'MP', []], [-1, 1, 'ConvNormAct', [256, 1, 1]], [-3, 1, 'ConvNormAct', [256, 1, 1]], [-1, 1, 'ConvNormAct', [256, 3, 2]], [[-1, -3], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [256, 1, 1]], [-2, 1, 'ConvNormAct', [256, 1, 1]], [-1, 1, 'ConvNormAct', [256, 3, 1]], [-1, 1, 'ConvNormAct', [256, 3, 1]], [-1, 1, 'ConvNormAct', [256, 3, 1]], [-1, 1, 'ConvNormAct', [256, 3, 1]], [[-1, -3, -5, -6], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [1024, 1, 1]], [-1, 1, 'MP', []], [-1, 1, 'ConvNormAct', [512, 1, 1]], [-3, 1, 'ConvNormAct', [512, 1, 1]], [-1, 1, 'ConvNormAct', [512, 3, 2]], [[-1, -3], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [256, 1, 1]], [-2, 1, 'ConvNormAct', [256, 1, 1]], [-1, 1, 'ConvNormAct', [256, 3, 1]], [-1, 1, 'ConvNormAct', [256, 3, 1]], [-1, 1, 'ConvNormAct', [256, 3, 1]], [-1, 1, 'ConvNormAct', [256, 3, 1]], [[-1, -3, -5, -6], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [1024, 1, 1]]] 2023-07-11 12:30:32,961 [INFO] network.head [[-1, 1, 'SPPCSPC', [512]], [-1, 1, 'ConvNormAct', [256, 1, 1]], [-1, 1, 'Upsample', ['None', 2, 'nearest']], [37, 1, 'ConvNormAct', [256, 1, 1]], [[-1, -2], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [256, 1, 1]], [-2, 1, 'ConvNormAct', [256, 1, 1]], [-1, 1, 'ConvNormAct', [128, 3, 1]], [-1, 1, 'ConvNormAct', [128, 3, 1]], [-1, 1, 'ConvNormAct', [128, 3, 1]], [-1, 1, 'ConvNormAct', [128, 3, 1]], [[-1, -2, -3, -4, -5, -6], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [256, 1, 1]], [-1, 1, 'ConvNormAct', [128, 1, 1]], [-1, 1, 'Upsample', ['None', 2, 'nearest']], [24, 1, 'ConvNormAct', [128, 1, 1]], [[-1, -2], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [128, 1, 1]], [-2, 1, 
'ConvNormAct', [128, 1, 1]], [-1, 1, 'ConvNormAct', [64, 3, 1]], [-1, 1, 'ConvNormAct', [64, 3, 1]], [-1, 1, 'ConvNormAct', [64, 3, 1]], [-1, 1, 'ConvNormAct', [64, 3, 1]], [[-1, -2, -3, -4, -5, -6], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [128, 1, 1]], [-1, 1, 'MP', []], [-1, 1, 'ConvNormAct', [128, 1, 1]], [-3, 1, 'ConvNormAct', [128, 1, 1]], [-1, 1, 'ConvNormAct', [128, 3, 2]], [[-1, -3, 63], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [256, 1, 1]], [-2, 1, 'ConvNormAct', [256, 1, 1]], [-1, 1, 'ConvNormAct', [128, 3, 1]], [-1, 1, 'ConvNormAct', [128, 3, 1]], [-1, 1, 'ConvNormAct', [128, 3, 1]], [-1, 1, 'ConvNormAct', [128, 3, 1]], [[-1, -2, -3, -4, -5, -6], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [256, 1, 1]], [-1, 1, 'MP', []], [-1, 1, 'ConvNormAct', [256, 1, 1]], [-3, 1, 'ConvNormAct', [256, 1, 1]], [-1, 1, 'ConvNormAct', [256, 3, 2]], [[-1, -3, 51], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [512, 1, 1]], [-2, 1, 'ConvNormAct', [512, 1, 1]], [-1, 1, 'ConvNormAct', [256, 3, 1]], [-1, 1, 'ConvNormAct', [256, 3, 1]], [-1, 1, 'ConvNormAct', [256, 3, 1]], [-1, 1, 'ConvNormAct', [256, 3, 1]], [[-1, -2, -3, -4, -5, -6], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [512, 1, 1]], [75, 1, 'RepConv', [256, 3, 1]], [88, 1, 'RepConv', [512, 3, 1]], [101, 1, 'RepConv', [1024, 3, 1]], [[102, 103, 104], 1, 'YOLOv7Head', ['nc', 'anchors', 'stride']]] 2023-07-11 12:30:32,961 [INFO] config configs/yolov7/yolov7.yaml 2023-07-11 12:30:32,961 [INFO] rank 0 2023-07-11 12:30:32,961 [INFO] rank_size 1 2023-07-11 12:30:32,961 [INFO] total_batch_size 16 2023-07-11 12:30:32,961 [INFO] callback [] 2023-07-11 12:30:32,961 [INFO] 2023-07-11 12:30:32,963 [INFO] Please check the above information for the configurations 2023-07-11 12:30:33,910 [WARNING] Parse Model, args: nearest, keep str type 2023-07-11 12:30:34,007 [WARNING] Parse Model, args: nearest, keep str type 2023-07-11 12:30:34,410 [INFO] number of network params, total: 37.246339M, trainable: 37.196556M 2023-07-11 12:30:34,422 
[INFO] Turn on recompute, and the results of the first 5 layers will be recomputed. 2023-07-11 12:30:54,044 [WARNING] Parse Model, args: nearest, keep str type 2023-07-11 12:30:54,141 [WARNING] Parse Model, args: nearest, keep str type 2023-07-11 12:30:54,554 [INFO] number of network params, total: 37.246339M, trainable: 37.196556M 2023-07-11 12:30:54,566 [INFO] Turn on recompute, and the results of the first 5 layers will be recomputed.

TotalTime = 12.5261, [16]

[symbol_resolve]: 0.0445111, [1] [Cycle 1]: 0.0444429, [1]

[optimize]: 0.267299, [22]

[simplify_data_structures]: 0.00175699
[opt_a]: 0.225315, [2]
    [Cycle 1]: 0.189621, [26]
        [expand_dump_flag]: 2.129e-05
        [switch_simplify]: 0.00170734
        [a_1]: 0.0976232
        [recompute_prepare]: 0.000858418
        [updatestate_depend_eliminate]: 0.00873714
        [updatestate_assign_eliminate]: 0.00803381
        [updatestate_loads_eliminate]: 0.000671716
        [parameter_eliminate]: 4.80999e-06
        [a_2]: 0.00837349
        [accelerated_algorithm]: 0.000756657
        [pynative_shard]: 2.82e-06
        [auto_parallel]: 6.34999e-06
        [parallel]: 2.007e-05
        [allreduce_fusion]: 0.000251762
        [virtual_dataset]: 0.000502855
        [get_grad_eliminate_]: 0.000457934
        [virtual_output]: 0.000458515
        [meta_fg_expand]: 0.00116805
        [after_resolve]: 0.00144858
        [a_after_grad]: 0.000627087
        [renormalize]: 0.044053
        [real_op_eliminate]: 0.000615936
        [auto_monad_grad]: 6.74999e-06
        [auto_monad_eliminator]: 0.00226781
        [cse]: 0.00587297
        [a_3]: 0.0046758
    [Cycle 2]: 0.0336986, [26]
        [expand_dump_flag]: 2.83e-06
        [switch_simplify]: 0.000454435
        [a_1]: 0.00910322
        [recompute_prepare]: 0.000364614
        [updatestate_depend_eliminate]: 0.000308563
        [updatestate_assign_eliminate]: 0.000389534
        [updatestate_loads_eliminate]: 0.000425564
        [parameter_eliminate]: 3.24e-06
        [a_2]: 0.00805789
        [accelerated_algorithm]: 0.000751357
        [pynative_shard]: 2.22001e-06
        [auto_parallel]: 4.88e-06
        [parallel]: 3.75e-06
        [allreduce_fusion]: 0.000214552
        [virtual_dataset]: 0.000485115
        [get_grad_eliminate_]: 0.000441414
        [virtual_output]: 0.000442174
        [meta_fg_expand]: 0.000825678
        [after_resolve]: 0.00129374
        [a_after_grad]: 0.000579015
        [renormalize]: 2.00002e-07
        [real_op_eliminate]: 0.000440745
        [auto_monad_grad]: 3.28e-06
        [auto_monad_eliminator]: 0.0019583
        [cse]: 0.00241402
        [a_3]: 0.00446869
[item_dict_eliminate_after_opt_a]: 0.00123497, [1]
    [Cycle 1]: 0.00122313, [2]
        [mutable_eliminate]: 0.000456055
        [item_dict_eliminate]: 0.000746987
[clean_after_opta]: 0.000492505
[opt_b]: 0.0161969, [1]
    [Cycle 1]: 0.0161843, [7]
        [b_1]: 0.0124879
        [b_2]: 0.000552775
        [updatestate_depend_eliminate]: 0.000305804
        [updatestate_assign_eliminate]: 0.000387884
        [updatestate_loads_eliminate]: 0.000426934
        [renormalize]: 6.50005e-07
        [cse]: 0.0019323
[cconv]: 0.000306543
[opt_after_cconv]: 0.00515957, [1]
    [Cycle 1]: 0.00514809, [6]
        [c_1]: 0.00202726
        [updatestate_depend_eliminate]: 0.000309404
        [updatestate_assign_eliminate]: 0.000387333
        [updatestate_loads_eliminate]: 0.000425044
        [cse]: 0.00193491
        [renormalize]: 6.90008e-07
[remove_dup_value]: 0.000110661
[tuple_transform]: 0.00356582, [1]
    [Cycle 1]: 0.00355534, [2]
        [d_1]: 0.00353368
        [renormalize]: 5.50004e-07
[add_cache_embedding]: 0.00431373
[add_recomputation]: 0.00476212
[cse_after_recomputation]: 0.00210252, [1]
    [Cycle 1]: 0.00208754, [1]
        [cse]: 0.00202823
[environ_conv]: 0.000867508
[label_micro_interleaved_index]: 3.60001e-06
[slice_recompute_activation]: 3.10101e-06
[micro_interleaved_order_control]: 2.51e-06
[reorder_send_recv_between_fp_bp]: 2.26e-06
[comm_op_add_attrs]: 2.41e-05
[add_comm_op_reuse_tag]: 1.91999e-06
[overlap_opt_shard_in_pipeline]: 1.62999e-06
[handle_group_info]: 1.49e-06

Sums parse : 0.027688s : 0.22% symbol_resolve.resolve : 0.044421s : 0.35% combine_like_graphs : 0.000001s : 0.00% meta_unpack_prepare : 0.000177s : 0.00% abstract_specialize : 0.450212s : 3.60% auto_monad : 0.008033s : 0.06% inline : 0.000058s : 0.00% pipeline_split : 0.000032s : 0.00% optimize.py_interpret_to_execute : 0.000857s : 0.01% optimize.simplify_data_structures : 0.001757s : 0.01% optimize.opt_a.expand_dump_flag : 0.000024s : 0.00% optimize.opt_a.switch_simplify : 0.002162s : 0.02% optimize.opt_a.a_1 : 0.106726s : 0.85% optimize.opt_a.recompute_prepare : 0.001223s : 0.01% optimize.opt_a.updatestate_depend_eliminate : 0.009046s : 0.07% optimize.opt_a.updatestate_assign_eliminate : 0.008423s : 0.07% optimize.opt_a.updatestate_loads_eliminate : 0.001097s : 0.01% optimize.opt_a.parameter_eliminate : 0.000008s : 0.00% optimize.opt_a.a_2 : 0.016431s : 0.13% optimize.opt_a.accelerated_algorithm : 0.001508s : 0.01% optimize.opt_a.pynative_shard : 0.000005s : 0.00% optimize.opt_a.auto_parallel : 0.000011s : 0.00% optimize.opt_a.parallel : 0.000024s : 0.00% optimize.opt_a.allreduce_fusion : 0.000466s : 0.00% optimize.opt_a.virtual_dataset : 0.000988s : 0.01% optimize.opt_a.get_gradeliminate : 0.000899s : 0.01% optimize.opt_a.virtual_output : 0.000901s : 0.01% optimize.opt_a.meta_fg_expand : 0.001994s : 0.02% optimize.opt_a.after_resolve : 0.002742s : 0.02% optimize.opt_a.a_after_grad : 0.001206s : 0.01% optimize.opt_a.renormalize : 0.044053s : 0.35% optimize.opt_a.real_op_eliminate : 0.001057s : 0.01% optimize.opt_a.auto_monad_grad : 0.000010s : 0.00% optimize.opt_a.auto_monad_eliminator : 0.004226s : 0.03% optimize.opt_a.cse : 0.008287s : 0.07% optimize.opt_a.a_3 : 0.009144s : 0.07% optimize.item_dict_eliminate_after_opt_a.mutable_eliminate : 0.000456s : 0.00% optimize.item_dict_eliminate_after_opt_a.item_dict_eliminate : 0.000747s : 0.01% optimize.clean_after_opta : 0.000493s : 0.00% optimize.opt_b.b_1 : 0.012488s : 0.10% optimize.opt_b.b_2 : 0.000553s : 0.00% 
optimize.opt_b.updatestate_depend_eliminate : 0.000306s : 0.00% optimize.opt_b.updatestate_assign_eliminate : 0.000388s : 0.00% optimize.opt_b.updatestate_loads_eliminate : 0.000427s : 0.00% optimize.opt_b.renormalize : 0.000001s : 0.00% optimize.opt_b.cse : 0.001932s : 0.02% optimize.cconv : 0.000307s : 0.00% optimize.opt_after_cconv.c_1 : 0.002027s : 0.02% optimize.opt_after_cconv.updatestate_depend_eliminate : 0.000309s : 0.00% optimize.opt_after_cconv.updatestate_assign_eliminate : 0.000387s : 0.00% optimize.opt_after_cconv.updatestate_loads_eliminate : 0.000425s : 0.00% optimize.opt_after_cconv.cse : 0.001935s : 0.02% optimize.opt_after_cconv.renormalize : 0.000001s : 0.00% optimize.remove_dup_value : 0.000111s : 0.00% optimize.tuple_transform.d_1 : 0.003534s : 0.03% optimize.tuple_transform.renormalize : 0.000001s : 0.00% optimize.add_cache_embedding : 0.004314s : 0.03% optimize.add_recomputation : 0.004762s : 0.04% optimize.cse_after_recomputation.cse : 0.002028s : 0.02% optimize.environ_conv : 0.000868s : 0.01% optimize.label_micro_interleaved_index : 0.000004s : 0.00% optimize.slice_recompute_activation : 0.000003s : 0.00% optimize.micro_interleaved_order_control : 0.000003s : 0.00% optimize.reorder_send_recv_between_fp_bp : 0.000002s : 0.00% optimize.comm_op_add_attrs : 0.000024s : 0.00% optimize.add_comm_op_reuse_tag : 0.000002s : 0.00% optimize.overlap_opt_shard_in_pipeline : 0.000002s : 0.00% optimize.handle_group_info : 0.000001s : 0.00% auto_monad_reorder : 0.002861s : 0.02% eliminate_forward_cnode : 0.000001s : 0.00% eliminate_special_op_node : 0.002115s : 0.02% validate : 0.002463s : 0.02% distribtued_split : 0.000002s : 0.00% task_emit : 11.720087s : 93.59% execute : 0.000011s : 0.00%

Time group info: ------[substitution.] 0.069899 13241 0.05% : 0.000035s : 2: substitution.depend_value_elim 61.05% : 0.042670s : 10: substitution.getattr_resolve 0.89% : 0.000624s : 1751: substitution.graph_param_transform 26.80% : 0.018731s : 955: substitution.inline 0.11% : 0.000079s : 320: substitution.less_batch_normalization 0.05% : 0.000036s : 9: substitution.meta_unpack_prepare 0.73% : 0.000507s : 1906: substitution.replace_old_param 2.34% : 0.001638s : 952: substitution.tuple_list_get_item_eliminator 2.96% : 0.002069s : 3508: substitution.updatestate_pure_node_eliminater 5.02% : 0.003510s : 3828: substitution.updatestate_useless_node_eliminater ------[renormalize.] 0.043854 2 50.05% : 0.021948s : 1: renormalize.infer 49.95% : 0.021905s : 1: renormalize.specialize ------[replace.] 0.020341 1916 5.97% : 0.001215s : 9: replace.getattr_resolve 61.80% : 0.012570s : 955: replace.inline 32.23% : 0.006556s : 952: replace.tuple_list_get_item_eliminator ------[match.] 0.063034 1916 67.69% : 0.042665s : 9: match.getattr_resolve 29.72% : 0.018731s : 955: match.inline 2.60% : 0.001638s : 952: match.tuple_list_get_item_eliminator ------[func_graph_cloner_run.] 0.037680 1004 34.60% : 0.013037s : 47: func_graph_cloner_run.FuncGraphClonerGraph 22.20% : 0.008364s : 862: func_graph_cloner_run.FuncGraphClonerNode 43.20% : 0.016279s : 95: func_graph_cloner_run.FuncGraphSpecializer ------[meta_graph.] 0.000000 0 ------[manager.] 0.000000 0 ------[pynative] 0.000000 0 ------[others.] 
0.210734 104 12.09% : 0.025470s : 50: opt.transform.opt_a 5.90% : 0.012443s : 23: opt.transform.opt_b 21.07% : 0.044391s : 2: opt.transform.opt_resolve 0.57% : 0.001198s : 2: opt.transforms.item_dict_eliminate_after_opt_a 0.08% : 0.000160s : 1: opt.transforms.meta_unpack_prepare 56.65% : 0.119370s : 20: opt.transforms.opt_a 0.96% : 0.002024s : 1: opt.transforms.opt_after_cconv 0.26% : 0.000550s : 1: opt.transforms.opt_b 1.68% : 0.003531s : 1: opt.transforms.opt_trans_graph 0.76% : 0.001597s : 3: opt.transforms.special_op_eliminate

2023-07-11 12:31:07,575 [INFO] ema_weight not exist, default pretrain weight is currently used. 2023-07-11 12:31:07,722 [INFO] Dataset cache file hash/version check fail. 2023-07-11 12:31:07,722 [INFO] Datset caching now... Scanning '/home/ma-user/work/night_car/car_train.cache' images and labels... 4726 found, 0 missing, 179 empty, 0 corrupted: 100%|█| 4726/4726 [00:03< 2023-07-11 12:31:11,640 [INFO] New cache created: /home/ma-user/work/night_car/car_train.cache.npy 2023-07-11 12:31:11,647 [INFO] Dataset caching success. 2023-07-11 12:31:11,725 [INFO] Dataloader num parallel workers: [4] 2023-07-11 12:31:14,025 [INFO] Registry(name=callback, total=4) 2023-07-11 12:31:14,025 [INFO] (0): YoloxSwitchTrain in mindyolo/utils/callback.py 2023-07-11 12:31:14,025 [INFO] (1): EvalWhileTrain in mindyolo/utils/callback.py 2023-07-11 12:31:14,025 [INFO] (2): SummaryCallback in mindyolo/utils/callback.py 2023-07-11 12:31:14,025 [INFO] (3): ProfilerCallback in mindyolo/utils/callback.py 2023-07-11 12:31:14,025 [INFO] 2023-07-11 12:31:14,427 [INFO] got 1 active callback as follows: 2023-07-11 12:31:14,428 [INFO] SummaryCallback() 2023-07-11 12:31:14,428 [WARNING] The first epoch will be compiled for the graph, which may take a long time; You can come back later :). [ERROR] ANALYZER(28402,ffffbed26a70,python3):2023-07-11-12:58:09.445.089 [mindspore/ccsrc/pipeline/jit/static_analysis/async_eval_result.cc:66] HandleException] Exception happened, check the information as below.

The function call stack (See file '/home/ma-user/work/mindyolo/rank_0/om/analyze_fail.dat' for more details. Get instructions about analyze_fail.dat at https://www.mindspore.cn/search?inputValue=analyze_fail.dat):

0 In file /home/ma-user/work/mindyolo/mindyolo/utils/train_step_factory.py:72

    return train_step_func(*args)
           ^

1 In file /home/ma-user/work/mindyolo/mindyolo/utils/train_step_factory.py:57

    if optimizer_update:

2 In file /home/ma-user/work/mindyolo/mindyolo/utils/train_step_factory.py:52

    (loss, loss_items), grads = grad_fn(x, label)
                                ^

3 In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/composite/base.py:574

                    return grad_(fn, weights)(*args)
                           ^

4 In file /home/ma-user/work/mindyolo/mindyolo/utils/train_step_factory.py:45

    loss, loss_items = loss_fn(pred, label, x)
                       ^

5 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py:81

    for pp in p:

6 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py:69

    bs, as_, gjs, gis, targets, anchors, tmasks = self.build_targets(p, targets, imgs)  # bs: (nl, bs*5*na*gt_max)
                                                  ^

7 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py:150

    for i in range(self.nl):

8 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py:126

    indices, anch, tmasks = self.find_3_positive(p, targets)
                            ^

9 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py:296

    for i in range(self.nl):

10 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py:298

        gain[2:6] = get_tensor(shape, targets.dtype)[[3, 2, 3, 2]]  # xyxy gain # [W, H, W, H]
        ^

11 In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/composite/multitype_ops/_compile_utils.py:918

if check_result:

12 In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/composite/multitype_ops/_compile_utils.py:921

    if step == 1 and not const_utils.is_ascend():

13 In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/composite/multitype_ops/_compile_utils.py:931

    if F.is_sequence_value_unknown(data_shape):

14 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py:298

        gain[2:6] = get_tensor(shape, targets.dtype)[[3, 2, 3, 2]]  # xyxy gain # [W, H, W, H]
        ^

15 In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/composite/multitype_ops/_compile_utils.py:934

    indices = const_utils.slice2indices(input_slice, data_shape)
              ^

Traceback (most recent call last):
  File "train.py", line 309, in <module>
    train(args)
  File "train.py", line 282, in train
    profiler_step_num=args.profiler_step_num
  File "/home/ma-user/work/mindyolo/mindyolo/utils/trainer_factory.py", line 169, in train
    run_context.loss, run_context.lr = self.train_step(imgs, labels, cur_step=cur_step, cur_epoch=cur_epoch)
  File "/home/ma-user/work/mindyolo/mindyolo/utils/trainer_factory.py", line 357, in train_step
    loss, loss_item, _, grads_finite = self.train_step_fn(imgs, labels, True)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/api.py", line 588, in staging_specialize
    out = _MindsporeFunctionExecutor(func, hash_obj, input_signature, process_obj, jit_config)(*args)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/api.py", line 101, in wrapper
    results = fn(*arg, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/api.py", line 313, in __call__
    raise err
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/api.py", line 310, in __call__
    phase = self.compile(args_list, self.fn.__name__)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/api.py", line 386, in compile
    is_compile = self._graph_executor.compile(self.fn, compile_args, phase, True)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/primitive.py", line 811, in __infer__
    return {'dtype': None, 'shape': None, 'value': fn(*value_args)}
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/composite/multitype_ops/_constexpr_utils.py", line 401, in slice2indices
    mstype.int64, (), stop), P.Fill()(mstype.int64, (), step))]
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/primitive.py", line 317, in __call__
    return _run_op(self, self.name, args)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/primitive.py", line 885, in _run_op
    return _run_op_sync(obj, op_name, args)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/api.py", line 101, in wrapper
    results = fn(*arg, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/primitive.py", line 891, in _run_op_sync
    output = _pynative_executor.real_run_op(obj, op_name, args)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/api.py", line 1023, in real_run_op
    return self._executor.real_run_op(*args)
RuntimeError: The node: Default/Range-op805 compute tiling failed!
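For context on why a Range op appears here at all: the failing line is the tensor setitem `gain[2:6] = ...`, which MindSpore lowers through `const_utils.slice2indices` (visible in the traceback above), materializing the slice as explicit indices via a Range op; it is that Range op which then fails tiling on Ascend. A minimal stdlib sketch of the slice-to-indices expansion involved (the function name and the 7-element gain vector are illustrative, based on the usual YOLOv7 loss code, not MindSpore's actual implementation):

```python
def slice_to_indices(sl, length):
    # Expand a Python slice into the explicit index list that a
    # Range-style op would have to produce at compile/run time.
    start, stop, step = sl.indices(length)
    return list(range(start, stop, step))

# The failing setitem is gain[2:6] = ..., assuming the usual
# 7-element gain vector from the YOLOv7 loss:
print(slice_to_indices(slice(2, 6), 7))  # -> [2, 3, 4, 5]
```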


(Please search "Ascend Error Message" at https://www.mindspore.cn for error code description)



xunfeng2zkj commented 1 year ago

[ASCII-art banner]

Using user ma-user. EulerOS 2.0 (SP8), CANN-6.0.1. Tips: 1) Navigate to the target conda environment. For details, see /home/ma-user/README. 2) Copy (Ctrl+C) and paste (Ctrl+V) on the jupyter terminal. 3) Store your data in /home/ma-user/work, to which a persistent volume is mounted.

zhanghuiyao commented 1 year ago

This seems to be related to the MindSpore version. You can try the master branch code with MindSpore 2.0, or the r0.1 branch with MindSpore 1.8.1.
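A sketch of pairing the code branch with the installed MindSpore version, assuming a local git clone of mindyolo:

```shell
# Branch / MindSpore pairing:
#   master branch -> MindSpore 2.0
#   r0.1 branch   -> MindSpore 1.8.1
cd mindyolo
git checkout r0.1   # when running MindSpore 1.8.1
```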

xunfeng2zkj commented 1 year ago

The current error occurs with the MindSpore 2.0 image on ModelArts (provided by support staff). The MindSpore 1.8.1 image on ModelArts also produces a different error on the r0.1 branch.

xunfeng2zkj commented 1 year ago

Trying mindspore-1.8.1 produces the following error: [ERROR] ANALYZER(77504,ffffa12a0a40,python3):2023-07-11-18:21:39.720.409 [mindspore/ccsrc/pipeline/jit/static_analysis/async_eval_result.cc:66] HandleException] Exception happened, check the information as below.

The function call stack (See file '/home/ma-user/work/mindyolo/rank_0/om/analyze_fail.dat' for more details. Get instructions about analyze_fail.dat at https://www.mindspore.cn/search?inputValue=analyze_fail.dat):

0 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py(81)

    for pp in p:

1 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py(86)

    for i in range(self.nl):  # layer index
    ^

2 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py(123)

    return _loss * bs, ops.stop_gradient(ops.stack((_loss, lbox, lobj, lcls)))
    ^

3 In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/function/array_func.py(1198)

    return _stack(input_x)
    ^

Traceback (most recent call last):
  File "train.py", line 290, in <module>
    train(args)
  File "train.py", line 282, in train
    ms_jit=args.ms_jit
  File "/home/ma-user/work/mindyolo/mindyolo/utils/trainer_factory.py", line 170, in train
    self.train_step(imgs, labels, cur_step=cur_step, cur_epoch=cur_epoch)
  File "/home/ma-user/work/mindyolo/mindyolo/utils/trainer_factory.py", line 218, in train_step
    loss, loss_item, _, grads_finite = self.train_step_fn(imgs, labels, True)
  File "/home/ma-user/work/mindyolo/mindyolo/utils/train_step_factory.py", line 51, in train_step_func
    (loss, loss_items), grads = grad_fn(x, label)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/functional.py", line 453, in inner_aux_grad_fn
    res = aux_fn(*args)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/functional.py", line 435, in aux_fn
    outputs = fn(*args)
  File "/home/ma-user/work/mindyolo/mindyolo/utils/train_step_factory.py", line 44, in forward_func
    loss, loss_items = loss_fn(pred, label, x)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/nn/cell.py", line 578, in __call__
    out = self.compile_and_run(*args)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/nn/cell.py", line 965, in compile_and_run
    self.compile(*inputs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/nn/cell.py", line 938, in compile
    jit_config_dict=self._jit_config_dict)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/api.py", line 1137, in compile
    result = self._graph_executor.compile(obj, args_list, phase, self._use_vm_mode())
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/operations/array_ops.py", line 2862, in __infer__
    all_shape = _get_stack_shape(value, x_shape, x_type, self.axis, self.name)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/operations/array_ops.py", line 2781, in _get_stack_shape
    validator.check('x_type[%d]' % i, x_type[i], 'base', x_type[0], Rel.EQ, prim_name, TypeError)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/_checkparam.py", line 228, in check
    raise excp_cls(f'{msg_prefix} \'{arg_name}\' should be {rel_str}, but got {arg_value}.')
TypeError: For 'Stack', the 'x_type[3]' should be = base: Tensor[Float32], but got Float32.
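The TypeError comes from Stack's input validation: every element passed to `ops.stack` must be a Tensor of the same type as the first element, and here the fourth loss item (`lcls`, index 3 in the `(_loss, lbox, lobj, lcls)` tuple) is inferred as a plain Float32 scalar rather than a Tensor[Float32]. A minimal stdlib sketch of that check (names are illustrative, not MindSpore's real validator):

```python
def check_stack_types(x_types):
    # Every element must match the type of the first ('base') element,
    # mirroring the validator.check call seen in the traceback.
    base = x_types[0]
    for i, t in enumerate(x_types):
        if t != base:
            raise TypeError(
                f"For 'Stack', the 'x_type[{i}]' should be = base: {base}, "
                f"but got {t}."
            )

# (_loss, lbox, lobj) are Tensor[Float32], but lcls is inferred as a scalar:
try:
    check_stack_types(["Tensor[Float32]", "Tensor[Float32]",
                       "Tensor[Float32]", "Float32"])
except TypeError as e:
    print(e)  # reproduces the message from the log
```

The usual fix is to ensure `lcls` is built as a Tensor (e.g. initialized with a tensor zero rather than a Python float) before stacking, or to use a MindSpore/CANN pairing whose compiler keeps it as a tensor.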

zhanghuiyao commented 1 year ago

This error looks like an operator type mismatch during graph compilation; there is a fairly high probability that it is related to the CANN package and MindSpore versions.

zhanghuiyao commented 1 year ago

You can try running the following commands to check the MindSpore version and verify that the installation is working:

```
pip show mindspore
cat /path_to/mindspore/.commit_id
python
>>> import mindspore as ms
>>> ms.run_check()
```
xunfeng2zkj commented 1 year ago

```
Name: mindspore-ascend
Version: 1.8.1
Summary: MindSpore is a new open source deep learning training/inference framework that could be used for mobile, edge and cloud scenarios.
Home-page: https://www.mindspore.cn
Author: The MindSpore Authors
Author-email: contact@mindspore.cn
License: Apache 2.0
Location: /home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages
Requires: asttokens, astunparse, numpy, packaging, pillow, protobuf, psutil, scipy
Required-by: mindx-elastic
```

```
MindSpore version: 1.8.1
The result of multiplication calculation is correct, MindSpore has been installed successfully!
```

I first used the officially provided image (screenshot not shown), then installed mindspore-ascend 1.8.1 inside it, and ran the training on ModelArts.

zhanghuiyao commented 1 year ago

Installing it that way can leave MindSpore and the CANN version mismatched and trigger unexpected errors. You could ask the official support staff for the standard matched 1.8.1/1.9 image; for installation versions, refer to the MindSpore official website.

xunfeng2zkj commented 1 year ago

`TypeError: For 'Stack', the 'x_type[3]' should be = base: Tensor[Float32], but got Float32.` Do you know what these types are? Right now I only use lbox as the loss_item; does that affect the results? It does run somewhat slowly, though.

zhanghuiyao commented 1 year ago

> `TypeError: For 'Stack', the 'x_type[3]' should be = base: Tensor[Float32], but got Float32.` Do you know what these types are? Right now I only use lbox as the loss_item; does that affect the results? It does run somewhat slowly, though.

You can add a print at that point in the code to inspect the type information; if you only modify the loss that is used for printing, it will not affect the results.
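As a pure-Python illustration (not mindyolo's actual code) of the kind of check that raises this error: `ops.Stack` requires every element to share one type, and a single odd element triggers a message of exactly this shape. The helper name `check_uniform_dtype` is hypothetical:

```python
def check_uniform_dtype(items):
    """Return the common type name, or raise TypeError naming the first mismatch,
    mirroring the validator check inside _get_stack_shape."""
    base = type(items[0]).__name__
    for i, it in enumerate(items):
        name = type(it).__name__
        if name != base:
            raise TypeError(
                f"For 'Stack', the 'x_type[{i}]' should be = base: {base}, but got {name}."
            )
    return base

# A float slipped in at index 3 reproduces the shape of the reported error:
try:
    check_uniform_dtype([1, 2, 3, 4.0])
except TypeError as e:
    print(e)
```

In the real failure, `x_type[3]` is a plain `Float32` scalar type while the other loss items are `Tensor[Float32]`, which is why printing each item's dtype before the stack pinpoints the culprit.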

zhanghuiyao commented 1 year ago

We recommend using the specified MindSpore version; other versions may have compatibility problems. For MindSpore installation, refer to:

- mindyolo-r0.1 branch: MindSpore 1.8.1 (plus the matching CANN version)
- mindyolo-master branch: MindSpore 2.0 (plus the matching CANN version)

xunfeng2zkj commented 1 year ago

> `TypeError: For 'Stack', the 'x_type[3]' should be = base: Tensor[Float32], but got Float32.` Do you know what these types are? Right now I only use lbox as the loss_item; does that affect the results?
>
> You can add a print at that point in the code to inspect the type information; if you only modify the loss that is used for printing, it will not affect the results.

I cannot print the type information at all; static (graph) mode cannot be switched off.

zhanghuiyao commented 1 year ago

You can try setting these two parameters to run the code in dynamic-graph (PyNative) mode:

```
--ms_mode 1
--ms_jit False
```
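For reference, a minimal sketch of what these two flags select. The 0/1 values follow MindSpore's convention (0 = GRAPH_MODE, 1 = PYNATIVE_MODE), and `resolve_mode` is a hypothetical helper, not mindyolo code; with jit off, no function is compiled, so the forward pass runs eagerly and intermediate dtypes can be printed:

```python
def resolve_mode(ms_mode: int, ms_jit: bool) -> str:
    """Map the two CLI flags onto MindSpore's execution-mode names."""
    modes = {0: "GRAPH_MODE", 1: "PYNATIVE_MODE"}
    return f"{modes[ms_mode]}, jit={'on' if ms_jit else 'off'}"

print(resolve_mode(1, False))  # PYNATIVE_MODE, jit=off
```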
LiaoYun0x0 commented 1 year ago

```
t int64, reduce precision from int64 to int32.
Traceback (most recent call last):
  File "train.py", line 291, in <module>
    train(args)
  File "train.py", line 283, in train
    ms_jit=args.ms_jit
  File "/home/ma-user/work/mindyolo/mindyolo/utils/trainer_factory.py", line 170, in train
    self.train_step(imgs, labels, cur_step=cur_step, cur_epoch=cur_epoch)
  File "/home/ma-user/work/mindyolo/mindyolo/utils/trainer_factory.py", line 218, in train_step
    loss, loss_item, _, grads_finite = self.train_step_fn(imgs, labels, True)
  File "/home/ma-user/work/mindyolo/mindyolo/utils/train_step_factory.py", line 51, in train_step_func
    (loss, loss_items), grads = grad_fn(x, label)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/functional.py", line 455, in inner_aux_grad_fn
    return res, _grad_weight(aux_fn, weights)(*args)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/composite/base.py", line 530, in after_grad
    return grad(fn, weights)(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/api.py", line 98, in wrapper
    results = fn(*arg, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/composite/base.py", line 518, in after_grad
    out = _pynative_executor(fn, grad.sens_param, *args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/api.py", line 1001, in __call__
    return self._executor(sens_param, obj, *args)
RuntimeError: Response is empty

/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 199 leaked semaphores to clean up at shutdown
  len(cache))
```

zhanghuiyao commented 1 year ago

It looks like something went wrong in the gradient computation.

LiaoYun0x0 commented 1 year ago

> It looks like something went wrong in the gradient computation.

The problem appeared after roughly 7 epochs, so it is unlikely to be related to the data; during training, though, there were many warnings: "don't support int64, reduce precision from int64 to int32".

xunfeng2zkj commented 1 year ago

```
t int64, reduce precision from int64 to int32.
Traceback (most recent call last):
  File "train.py", line 291, in <module>
    train(args)
  File "train.py", line 283, in train
    ms_jit=args.ms_jit
  File "/home/ma-user/work/mindyolo/mindyolo/utils/trainer_factory.py", line 170, in train
    self.train_step(imgs, labels, cur_step=cur_step, cur_epoch=cur_epoch)
  File "/home/ma-user/work/mindyolo/mindyolo/utils/trainer_factory.py", line 218, in train_step
    loss, loss_item, _, grads_finite = self.train_step_fn(imgs, labels, True)
  File "/home/ma-user/work/mindyolo/mindyolo/utils/train_step_factory.py", line 51, in train_step_func
    (loss, loss_items), grads = grad_fn(x, label)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/functional.py", line 455, in inner_aux_grad_fn
    return res, _grad_weight(aux_fn, weights)(*args)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/composite/base.py", line 530, in after_grad
    return grad(fn, weights)(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/api.py", line 98, in wrapper
    results = fn(*arg, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/composite/base.py", line 518, in after_grad
    out = _pynative_executor(fn, grad.sens_param, *args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/api.py", line 1001, in __call__
    return self._executor(sens_param, obj, *args)
RuntimeError: Response is empty

- C++ Call Stack: (For framework developers)
mindspore/ccsrc/backend/common/session/kernel_build_client.h:110 Response

/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 199 leaked semaphores to clean up at shutdown
  len(cache))
```

This looks like a memory leak; memory usage rises with every epoch.
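One way to confirm such a trend is to log the process's host memory at each epoch boundary; a minimal sketch using only the standard library (`resource` is POSIX-only, and `ru_maxrss` is reported in KB on Linux):

```python
import resource


def rss_mb() -> float:
    """Peak resident set size of this process in MB (Linux semantics)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0


# Call this at the end of every epoch; a value that only ever grows
# epoch over epoch is consistent with a host-side leak.
print(f"peak RSS so far: {rss_mb():.1f} MB")
```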

zhanghuiyao commented 1 year ago

> It looks like something went wrong in the gradient computation.
>
> The problem appeared after roughly 7 epochs, so it is unlikely to be related to the data; during training, though, there were many warnings: "don't support int64, reduce precision from int64 to int32".

That warning generally does not affect normal training.
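If the warning noise is a concern, integer labels can be pre-cast to 32-bit on the host so the Ascend kernels never see int64 at all. A hedged NumPy sketch; the label layout below is illustrative, not mindyolo's actual field order:

```python
import numpy as np

# Illustrative label batch: rows of [class_id, x, y, w, h].
# Python ints default to int64 inside np.array on 64-bit Linux.
labels = np.array([[0, 10, 20, 30, 40],
                   [1, 50, 60, 70, 80]])

# Downcast once on the host instead of letting the device reduce precision.
if labels.dtype == np.int64:
    labels = labels.astype(np.int32)

print(labels.dtype)  # int32
```

This only silences the precision-reduction path; it does not change numerical results as long as the label values fit in int32, which pixel coordinates and class ids always do.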

xunfeng2zkj commented 1 year ago

> It looks like something went wrong in the gradient computation.
>
> The problem appeared after roughly 7 epochs, so it is unlikely to be related to the data; during training, though, there were many warnings: "don't support int64, reduce precision from int64 to int32".
>
> That warning generally does not affect normal training.

But memory keeps rising, and the run dies on its own after a few epochs.

zhanghuiyao commented 1 year ago

> It looks like something went wrong in the gradient computation.
>
> The problem appeared after roughly 7 epochs, so it is unlikely to be related to the data; during training, though, there were many warnings: "don't support int64, reduce precision from int64 to int32".
>
> That warning generally does not affect normal training.
>
> But memory keeps rising, and the run dies on its own after a few epochs.

A device memory leak would normally report out of memory; this looks like a problem with execution or compilation under PyNative. You can try a complete training run in graph mode: `--ms_mode 0 --ms_jit True`.

zhanghuiyao commented 1 year ago

> It looks like something went wrong in the gradient computation.
>
> The problem appeared after roughly 7 epochs, so it is unlikely to be related to the data; during training, though, there were many warnings: "don't support int64, reduce precision from int64 to int32".
>
> That warning generally does not affect normal training.
>
> But memory keeps rising, and the run dies on its own after a few epochs.

One more question: is this training on the COCO dataset with the default configuration? And do the versions of the code, MindSpore, and the CANN package in your environment match each other?

xunfeng2zkj commented 1 year ago

1. Actually, on ModelArts I used the official 1.8.0 image and installed 1.8.1 into it; according to the documentation they should be compatible.
2. For training I also used the default configuration, but at runtime a type mismatch appears: `TypeError: For 'Stack', the 'x_type[3]' should be = base: Tensor[Float32], but got Float32.`
3. The support staff say it runs without problems on 2.0.
zhanghuiyao commented 1 year ago

> 1. Actually, on ModelArts I used the official 1.8.0 image and installed 1.8.1 into it; according to the documentation they should be compatible.
> 2. For training I also used the default configuration, but at runtime a type mismatch appears: `TypeError: For 'Stack', the 'x_type[3]' should be = base: Tensor[Float32], but got Float32.`
> 3. The support staff say it runs without problems on 2.0.

MindSpore/CANN version mismatches can cause strange problems. If you have a standard 2.0 environment available, run on 2.0 directly; the matching mindyolo code is on the master branch.