Closed: xunfeng2zkj closed this 3 months ago
(ModelArts terminal ASCII art banner)
Using user ma-user
EulerOS 2.0 (SP8), CANN-6.0.1
Tips:
1) Navigate to the target conda environment. For details, see /home/ma-user/README.
2) Copy (Ctrl+C) and paste (Ctrl+V) on the jupyter terminal.
3) Store your data in /home/ma-user/work, to which a persistent volume is mounted.
This seems to be related to the MindSpore version. You can try using the master branch code on MindSpore 2.0 and the r0.1 branch on MindSpore 1.8.1.
The current error occurs with the MindSpore 2.0 image on ModelArts (provided by the support staff); the MindSpore 1.8.1 image on ModelArts also hits a different error on the r0.1 branch.
Trying mindspore-1.8.1 gives the following error:
[ERROR] ANALYZER(77504,ffffa12a0a40,python3):2023-07-11-18:21:39.720.409 [mindspore/ccsrc/pipeline/jit/static_analysis/async_eval_result.cc:66] HandleException] Exception happened, check the information as below.
The function call stack (See file '/home/ma-user/work/mindyolo/rank_0/om/analyze_fail.dat' for more details. Get instructions about analyze_fail.dat
at https://www.mindspore.cn/search?inputValue=analyze_fail.dat):
0 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py(81)
for pp in p:
1 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py(86)
for i in range(self.nl): # layer index
^
2 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py(123)
return _loss * bs, ops.stop_gradient(ops.stack((_loss, lbox, lobj, lcls)))
^
3 In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/function/array_func.py(1198)
return _stack(input_x)
^
Traceback (most recent call last):
File "train.py", line 290, in
This error looks like an operator type mismatch during the graph compilation phase; it is quite likely related to the CANN package and the MindSpore version.
You can try running the following commands to check the MindSpore version and verify that it is installed correctly:
pip show mindspore
cat /path_to/mindspore/.commit_id
python
>>> import mindspore as ms
>>> ms.run_check()
MindSpore version: 1.8.1
The result of multiplication calculation is correct, MindSpore has been installed successfully!
======== Teacher, I first used the officially provided image, then installed mindspore-ascend-1.8.1 inside that image, and then ran it on ModelArts.
Installing it that way may lead to a mismatch between the MindSpore and CANN versions and trigger unknown errors; you can try asking the official support staff for a standard image that matches 1.8.1/1.9. For the supported version combinations, refer to the MindSpore official website.
TypeError: For 'Stack', the 'x_type[3]' should be = base: Tensor[Float32], but got Float32. Teacher, do you know what these types are? At the moment I only use lbox as loss_item; does that affect the results? It does run a bit slowly, though.
You can add a print at that location to inspect the type information; if you only modify the loss values used for printing, it will not affect the results.
We recommend using the specified MindSpore version; other versions may have compatibility issues. For MindSpore installation: the mindyolo r0.1 branch matches MindSpore 1.8.1 (and its corresponding CANN version), and the mindyolo master branch matches MindSpore 2.0 (and its corresponding CANN version).
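For illustration, a minimal, hypothetical sketch of the kind of debug print that could be added around the ops.stack call shown in the trace above (the names _loss, lbox, lobj, lcls follow yolov7_loss.py; this is not the actual mindyolo implementation):
from mindspore import Tensor, ops

def debug_stack_inputs(_loss, lbox, lobj, lcls):
    # print the Python type and, for Tensors, the dtype of each loss item;
    # a print like this only takes effect when running in PyNative mode
    for name, value in zip(("_loss", "lbox", "lobj", "lcls"), (_loss, lbox, lobj, lcls)):
        print(name, type(value), value.dtype if isinstance(value, Tensor) else "scalar (not a Tensor)")
    # the TypeError above says x_type[3] reaches Stack as a plain Float32 rather than
    # a Tensor[Float32], so every element must be a Tensor before ops.stack is called
    return ops.stop_gradient(ops.stack((_loss, lbox, lobj, lcls)))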
Teacher, these types cannot be printed at all; I cannot switch out of static graph mode.
You can try setting these two arguments to run the code in dynamic-graph (PyNative) mode:
--ms_mode 1
--ms_jit False
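For reference, a rough sketch of what running in PyNative mode makes possible (illustrative only; the exact flag handling lives in train.py):
python
>>> import mindspore as ms
>>> import numpy as np
>>> # roughly what --ms_mode 1 selects; newer MindSpore also exposes ms.set_context
>>> ms.context.set_context(mode=ms.context.PYNATIVE_MODE)
>>> x = ms.Tensor(np.ones((2, 3)), ms.float32)
>>> # with --ms_jit False the train step is not graph-compiled, so intermediate
>>> # values such as the loss items can be printed like ordinary Python objects
>>> print(type(x), x.dtype)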
t int64, reduce precision from int64 to int32.
Traceback (most recent call last):
File "train.py", line 291, in
mindspore/ccsrc/backend/common/session/kernel_build_client.h:110 Response
/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 199 leaked semaphores to clean up at shutdown len(cache))
It looks like something went wrong during the gradient computation.
The problem appeared after roughly 7 epochs, so it most likely has little to do with the data; however, during training there were many WARNINGs: "don't support int64, reduce precision from int64 to int32".
t int64, reduce precision from int64 to int32.
Traceback (most recent call last):
File "train.py", line 291, in
train(args)
File "train.py", line 283, in train
ms_jit=args.ms_jit
File "/home/ma-user/work/mindyolo/mindyolo/utils/trainer_factory.py", line 170, in train
self.train_step(imgs, labels, cur_step=cur_step, cur_epoch=cur_epoch)
File "/home/ma-user/work/mindyolo/mindyolo/utils/trainer_factory.py", line 218, in train_step
loss, loss_item, _, grads_finite = self.train_step_fn(imgs, labels, True)
File "/home/ma-user/work/mindyolo/mindyolo/utils/train_step_factory.py", line 51, in train_step_func
(loss, loss_items), grads = grad_fn(x, label)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/functional.py", line 455, in inner_aux_grad_fn
return res, _grad_weight(aux_fn, weights)(args)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/composite/base.py", line 530, in after_grad
return grad(fn, weights)(args, kwargs)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/api.py", line 98, in wrapper
results = fn(*arg, *kwargs)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/composite/base.py", line 518, in after_grad
out = _pynative_executor(fn, grad.sens_param, args, kwargs)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/api.py", line 1001, in __call__
return self._executor(sens_param, obj, args)
RuntimeError: Response is empty
- C++ Call Stack: (For framework developers)
mindspore/ccsrc/backend/common/session/kernel_build_client.h:110 Response
/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 199 leaked semaphores to clean up at shutdown len(cache))
This looks like a memory leak; the memory usage rises after every epoch.
That int64 precision warning generally does not affect normal training.
But the memory keeps climbing, and after a few epochs the run crashes on its own.
A device memory leak would normally report out of memory; this looks like a problem during execution or compilation in PyNative mode. You can try switching to graph mode for a full training run:
--ms_mode 0 --ms_jit True
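For example, with the same config file as in the log below (the flag wiring is defined in train.py):
python train.py -c configs/yolov7/yolov7.yaml --ms_mode 0 --ms_jit True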
One more question: is this training on the COCO dataset with the default configuration? And are the code, MindSpore, and CANN package versions in your environment matched?
1. In fact, on ModelArts I installed 1.8.1 on top of the official 1.8.0 image; according to the documentation that should be a matching combination.
- For training, I was also using the default configuration, but during the run there appears to be a type mismatch: TypeError: For 'Stack', the 'x_type[3]' should be = base: Tensor[Float32], but got Float32.
- The technical support staff say it runs without problems on 2.0.
A mismatch between the MindSpore and CANN versions can cause some strange problems. If a standard 2.0 environment is available, you can run directly on 2.0; the corresponding mindyolo code is on the master branch.
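As a rough sketch of that suggestion (paths taken from this thread; the branch-to-version mapping is as stated above):
cd /home/ma-user/work/mindyolo
git checkout master      # master matches MindSpore 2.0; use r0.1 for MindSpore 1.8.1
python train.py -c configs/yolov7/yolov7.yaml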
Environment
Hardware Environment (Ascend/GPU/CPU):
Software Environment:
Describe the current behavior
Describe the expected behavior
Steps to reproduce the issue
Related log / screenshot
Special notes for this issue
python3 train.py -c configs/yolov7/yolov7.yaml 2023-07-11 12:30:32,961 [INFO] parse_args: 2023-07-11 12:30:32,961 [INFO] device_target Ascend 2023-07-11 12:30:32,961 [INFO] save_dir ./runs/2023.07.11-12.30.32 2023-07-11 12:30:32,961 [INFO] device_per_servers 8 2023-07-11 12:30:32,961 [INFO] log_level INFO 2023-07-11 12:30:32,961 [INFO] is_parallel False 2023-07-11 12:30:32,961 [INFO] ms_mode 0 2023-07-11 12:30:32,961 [INFO] ms_amp_level O0 2023-07-11 12:30:32,961 [INFO] keep_loss_fp32 True 2023-07-11 12:30:32,961 [INFO] ms_loss_scaler static 2023-07-11 12:30:32,961 [INFO] ms_loss_scaler_value 1024.0 2023-07-11 12:30:32,961 [INFO] ms_grad_sens 1024.0 2023-07-11 12:30:32,961 [INFO] ms_jit True 2023-07-11 12:30:32,961 [INFO] ms_enable_graph_kernel False 2023-07-11 12:30:32,961 [INFO] ms_datasink False 2023-07-11 12:30:32,961 [INFO] overflow_still_update True 2023-07-11 12:30:32,961 [INFO] ema True 2023-07-11 12:30:32,961 [INFO] weight
2023-07-11 12:30:32,961 [INFO] ema_weight
2023-07-11 12:30:32,961 [INFO] freeze [] 2023-07-11 12:30:32,961 [INFO] epochs 300 2023-07-11 12:30:32,961 [INFO] per_batch_size 16 2023-07-11 12:30:32,961 [INFO] img_size 640 2023-07-11 12:30:32,961 [INFO] nbs 64 2023-07-11 12:30:32,961 [INFO] accumulate 1 2023-07-11 12:30:32,961 [INFO] auto_accumulate False 2023-07-11 12:30:32,961 [INFO] log_interval 100 2023-07-11 12:30:32,961 [INFO] single_cls False 2023-07-11 12:30:32,961 [INFO] sync_bn False 2023-07-11 12:30:32,961 [INFO] keep_checkpoint_max 100 2023-07-11 12:30:32,961 [INFO] run_eval False 2023-07-11 12:30:32,961 [INFO] conf_thres 0.001 2023-07-11 12:30:32,961 [INFO] iou_thres 0.65 2023-07-11 12:30:32,961 [INFO] conf_free False 2023-07-11 12:30:32,961 [INFO] rect False 2023-07-11 12:30:32,961 [INFO] nms_time_limit 20.0 2023-07-11 12:30:32,961 [INFO] recompute True 2023-07-11 12:30:32,961 [INFO] recompute_layers 5 2023-07-11 12:30:32,961 [INFO] seed 2 2023-07-11 12:30:32,961 [INFO] summary True 2023-07-11 12:30:32,961 [INFO] profiler False 2023-07-11 12:30:32,961 [INFO] profiler_step_num 1 2023-07-11 12:30:32,961 [INFO] opencv_threads_num 2 2023-07-11 12:30:32,961 [INFO] enable_modelarts False 2023-07-11 12:30:32,961 [INFO] data_url
2023-07-11 12:30:32,961 [INFO] ckpt_url
2023-07-11 12:30:32,961 [INFO] multi_data_url
2023-07-11 12:30:32,961 [INFO] pretrain_url
2023-07-11 12:30:32,961 [INFO] train_url
2023-07-11 12:30:32,961 [INFO] data_dir /cache/data/ 2023-07-11 12:30:32,961 [INFO] ckpt_dir /cache/pretrain_ckpt/ 2023-07-11 12:30:32,961 [INFO] data.path /home/ma-user/work/ 2023-07-11 12:30:32,961 [INFO] data.train_set /home/ma-user/work/night_car/car_train.txt 2023-07-11 12:30:32,961 [INFO] data.val_set /home/ma-user/work/night_car/car_val.txt 2023-07-11 12:30:32,961 [INFO] data.test_set /home/ma-user/work/night_car/car_val.txt 2023-07-11 12:30:32,961 [INFO] data.nc 1 2023-07-11 12:30:32,961 [INFO] data.names ['car'] 2023-07-11 12:30:32,961 [INFO] data.dataset_name coco 2023-07-11 12:30:32,961 [INFO] data.train_transforms [{'func_name': 'mosaic', 'prob': 1.0, 'mosaic9_prob': 0.2, 'translate': 0.2, 'scale': 0.9}, {'func_name': 'mixup', 'prob': 0.15, 'alpha': 8.0, 'beta': 8.0, 'needed_mosaic': True}, {'func_name': 'hsv_augment', 'prob': 1.0, 'hgain': 0.015, 'sgain': 0.7, 'vgain': 0.4}, {'func_name': 'pastein', 'prob': 0.15, 'num_sample': 30}, {'func_name': 'labelnorm', 'xyxy2xywh': True}, {'func_name': 'fliplr', 'prob': 0.5}, {'func_name': 'label_pad', 'padding_size': 160, 'padding_value': -1}, {'func_name': 'image_norm', 'scale': 255.0}, {'func_name': 'image_transpose', 'bgr2rgb': True, 'hwc2chw': True}] 2023-07-11 12:30:32,961 [INFO] data.test_transforms [{'func_name': 'letterbox', 'scaleup': False}, {'func_name': 'labelnorm', 'xyxy2xywh': True}, {'func_name': 'label_pad', 'padding_size': 160, 'padding_value': -1}, {'func_name': 'image_norm', 'scale': 255.0}, {'func_name': 'image_transpose', 'bgr2rgb': True, 'hwc2chw': True}] 2023-07-11 12:30:32,961 [INFO] data.num_parallel_workers 4 2023-07-11 12:30:32,961 [INFO] optimizer.optimizer momentum 2023-07-11 12:30:32,961 [INFO] optimizer.lr_init 0.01 2023-07-11 12:30:32,961 [INFO] optimizer.momentum 0.937 2023-07-11 12:30:32,961 [INFO] optimizer.nesterov True 2023-07-11 12:30:32,961 [INFO] optimizer.loss_scale 1.0 2023-07-11 12:30:32,961 [INFO] optimizer.warmup_epochs 3 2023-07-11 12:30:32,961 [INFO] optimizer.warmup_momentum 0.8 2023-07-11 12:30:32,961 [INFO] optimizer.warmup_bias_lr 0.1 2023-07-11 12:30:32,961 [INFO] optimizer.min_warmup_step 1000 2023-07-11 12:30:32,961 [INFO] optimizer.group_param yolov7 2023-07-11 12:30:32,961 [INFO] optimizer.gp_weight_decay 0.0005 2023-07-11 12:30:32,961 [INFO] optimizer.start_factor 1.0 2023-07-11 12:30:32,961 [INFO] optimizer.end_factor 0.1 2023-07-11 12:30:32,961 [INFO] optimizer.epochs 300 2023-07-11 12:30:32,961 [INFO] optimizer.nbs 64 2023-07-11 12:30:32,961 [INFO] optimizer.accumulate 1 2023-07-11 12:30:32,961 [INFO] optimizer.total_batch_size 16 2023-07-11 12:30:32,961 [INFO] loss.name YOLOv7Loss 2023-07-11 12:30:32,961 [INFO] loss.box 0.05 2023-07-11 12:30:32,961 [INFO] loss.cls 0.3 2023-07-11 12:30:32,961 [INFO] loss.cls_pw 1.0 2023-07-11 12:30:32,961 [INFO] loss.obj 0.7 2023-07-11 12:30:32,961 [INFO] loss.obj_pw 1.0 2023-07-11 12:30:32,961 [INFO] loss.fl_gamma 0.0 2023-07-11 12:30:32,961 [INFO] loss.anchor_t 4.0 2023-07-11 12:30:32,961 [INFO] loss.label_smoothing 0.0 2023-07-11 12:30:32,961 [INFO] network.model_name yolov7 2023-07-11 12:30:32,961 [INFO] network.depth_multiple 1.0 2023-07-11 12:30:32,961 [INFO] network.width_multiple 1.0 2023-07-11 12:30:32,961 [INFO] network.stride [8, 16, 32] 2023-07-11 12:30:32,961 [INFO] network.anchors [[12, 16, 19, 36, 40, 28], [36, 75, 76, 55, 72, 146], [142, 110, 192, 243, 459, 401]] 2023-07-11 12:30:32,961 [INFO] network.backbone [[-1, 1, 'ConvNormAct', [32, 3, 1]], [-1, 1, 'ConvNormAct', [64, 3, 2]], [-1, 1, 'ConvNormAct', [64, 3, 1]], [-1, 1, 
'ConvNormAct', [128, 3, 2]], [-1, 1, 'ConvNormAct', [64, 1, 1]], [-2, 1, 'ConvNormAct', [64, 1, 1]], [-1, 1, 'ConvNormAct', [64, 3, 1]], [-1, 1, 'ConvNormAct', [64, 3, 1]], [-1, 1, 'ConvNormAct', [64, 3, 1]], [-1, 1, 'ConvNormAct', [64, 3, 1]], [[-1, -3, -5, -6], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [256, 1, 1]], [-1, 1, 'MP', []], [-1, 1, 'ConvNormAct', [128, 1, 1]], [-3, 1, 'ConvNormAct', [128, 1, 1]], [-1, 1, 'ConvNormAct', [128, 3, 2]], [[-1, -3], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [128, 1, 1]], [-2, 1, 'ConvNormAct', [128, 1, 1]], [-1, 1, 'ConvNormAct', [128, 3, 1]], [-1, 1, 'ConvNormAct', [128, 3, 1]], [-1, 1, 'ConvNormAct', [128, 3, 1]], [-1, 1, 'ConvNormAct', [128, 3, 1]], [[-1, -3, -5, -6], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [512, 1, 1]], [-1, 1, 'MP', []], [-1, 1, 'ConvNormAct', [256, 1, 1]], [-3, 1, 'ConvNormAct', [256, 1, 1]], [-1, 1, 'ConvNormAct', [256, 3, 2]], [[-1, -3], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [256, 1, 1]], [-2, 1, 'ConvNormAct', [256, 1, 1]], [-1, 1, 'ConvNormAct', [256, 3, 1]], [-1, 1, 'ConvNormAct', [256, 3, 1]], [-1, 1, 'ConvNormAct', [256, 3, 1]], [-1, 1, 'ConvNormAct', [256, 3, 1]], [[-1, -3, -5, -6], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [1024, 1, 1]], [-1, 1, 'MP', []], [-1, 1, 'ConvNormAct', [512, 1, 1]], [-3, 1, 'ConvNormAct', [512, 1, 1]], [-1, 1, 'ConvNormAct', [512, 3, 2]], [[-1, -3], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [256, 1, 1]], [-2, 1, 'ConvNormAct', [256, 1, 1]], [-1, 1, 'ConvNormAct', [256, 3, 1]], [-1, 1, 'ConvNormAct', [256, 3, 1]], [-1, 1, 'ConvNormAct', [256, 3, 1]], [-1, 1, 'ConvNormAct', [256, 3, 1]], [[-1, -3, -5, -6], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [1024, 1, 1]]] 2023-07-11 12:30:32,961 [INFO] network.head [[-1, 1, 'SPPCSPC', [512]], [-1, 1, 'ConvNormAct', [256, 1, 1]], [-1, 1, 'Upsample', ['None', 2, 'nearest']], [37, 1, 'ConvNormAct', [256, 1, 1]], [[-1, -2], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [256, 1, 1]], [-2, 1, 'ConvNormAct', [256, 1, 1]], [-1, 1, 'ConvNormAct', [128, 3, 1]], [-1, 1, 'ConvNormAct', [128, 3, 1]], [-1, 1, 'ConvNormAct', [128, 3, 1]], [-1, 1, 'ConvNormAct', [128, 3, 1]], [[-1, -2, -3, -4, -5, -6], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [256, 1, 1]], [-1, 1, 'ConvNormAct', [128, 1, 1]], [-1, 1, 'Upsample', ['None', 2, 'nearest']], [24, 1, 'ConvNormAct', [128, 1, 1]], [[-1, -2], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [128, 1, 1]], [-2, 1, 'ConvNormAct', [128, 1, 1]], [-1, 1, 'ConvNormAct', [64, 3, 1]], [-1, 1, 'ConvNormAct', [64, 3, 1]], [-1, 1, 'ConvNormAct', [64, 3, 1]], [-1, 1, 'ConvNormAct', [64, 3, 1]], [[-1, -2, -3, -4, -5, -6], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [128, 1, 1]], [-1, 1, 'MP', []], [-1, 1, 'ConvNormAct', [128, 1, 1]], [-3, 1, 'ConvNormAct', [128, 1, 1]], [-1, 1, 'ConvNormAct', [128, 3, 2]], [[-1, -3, 63], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [256, 1, 1]], [-2, 1, 'ConvNormAct', [256, 1, 1]], [-1, 1, 'ConvNormAct', [128, 3, 1]], [-1, 1, 'ConvNormAct', [128, 3, 1]], [-1, 1, 'ConvNormAct', [128, 3, 1]], [-1, 1, 'ConvNormAct', [128, 3, 1]], [[-1, -2, -3, -4, -5, -6], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [256, 1, 1]], [-1, 1, 'MP', []], [-1, 1, 'ConvNormAct', [256, 1, 1]], [-3, 1, 'ConvNormAct', [256, 1, 1]], [-1, 1, 'ConvNormAct', [256, 3, 2]], [[-1, -3, 51], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [512, 1, 1]], [-2, 1, 'ConvNormAct', [512, 1, 1]], [-1, 1, 'ConvNormAct', [256, 3, 1]], [-1, 1, 'ConvNormAct', [256, 3, 1]], [-1, 1, 'ConvNormAct', [256, 3, 1]], [-1, 1, 'ConvNormAct', [256, 3, 1]], [[-1, -2, -3, -4, -5, -6], 1, 
'Concat', [1]], [-1, 1, 'ConvNormAct', [512, 1, 1]], [75, 1, 'RepConv', [256, 3, 1]], [88, 1, 'RepConv', [512, 3, 1]], [101, 1, 'RepConv', [1024, 3, 1]], [[102, 103, 104], 1, 'YOLOv7Head', ['nc', 'anchors', 'stride']]] 2023-07-11 12:30:32,961 [INFO] config configs/yolov7/yolov7.yaml 2023-07-11 12:30:32,961 [INFO] rank 0 2023-07-11 12:30:32,961 [INFO] rank_size 1 2023-07-11 12:30:32,961 [INFO] total_batch_size 16 2023-07-11 12:30:32,961 [INFO] callback [] 2023-07-11 12:30:32,961 [INFO] 2023-07-11 12:30:32,963 [INFO] Please check the above information for the configurations 2023-07-11 12:30:33,910 [WARNING] Parse Model, args: nearest, keep str type 2023-07-11 12:30:34,007 [WARNING] Parse Model, args: nearest, keep str type 2023-07-11 12:30:34,410 [INFO] number of network params, total: 37.246339M, trainable: 37.196556M 2023-07-11 12:30:34,422 [INFO] Turn on recompute, and the results of the first 5 layers will be recomputed. 2023-07-11 12:30:54,044 [WARNING] Parse Model, args: nearest, keep str type 2023-07-11 12:30:54,141 [WARNING] Parse Model, args: nearest, keep str type 2023-07-11 12:30:54,554 [INFO] number of network params, total: 37.246339M, trainable: 37.196556M 2023-07-11 12:30:54,566 [INFO] Turn on recompute, and the results of the first 5 layers will be recomputed.
TotalTime = 12.5261, [16]
[symbol_resolve]: 0.0445111, [1] [Cycle 1]: 0.0444429, [1]
[optimize]: 0.267299, [22]
Sums parse : 0.027688s : 0.22% symbol_resolve.resolve : 0.044421s : 0.35% combine_like_graphs : 0.000001s : 0.00% meta_unpack_prepare : 0.000177s : 0.00% abstract_specialize : 0.450212s : 3.60% auto_monad : 0.008033s : 0.06% inline : 0.000058s : 0.00% pipeline_split : 0.000032s : 0.00% optimize.py_interpret_to_execute : 0.000857s : 0.01% optimize.simplify_data_structures : 0.001757s : 0.01% optimize.opt_a.expand_dump_flag : 0.000024s : 0.00% optimize.opt_a.switch_simplify : 0.002162s : 0.02% optimize.opt_a.a_1 : 0.106726s : 0.85% optimize.opt_a.recompute_prepare : 0.001223s : 0.01% optimize.opt_a.updatestate_depend_eliminate : 0.009046s : 0.07% optimize.opt_a.updatestate_assign_eliminate : 0.008423s : 0.07% optimize.opt_a.updatestate_loads_eliminate : 0.001097s : 0.01% optimize.opt_a.parameter_eliminate : 0.000008s : 0.00% optimize.opt_a.a_2 : 0.016431s : 0.13% optimize.opt_a.accelerated_algorithm : 0.001508s : 0.01% optimize.opt_a.pynative_shard : 0.000005s : 0.00% optimize.opt_a.auto_parallel : 0.000011s : 0.00% optimize.opt_a.parallel : 0.000024s : 0.00% optimize.opt_a.allreduce_fusion : 0.000466s : 0.00% optimize.opt_a.virtual_dataset : 0.000988s : 0.01% optimize.opt_a.get_gradeliminate : 0.000899s : 0.01% optimize.opt_a.virtual_output : 0.000901s : 0.01% optimize.opt_a.meta_fg_expand : 0.001994s : 0.02% optimize.opt_a.after_resolve : 0.002742s : 0.02% optimize.opt_a.a_after_grad : 0.001206s : 0.01% optimize.opt_a.renormalize : 0.044053s : 0.35% optimize.opt_a.real_op_eliminate : 0.001057s : 0.01% optimize.opt_a.auto_monad_grad : 0.000010s : 0.00% optimize.opt_a.auto_monad_eliminator : 0.004226s : 0.03% optimize.opt_a.cse : 0.008287s : 0.07% optimize.opt_a.a_3 : 0.009144s : 0.07% optimize.item_dict_eliminate_after_opt_a.mutable_eliminate : 0.000456s : 0.00% optimize.item_dict_eliminate_after_opt_a.item_dict_eliminate : 0.000747s : 0.01% optimize.clean_after_opta : 0.000493s : 0.00% optimize.opt_b.b_1 : 0.012488s : 0.10% optimize.opt_b.b_2 : 0.000553s : 0.00% optimize.opt_b.updatestate_depend_eliminate : 0.000306s : 0.00% optimize.opt_b.updatestate_assign_eliminate : 0.000388s : 0.00% optimize.opt_b.updatestate_loads_eliminate : 0.000427s : 0.00% optimize.opt_b.renormalize : 0.000001s : 0.00% optimize.opt_b.cse : 0.001932s : 0.02% optimize.cconv : 0.000307s : 0.00% optimize.opt_after_cconv.c_1 : 0.002027s : 0.02% optimize.opt_after_cconv.updatestate_depend_eliminate : 0.000309s : 0.00% optimize.opt_after_cconv.updatestate_assign_eliminate : 0.000387s : 0.00% optimize.opt_after_cconv.updatestate_loads_eliminate : 0.000425s : 0.00% optimize.opt_after_cconv.cse : 0.001935s : 0.02% optimize.opt_after_cconv.renormalize : 0.000001s : 0.00% optimize.remove_dup_value : 0.000111s : 0.00% optimize.tuple_transform.d_1 : 0.003534s : 0.03% optimize.tuple_transform.renormalize : 0.000001s : 0.00% optimize.add_cache_embedding : 0.004314s : 0.03% optimize.add_recomputation : 0.004762s : 0.04% optimize.cse_after_recomputation.cse : 0.002028s : 0.02% optimize.environ_conv : 0.000868s : 0.01% optimize.label_micro_interleaved_index : 0.000004s : 0.00% optimize.slice_recompute_activation : 0.000003s : 0.00% optimize.micro_interleaved_order_control : 0.000003s : 0.00% optimize.reorder_send_recv_between_fp_bp : 0.000002s : 0.00% optimize.comm_op_add_attrs : 0.000024s : 0.00% optimize.add_comm_op_reuse_tag : 0.000002s : 0.00% optimize.overlap_opt_shard_in_pipeline : 0.000002s : 0.00% optimize.handle_group_info : 0.000001s : 0.00% auto_monad_reorder : 0.002861s : 0.02% eliminate_forward_cnode : 0.000001s : 
0.00% eliminate_special_op_node : 0.002115s : 0.02% validate : 0.002463s : 0.02% distribtued_split : 0.000002s : 0.00% task_emit : 11.720087s : 93.59% execute : 0.000011s : 0.00%
Time group info: ------[substitution.] 0.069899 13241 0.05% : 0.000035s : 2: substitution.depend_value_elim 61.05% : 0.042670s : 10: substitution.getattr_resolve 0.89% : 0.000624s : 1751: substitution.graph_param_transform 26.80% : 0.018731s : 955: substitution.inline 0.11% : 0.000079s : 320: substitution.less_batch_normalization 0.05% : 0.000036s : 9: substitution.meta_unpack_prepare 0.73% : 0.000507s : 1906: substitution.replace_old_param 2.34% : 0.001638s : 952: substitution.tuple_list_get_item_eliminator 2.96% : 0.002069s : 3508: substitution.updatestate_pure_node_eliminater 5.02% : 0.003510s : 3828: substitution.updatestate_useless_node_eliminater ------[renormalize.] 0.043854 2 50.05% : 0.021948s : 1: renormalize.infer 49.95% : 0.021905s : 1: renormalize.specialize ------[replace.] 0.020341 1916 5.97% : 0.001215s : 9: replace.getattr_resolve 61.80% : 0.012570s : 955: replace.inline 32.23% : 0.006556s : 952: replace.tuple_list_get_item_eliminator ------[match.] 0.063034 1916 67.69% : 0.042665s : 9: match.getattr_resolve 29.72% : 0.018731s : 955: match.inline 2.60% : 0.001638s : 952: match.tuple_list_get_item_eliminator ------[func_graph_cloner_run.] 0.037680 1004 34.60% : 0.013037s : 47: func_graph_cloner_run.FuncGraphClonerGraph 22.20% : 0.008364s : 862: func_graph_cloner_run.FuncGraphClonerNode 43.20% : 0.016279s : 95: func_graph_cloner_run.FuncGraphSpecializer ------[meta_graph.] 0.000000 0 ------[manager.] 0.000000 0 ------[pynative] 0.000000 0 ------[others.] 0.210734 104 12.09% : 0.025470s : 50: opt.transform.opt_a 5.90% : 0.012443s : 23: opt.transform.opt_b 21.07% : 0.044391s : 2: opt.transform.opt_resolve 0.57% : 0.001198s : 2: opt.transforms.item_dict_eliminate_after_opt_a 0.08% : 0.000160s : 1: opt.transforms.meta_unpack_prepare 56.65% : 0.119370s : 20: opt.transforms.opt_a 0.96% : 0.002024s : 1: opt.transforms.opt_after_cconv 0.26% : 0.000550s : 1: opt.transforms.opt_b 1.68% : 0.003531s : 1: opt.transforms.opt_trans_graph 0.76% : 0.001597s : 3: opt.transforms.special_op_eliminate
2023-07-11 12:31:07,575 [INFO] ema_weight not exist, default pretrain weight is currently used. 2023-07-11 12:31:07,722 [INFO] Dataset cache file hash/version check fail. 2023-07-11 12:31:07,722 [INFO] Datset caching now... Scanning '/home/ma-user/work/night_car/car_train.cache' images and labels... 4726 found, 0 missing, 179 empty, 0 corrupted: 100%|█| 4726/4726 [00:03< 2023-07-11 12:31:11,640 [INFO] New cache created: /home/ma-user/work/night_car/car_train.cache.npy 2023-07-11 12:31:11,647 [INFO] Dataset caching success. 2023-07-11 12:31:11,725 [INFO] Dataloader num parallel workers: [4] 2023-07-11 12:31:14,025 [INFO] Registry(name=callback, total=4) 2023-07-11 12:31:14,025 [INFO] (0): YoloxSwitchTrain in mindyolo/utils/callback.py 2023-07-11 12:31:14,025 [INFO] (1): EvalWhileTrain in mindyolo/utils/callback.py 2023-07-11 12:31:14,025 [INFO] (2): SummaryCallback in mindyolo/utils/callback.py 2023-07-11 12:31:14,025 [INFO] (3): ProfilerCallback in mindyolo/utils/callback.py 2023-07-11 12:31:14,025 [INFO] 2023-07-11 12:31:14,427 [INFO] got 1 active callback as follows: 2023-07-11 12:31:14,428 [INFO] SummaryCallback() 2023-07-11 12:31:14,428 [WARNING] The first epoch will be compiled for the graph, which may take a long time; You can come back later :). [ERROR] ANALYZER(28402,ffffbed26a70,python3):2023-07-11-12:58:09.445.089 [mindspore/ccsrc/pipeline/jit/static_analysis/async_eval_result.cc:66] HandleException] Exception happened, check the information as below.
The function call stack (See file '/home/ma-user/work/mindyolo/rank_0/om/analyze_fail.dat' for more details. Get instructions about analyze_fail.dat at https://www.mindspore.cn/search?inputValue=analyze_fail.dat):
0 In file /home/ma-user/work/mindyolo/mindyolo/utils/train_step_factory.py:72
1 In file /home/ma-user/work/mindyolo/mindyolo/utils/train_step_factory.py:57
2 In file /home/ma-user/work/mindyolo/mindyolo/utils/train_step_factory.py:52
3 In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/composite/base.py:574
4 In file /home/ma-user/work/mindyolo/mindyolo/utils/train_step_factory.py:45
5 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py:81
6 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py:69
7 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py:150
8 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py:126
9 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py:296
10 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py:298
11 In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/composite/multitype_ops/_compile_utils.py:918
12 In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/composite/multitype_ops/_compile_utils.py:921
13 In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/composite/multitype_ops/_compile_utils.py:931
14 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py:298
15 In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/composite/multitype_ops/_compile_utils.py:934
Traceback (most recent call last):
File "train.py", line 309, in
train(args)
File "train.py", line 282, in train
profiler_step_num=args.profiler_step_num
File "/home/ma-user/work/mindyolo/mindyolo/utils/trainer_factory.py", line 169, in train
run_context.loss, run_context.lr = self.train_step(imgs, labels, cur_step=cur_step,cur_epoch=cur_epoch)
File "/home/ma-user/work/mindyolo/mindyolo/utils/trainer_factory.py", line 357, in train_step
loss, loss_item, _, grads_finite = self.train_step_fn(imgs, labels, True)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/api.py", line 588, in staging_specialize
out = _MindsporeFunctionExecutor(func, hash_obj, input_signature, process_obj, jit_config)(args)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/api.py", line 101, in wrapper
results = fn(arg, kwargs)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/api.py", line 313, in call
raise err
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/api.py", line 310, in call
phase = self.compile(args_list, self.fn.name)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/api.py", line 386, in compile
is_compile = self._graph_executor.compile(self.fn, compile_args, phase, True)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/primitive.py", line 811, in infer
return {'dtype': None, 'shape': None, 'value': fn(value_args)}
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/composite/multitype_ops/_constexpr_utils.py", line 401, in slice2indices
mstype.int64, (), stop), P.Fill()(mstype.int64, (), step))]
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/primitive.py", line 317, in call
return _run_op(self, self.name, args)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/primitive.py", line 885, in _run_op
return _run_op_sync(obj, op_name, args)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/api.py", line 101, in wrapper
results = fn(arg, kwargs)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/primitive.py", line 891, in _run_op_sync
output = _pynative_executor.real_run_op(obj, op_name, args)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/api.py", line 1023, in real_run_op
return self._executor.real_run_op(*args)
RuntimeError: The node: Default/Range-op805 compute tiling failed!
Ascend Error Message:
E89999: Inner Error, Please contact support engineer!
E89999 op[Range], compile info not contain [_pattern][FUNC:AutoTilingHandlerParser][FILE:auto_tiling.cc][LINE:67]
TraceBack (most recent call last):
Failed to parse compile json[{"_sgt_cube_vector_core_type":"AiCore","device_id":"0"}] for op[Range, Range].[FUNC:TurnToOpParaCalculateV4][FILE:op_tiling.cc][LINE:442]
(Please search "Ascend Error Message" at https://www.mindspore.cn for error code description)
The Traceback of Net Construct Code:
The function call stack (See file '/home/ma-user/work/mindyolo/rank_0/om/analyze_fail.dat' for more details. Get instructions about analyze_fail.dat at https://www.mindspore.cn/search?inputValue=analyze_fail.dat):
0 In file /home/ma-user/work/mindyolo/mindyolo/utils/train_step_factory.py:72
1 In file /home/ma-user/work/mindyolo/mindyolo/utils/train_step_factory.py:57
2 In file /home/ma-user/work/mindyolo/mindyolo/utils/train_step_factory.py:52
3 In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/composite/base.py:574
4 In file /home/ma-user/work/mindyolo/mindyolo/utils/train_step_factory.py:45
5 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py:81
6 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py:69
7 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py:150
8 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py:126
9 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py:296
10 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py:298
11 In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/composite/multitype_ops/_compile_utils.py:918
if check_result:
12 In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/composite/multitype_ops/_compile_utils.py:921
13 In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/composite/multitype_ops/_compile_utils.py:931
14 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py:298
15 In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/composite/multitype_ops/_compile_utils.py:934
C++ Call Stack: (For framework developers)
mindspore/ccsrc/plugin/device/ascend/kernel/tbe/dynamic_tbe_kernel_mod.cc:126 Resize