mindspore-lab / mindyolo

MindSpore YOLO series toolbox and benchmark
Apache License 2.0

Training error #186

Open WoooWZY opened 11 months ago

WoooWZY commented 11 months ago

[ERROR] DEVICE(246,ffff807c8ac0,python):2023-08-10-12:33:06.836.235 [mindspore/ccsrc/runtime/device/kernel_runtime_manager.cc:136] WaitTaskFinishOnDevice] SyncStream failed, exception:The pointer[stream] is null.


Traceback (most recent call last):
  File "train.py", line 290, in <module>
    train(args)
  File "train.py", line 116, in train
    sync_bn=args.sync_bn,
  File "/home/ma-user/modelarts/user-job-dir/mindyolo/mindyolo/models/model_factory.py", line 30, in create_model
    model = create_fn(**model_args, **kwargs)
  File "/home/ma-user/modelarts/user-job-dir/mindyolo/mindyolo/models/yolov7.py", line 54, in yolov7
    model = YOLOv7(cfg=cfg, in_channels=in_channels, num_classes=num_classes, **kwargs)
  File "/home/ma-user/modelarts/user-job-dir/mindyolo/mindyolo/models/yolov7.py", line 33, in __init__
    self.reset_parameter()
  File "/home/ma-user/modelarts/user-job-dir/mindyolo/mindyolo/models/yolov7.py", line 45, in reset_parameter
    m.initialize_biases()
  File "/home/ma-user/modelarts/user-job-dir/mindyolo/mindyolo/models/heads/yolov7_head.py", line 79, in initialize_biases
    for mi, s in zip(m.m, m.stride):  # from
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/tensor.py", line 382, in __getitem__
    out = tensor_operator_registry.get('getitem')(self, index)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/composite/multitype_ops/_compile_utils.py", line 57, in _tensor_getitem
    return _tensor_index_by_integer(self, index)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/composite/multitype_ops/_compile_utils.py", line 407, in _tensor_index_by_integer
    return strided_slice(data, begin_strides, end_strides, step_strides, begin_mask, end_mask, 0, 0, shrink_axis_mask)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/composite/multitype_ops/_compile_utils.py", line 43, in strided_slice
    return stridedslice(data, begin_strides, end_strides, step_strides)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/primitive.py", line 294, in __call__
    return _run_op(self, self.name, args)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/api.py", line 98, in wrapper
    results = fn(*arg, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/primitive.py", line 730, in _run_op
    output = real_run_op(obj, op_name, args)
RuntimeError: Ascend kernel runtime initialization failed.




liuchuting commented 11 months ago

You could try converting the NumPy computations in __init__ to Tensor operations.

CaitinZhao commented 11 months ago

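Please check your MindSpore version: versions earlier than 2.1 do not support running on the 910B. Support for the 910B is also still being added step by step, so for now a few additional environment variables are required: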

Graph mode

Running MindSpore in graph mode on the 910B requires compilation through the GE backend:

export MS_ENABLE_GE=1

When running training, also add:

export MS_GE_TRAIN=1

This environment variable is not needed when running inference only.

If the original script saves checkpoints, it is best to also enable:

export MS_ENABLE_REF_MODE=1

PYNATIVE

When using PYNATIVE mode, set:

export MS_ENABLE_REF_MODE=1
export MS_DEV_FORCE_ACL=1

and make sure the GE environment variables are not set:

unset MS_ENABLE_GE
unset MS_GE_TRAIN
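
For anyone who prefers to apply these settings from a launcher script rather than the shell, the rules above could be sketched in Python roughly as follows. The helper names (ascend_910b_env, apply_env, supports_910b) and their parameters are hypothetical; only the environment variable names and the 2.1 version threshold come from the comment above. The variables are presumably read when MindSpore initializes, so this sketch assumes apply_env is called before mindspore is imported.

```python
import os

# GE-related variables that must not be set in PYNATIVE mode (per the comment above).
GE_VARS = ("MS_ENABLE_GE", "MS_GE_TRAIN")


def ascend_910b_env(mode, training=True, saves_checkpoint=True):
    """Return (to_set, to_unset) for the given execution mode.

    Hypothetical helper: mode is "graph" or "pynative".
    """
    if mode == "graph":
        to_set = {"MS_ENABLE_GE": "1"}
        if training:
            # Only needed for training, not for inference only.
            to_set["MS_GE_TRAIN"] = "1"
        if saves_checkpoint:
            to_set["MS_ENABLE_REF_MODE"] = "1"
        return to_set, []
    if mode == "pynative":
        to_set = {"MS_ENABLE_REF_MODE": "1", "MS_DEV_FORCE_ACL": "1"}
        return to_set, list(GE_VARS)
    raise ValueError(f"unknown mode: {mode!r}")


def apply_env(mode, training=True, saves_checkpoint=True, environ=os.environ):
    """Apply the settings to a mapping (the process environment by default)."""
    to_set, to_unset = ascend_910b_env(mode, training, saves_checkpoint)
    for name in to_unset:
        environ.pop(name, None)  # equivalent to `unset NAME`
    environ.update(to_set)


def supports_910b(version):
    """True if a MindSpore version string such as "2.1.0" is 2.1 or newer."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= (2, 1)
```

For example, apply_env("pynative") at the top of a launcher removes MS_ENABLE_GE and MS_GE_TRAIN from the environment and sets the two PYNATIVE variables, while apply_env("graph", training=False) sets MS_ENABLE_GE and MS_ENABLE_REF_MODE only.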