mindspore-lab / mindyolo

A toolbox of yolo models and algorithms based on MindSpore
Apache License 2.0

mpirun multi-card training fails with EOFError #238

Closed lonngxiang closed 1 month ago

lonngxiang commented 10 months ago

mpirun --allow-run-as-root -n 2 python train.py --config ./configs/yolov8/yolov8n1.yaml --is_parallel True


zhanghuiyao commented 10 months ago

What environment is this? Could you also show the error output that comes before this?

lonngxiang commented 10 months ago

What environment is this? Could you also show the error output that comes before this?

A cloud environment.

[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3436928] Failing at address: 0xb8
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3436928] [ 0] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xffff97f657c0]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3436928] [ 1] /usr/local/Ascend/ascend-toolkit/latest/lib64/libhcom_graph_adaptor.so(_ZN4hccl22HcomOpsKernelInfoStore19GetCommFromTaskInfoERKN2ge10GETaskInfoERl+0x40)[0xffff552c55b4]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3436928] [ 2] /usr/local/Ascend/ascend-toolkit/latest/lib64/libhcom_graph_adaptor.so(_ZN4hccl22HcomOpsKernelInfoStore10UnloadTaskERN2ge10GETaskInfoE+0x474)[0xffff55311fa8]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3436928] [ 3] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/lib/plugin/libmindspore_ascend.so.1(+0x30fde90)[0xffff7accfe90]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3436928] [ 4] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/lib/plugin/libmindspore_ascend.so.1(_ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv+0xb4)[0xffff7abafc04]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3436928] [ 5] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/lib/plugin/libmindspore_ascend.so.1(+0x30f466c)[0xffff7acc666c]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3436928] [ 6] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/lib/plugin/libmindspore_ascend.so.1(_ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv+0xb4)[0xffff7abafc04]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3436928] [ 7] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/lib/plugin/libmindspore_ascend.so.1(+0x30f2d84)[0xffff7acc4d84]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3436928] [ 8] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/lib/plugin/libmindspore_ascend.so.1(+0x306d0a0)[0xffff7ac3f0a0]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3436928] [ 9] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/lib/libmindspore_backend.so(_ZN9mindspore6device20KernelRuntimeManager18ClearGraphResourceEj+0xa0)[0xffff8e020310]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3436928] [10] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/lib/libmindspore_backend.so(_ZN9mindspore7session11KernelGraphD1Ev+0xb4)[0xffff8dceaaf4]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3436928] [11] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/_c_expression.cpython-39-aarch64-linux-gnu.so(_ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv+0xb4)[0xffff93305a74]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3436928] [12] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/lib/libmindspore_backend.so(_ZN9mindspore7session14KernelGraphMgrD2Ev+0x39c)[0xffff8dd1da4c]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3436928] [13] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/_c_expression.cpython-39-aarch64-linux-gnu.so(_ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv+0xb4)[0xffff93305a74]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3436928] [14] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/_c_expression.cpython-39-aarch64-linux-gnu.so(_ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv+0xb4)[0xffff93305a74]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3436928] [15] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/_c_expression.cpython-39-aarch64-linux-gnu.so(_ZN9mindspore7compile17MindRTBackendBaseD1Ev+0x64)[0xffff93c064ec]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3436928] [16] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/_c_expression.cpython-39-aarch64-linux-gnu.so(_ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv+0xb4)[0xffff93305a74]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3436928] [17] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/_c_expression.cpython-39-aarch64-linux-gnu.so(+0x25347b8)[0xffff93af47b8]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3436928] [18] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/_c_expression.cpython-39-aarch64-linux-gnu.so(_ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv+0xb4)[0xffff93305a74]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3436928] [19] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/_c_expression.cpython-39-aarch64-linux-gnu.so(+0x250a78c)[0xffff93aca78c]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3436928] [20] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/_c_expression.cpython-39-aarch64-linux-gnu.so(+0x250aca0)[0xffff93acaca0]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3436928] [21] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/_c_expression.cpython-39-aarch64-linux-gnu.so(+0x2520dc4)[0xffff93ae0dc4]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3436928] [22] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/_c_expression.cpython-39-aarch64-linux-gnu.so(+0x2500b48)[0xffff93ac0b48]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3436928] [23] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/_c_expression.cpython-39-aarch64-linux-gnu.so(+0x1d5f340)[0xffff9331f340]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3436928] [24] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/_c_expression.cpython-39-aarch64-linux-gnu.so(+0x1d5bb5c)[0xffff9331bb5c]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3436928] [25] python(+0x20e3bc)[0xaaaaabb3b3bc]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3436928] [26] python(_PyObject_MakeTpCall+0xa0)[0xaaaaab9aacb0]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3436928] [27] python(+0x1f8e18)[0xaaaaabb25e18]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3436928] [28] python(_PyEval_EvalFrameDefault+0x5ec0)[0xaaaaab999f30]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3436928] [29] python(+0x1134c0)[0xaaaaaba404c0]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3436928] *** End of error message ***

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/utils/multiprocess_util.py", line 60, in run
    key, func, args, kwargs = self.task_q.get(timeout=TIMEOUT)
  File "", line 2, in get
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/managers.py", line 810, in _callmethod
    kind, result = conn.recv()
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/connection.py", line 255, in recv
    buf = self._recv_bytes()
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes
    buf = self._recv(4)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/connection.py", line 388, in _recv
    raise EOFError
EOFError
[WARNING] ME(3443384:281473600106672,WriterPool-31):2023-11-14-15:22:22.528.179 [mindspore/train/summary/_writer_pool.py:192] The training process 3436928 has exited, summary process will exit.

mpirun noticed that process rank 0 with PID 0 on node notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60 exited on signal 11 (Segmentation fault).

zhanghuiyao commented 10 months ago

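Try running on a single card first to check whether the code itself has a problem. If that works, the openmpi or multiprocessing packages in the environment may be the wrong versions.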

lonngxiang commented 10 months ago

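Try running on a single card first to check whether the code itself has a problem. If that works, the openmpi or multiprocessing packages in the environment may be the wrong versions.

Yes, single-card training runs without problems. The package version is:

multiprocess 0.70.12.2

I don't see an openmpi package. Does it need to be installed separately? The mpirun command itself is available.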
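A minimal sketch for collecting the versions discussed here, offered as an illustration rather than anything shipped with mindyolo. It assumes Open MPI is installed at the system level (so it may not show up in a Python package listing) and that `mpirun --version` prints a line such as `mpirun (Open MPI) 4.0.6`, the form quoted later in this thread. The helper names are hypothetical.

```python
import re
import shutil
import subprocess
from importlib import metadata


def package_version(name):
    """Return the installed version of a Python distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def parse_open_mpi_version(text):
    """Extract the version from output such as 'mpirun (Open MPI) 4.0.6'."""
    match = re.search(r"\(Open MPI\)\s+(\d+(?:\.\d+)*)", text)
    return match.group(1) if match else None


def mpirun_version():
    """Ask the mpirun found on PATH for its version; None if unavailable."""
    exe = shutil.which("mpirun")
    if exe is None:
        return None
    try:
        result = subprocess.run(
            [exe, "--version"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.SubprocessError):
        return None
    return parse_open_mpi_version(result.stdout + result.stderr)


if __name__ == "__main__":
    # Python-level packages mentioned in this thread.
    for name in ("mindspore", "multiprocess"):
        print(f"{name}: {package_version(name)}")
    # Open MPI is queried through the launcher, not through pip metadata.
    print(f"Open MPI (mpirun): {mpirun_version()}")
```

The printed versions can then be compared by hand against the installation guide for the mindyolo release in use.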

lonngxiang commented 10 months ago

paddlenlp 2.5.2 requires multiprocess<=0.70.12.2, so this version matches.

lonngxiang commented 10 months ago

I upgraded mindspore from 2.0 to 2.2, and now it fails with this error:

RuntimeError: Unsupported device target Ascend. This process only supports one of the ['CPU']. Please check whether the Ascend environment is installed and configured correctly, and check whether current mindspore wheel package was built with "-e Ascend". For details, please refer to "Device load error message".




Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[24669,1],

zhanghuiyao commented 10 months ago

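The mindspore version has to match the driver version on the node the notebook runs on, otherwise it will not run. If single-card training works, there is no need to upgrade mindspore.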
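A hedged sketch of a quick check for this kind of mismatch: it reports the installed mindspore version and whether the process accepts the Ascend device target. It assumes `mindspore.set_context(device_target=...)` is where an unsupported target is rejected, which may differ between releases, so any exception is caught and reported rather than interpreted. The function name is hypothetical.

```python
import importlib


def check_device_target(target="Ascend"):
    """Return a one-line report on whether MindSpore accepts `target` here."""
    try:
        ms = importlib.import_module("mindspore")
    except ImportError:
        return "mindspore is not installed in this environment"
    version = getattr(ms, "__version__", "unknown")
    try:
        # Assumption: an unsupported target raises at this call.
        ms.set_context(device_target=target)
    except Exception as exc:
        return f"mindspore {version}: device target {target!r} rejected: {exc}"
    return f"mindspore {version}: device target {target!r} accepted"


if __name__ == "__main__":
    print(check_device_target("Ascend"))
```

Running this once in a plain Python process, before launching with mpirun, separates installation problems from problems specific to the parallel launch.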

zhanghuiyao commented 10 months ago

paddlenlp 2.5.2 requires multiprocess<=0.70.12.2, so this version matches.

Why is the paddlenlp library being called at all?

zhanghuiyao commented 10 months ago

You could try the versions listed here: https://github.com/mindspore-lab/mindyolo/blob/master/docs/en/installation.md

lonngxiang commented 10 months ago

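paddlenlp 2.5.2 requires multiprocess<=0.70.12.2, so this version matches.

Why is the paddlenlp library being called at all?

Sorry, I posted that by mistake. I upgraded multiprocess to 0.70.15 and it still does not work.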

lonngxiang commented 10 months ago

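You could try the versions listed here: https://github.com/mindspore-lab/mindyolo/blob/master/docs/en/installation.md

Right. I checked, and my local mpirun (Open MPI) is version 4.0.6, which is indeed a bit different. I will reinstall and try again.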

lonngxiang commented 10 months ago

Still not working for now. Here is the full error log:

lonngxiang commented 10 months ago

mpirun --allow-run-as-root -n 2 python train.py --config  ./configs/yolov8/yolov8n1.yaml   --is_parallel True
[WARNING] DEVICE(3627895,ffffa0f680b0,python):2023-11-15-16:39:19.924.005 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_stream_assign.cc:1762] InsertEventCommonDependHcom] Hcom node:Default/Broadcast-op5, can't find target for insert recv op, no insert send/recv
[WARNING] DEVICE(3627895,ffffa0f680b0,python):2023-11-15-16:39:19.924.081 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_stream_assign.cc:1690] GraphLoopSync] There is no event between computing stream and hcom stream in graph 0 need insert event.
[WARNING] DEVICE(3627893,ffff9829b0b0,python):2023-11-15-16:39:20.437.552 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_stream_assign.cc:1762] InsertEventCommonDependHcom] Hcom node:Default/Broadcast-op5, can't find target for insert recv op, no insert send/recv
[WARNING] DEVICE(3627893,ffff9829b0b0,python):2023-11-15-16:39:20.437.624 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_stream_assign.cc:1690] GraphLoopSync] There is no event between computing stream and hcom stream in graph 0 need insert event.
2023-11-15 16:39:20,995 [INFO] parse_args:
2023-11-15 16:39:20,995 [INFO] task                                    detect
2023-11-15 16:39:20,995 [INFO] device_target                           Ascend
2023-11-15 16:39:20,995 [INFO] save_dir                                ./runs/2023.11.15-16.38.57
2023-11-15 16:39:20,995 [INFO] device_per_servers                      8
2023-11-15 16:39:20,995 [INFO] log_level                               INFO
2023-11-15 16:39:20,995 [INFO] is_parallel                             True
2023-11-15 16:39:20,995 [INFO] ms_mode                                 0
2023-11-15 16:39:20,995 [INFO] ms_amp_level                            O0
2023-11-15 16:39:20,995 [INFO] keep_loss_fp32                          True
2023-11-15 16:39:20,995 [INFO] ms_loss_scaler                          static
2023-11-15 16:39:20,995 [INFO] ms_loss_scaler_value                    1024.0
2023-11-15 16:39:20,995 [INFO] ms_jit                                  True
2023-11-15 16:39:20,995 [INFO] ms_enable_graph_kernel                  False
2023-11-15 16:39:20,995 [INFO] ms_datasink                             False
2023-11-15 16:39:20,995 [INFO] overflow_still_update                   True
2023-11-15 16:39:20,995 [INFO] clip_grad                               False
2023-11-15 16:39:20,995 [INFO] clip_grad_value                         10.0
2023-11-15 16:39:20,995 [INFO] ema                                     True
2023-11-15 16:39:20,995 [INFO] weight                                  
2023-11-15 16:39:20,995 [INFO] ema_weight                              
2023-11-15 16:39:20,995 [INFO] freeze                                  []
2023-11-15 16:39:20,995 [INFO] epochs                                  100
2023-11-15 16:39:20,995 [INFO] per_batch_size                          16
2023-11-15 16:39:20,995 [INFO] img_size                                640
2023-11-15 16:39:20,995 [INFO] nbs                                     64
2023-11-15 16:39:20,995 [INFO] accumulate                              1
2023-11-15 16:39:20,995 [INFO] auto_accumulate                         False
2023-11-15 16:39:20,995 [INFO] log_interval                            100
2023-11-15 16:39:20,995 [INFO] single_cls                              False
2023-11-15 16:39:20,995 [INFO] sync_bn                                 True
2023-11-15 16:39:20,995 [INFO] keep_checkpoint_max                     100
2023-11-15 16:39:20,995 [INFO] run_eval                                False
2023-11-15 16:39:20,995 [INFO] conf_thres                              0.001
2023-11-15 16:39:20,995 [INFO] iou_thres                               0.7
2023-11-15 16:39:20,995 [INFO] conf_free                               True
2023-11-15 16:39:20,995 [INFO] rect                                    False
2023-11-15 16:39:20,995 [INFO] nms_time_limit                          20.0
2023-11-15 16:39:20,995 [INFO] recompute                               False
2023-11-15 16:39:20,995 [INFO] recompute_layers                        0
2023-11-15 16:39:20,995 [INFO] seed                                    2
2023-11-15 16:39:20,995 [INFO] summary                                 True
2023-11-15 16:39:20,995 [INFO] profiler                                False
2023-11-15 16:39:20,995 [INFO] profiler_step_num                       1
2023-11-15 16:39:20,995 [INFO] opencv_threads_num                      0
2023-11-15 16:39:20,995 [INFO] strict_load                             True
2023-11-15 16:39:20,995 [INFO] enable_modelarts                        False
2023-11-15 16:39:20,995 [INFO] data_url                                
2023-11-15 16:39:20,995 [INFO] ckpt_url                                
2023-11-15 16:39:20,995 [INFO] multi_data_url                          
2023-11-15 16:39:20,995 [INFO] pretrain_url                            
2023-11-15 16:39:20,995 [INFO] train_url                               
2023-11-15 16:39:20,995 [INFO] data_dir                                /cache/data/
2023-11-15 16:39:20,995 [INFO] ckpt_dir                                /cache/pretrain_ckpt/
2023-11-15 16:39:20,995 [INFO] data.dataset_name                       gesture
2023-11-15 16:39:20,995 [INFO] data.train_set                          /home/ma-user/work/loong/yolo/Rock_paper_scissor_test/train/images
2023-11-15 16:39:20,995 [INFO] data.val_set                            /home/ma-user/work/loong/yolo/Rock_paper_scissor_test/valid/images
2023-11-15 16:39:20,995 [INFO] data.test_set                           /home/ma-user/work/loong/yolo/Rock_paper_scissor_test/test/images
2023-11-15 16:39:20,995 [INFO] data.nc                                 3
2023-11-15 16:39:20,995 [INFO] data.names                              ['Paper', 'Rock', 'Scissor']
2023-11-15 16:39:20,995 [INFO] roboflow.workspace                      sambhavs-vision
2023-11-15 16:39:20,995 [INFO] roboflow.project                        rock-paper-scissor-odf1i
2023-11-15 16:39:20,995 [INFO] roboflow.version                        2
2023-11-15 16:39:20,995 [INFO] roboflow.license                        CC BY 4.0
2023-11-15 16:39:20,995 [INFO] roboflow.url                            https://universe.roboflow.com/sambhavs-vision/rock-paper-scissor-odf1i/dataset/2
2023-11-15 16:39:20,995 [INFO] data.num_parallel_workers               4
2023-11-15 16:39:20,995 [INFO] train_transforms.stage_epochs           [90, 10]
2023-11-15 16:39:20,995 [INFO] train_transforms.trans_list             [[{'func_name': 'mosaic', 'prob': 1.0}, {'func_name': 'resample_segments'}, {'func_name': 'random_perspective', 'prob': 1.0, 'degrees': 0.0, 'translate': 0.1, 'scale': 0.5, 'shear': 0.0}, {'func_name': 'albumentations'}, {'func_name': 'hsv_augment', 'prob': 1.0, 'hgain': 0.015, 'sgain': 0.7, 'vgain': 0.4}, {'func_name': 'fliplr', 'prob': 0.5}, {'func_name': 'label_norm', 'xyxy2xywh_': True}, {'func_name': 'label_pad', 'padding_size': 160, 'padding_value': -1}, {'func_name': 'image_norm', 'scale': 255.0}, {'func_name': 'image_transpose', 'bgr2rgb': True, 'hwc2chw': True}], [{'func_name': 'letterbox', 'scaleup': True}, {'func_name': 'resample_segments'}, {'func_name': 'random_perspective', 'prob': 1.0, 'degrees': 0.0, 'translate': 0.1, 'scale': 0.5, 'shear': 0.0}, {'func_name': 'albumentations'}, {'func_name': 'hsv_augment', 'prob': 1.0, 'hgain': 0.015, 'sgain': 0.7, 'vgain': 0.4}, {'func_name': 'fliplr', 'prob': 0.5}, {'func_name': 'label_norm', 'xyxy2xywh_': True}, {'func_name': 'label_pad', 'padding_size': 160, 'padding_value': -1}, {'func_name': 'image_norm', 'scale': 255.0}, {'func_name': 'image_transpose', 'bgr2rgb': True, 'hwc2chw': True}]]
2023-11-15 16:39:20,995 [INFO] data.test_transforms                    [{'func_name': 'letterbox', 'scaleup': False, 'only_image': True}, {'func_name': 'image_norm', 'scale': 255.0}, {'func_name': 'image_transpose', 'bgr2rgb': True, 'hwc2chw': True}]
2023-11-15 16:39:20,995 [INFO] optimizer.optimizer                     momentum
2023-11-15 16:39:20,995 [INFO] optimizer.lr_init                       0.01
2023-11-15 16:39:20,995 [INFO] optimizer.momentum                      0.937
2023-11-15 16:39:20,995 [INFO] optimizer.nesterov                      True
2023-11-15 16:39:20,995 [INFO] optimizer.loss_scale                    1.0
2023-11-15 16:39:20,995 [INFO] optimizer.warmup_epochs                 3
2023-11-15 16:39:20,995 [INFO] optimizer.warmup_momentum               0.8
2023-11-15 16:39:20,995 [INFO] optimizer.warmup_bias_lr                0.1
2023-11-15 16:39:20,995 [INFO] optimizer.min_warmup_step               1000
2023-11-15 16:39:20,995 [INFO] optimizer.group_param                   yolov8
2023-11-15 16:39:20,995 [INFO] optimizer.gp_weight_decay               0.0005
2023-11-15 16:39:20,995 [INFO] optimizer.start_factor                  1.0
2023-11-15 16:39:20,995 [INFO] optimizer.end_factor                    0.01
2023-11-15 16:39:20,995 [INFO] optimizer.epochs                        100
2023-11-15 16:39:20,995 [INFO] optimizer.nbs                           64
2023-11-15 16:39:20,995 [INFO] optimizer.accumulate                    1
2023-11-15 16:39:20,995 [INFO] optimizer.total_batch_size              32
2023-11-15 16:39:20,995 [INFO] loss.name                               YOLOv8Loss
2023-11-15 16:39:20,995 [INFO] loss.box                                7.5
2023-11-15 16:39:20,995 [INFO] loss.cls                                0.5
2023-11-15 16:39:20,995 [INFO] loss.dfl                                1.5
2023-11-15 16:39:20,995 [INFO] loss.reg_max                            16
2023-11-15 16:39:20,995 [INFO] network.model_name                      yolov8
2023-11-15 16:39:20,995 [INFO] network.reg_max                         16
2023-11-15 16:39:20,995 [INFO] network.stride                          [8, 16, 32]
2023-11-15 16:39:20,995 [INFO] network.backbone                        [[-1, 1, 'ConvNormAct', [64, 3, 2]], [-1, 1, 'ConvNormAct', [128, 3, 2]], [-1, 3, 'C2f', [128, True]], [-1, 1, 'ConvNormAct', [256, 3, 2]], [-1, 6, 'C2f', [256, True]], [-1, 1, 'ConvNormAct', [512, 3, 2]], [-1, 6, 'C2f', [512, True]], [-1, 1, 'ConvNormAct', [1024, 3, 2]], [-1, 3, 'C2f', [1024, True]], [-1, 1, 'SPPF', [1024, 5]]]
2023-11-15 16:39:20,995 [INFO] network.head                            [[-1, 1, 'Upsample', ['None', 2, 'nearest']], [[-1, 6], 1, 'Concat', [1]], [-1, 3, 'C2f', [512]], [-1, 1, 'Upsample', ['None', 2, 'nearest']], [[-1, 4], 1, 'Concat', [1]], [-1, 3, 'C2f', [256]], [-1, 1, 'ConvNormAct', [256, 3, 2]], [[-1, 12], 1, 'Concat', [1]], [-1, 3, 'C2f', [512]], [-1, 1, 'ConvNormAct', [512, 3, 2]], [[-1, 9], 1, 'Concat', [1]], [-1, 3, 'C2f', [1024]], [[15, 18, 21], 1, 'YOLOv8Head', ['nc', 'reg_max', 'stride']]]
2023-11-15 16:39:20,995 [INFO] network.depth_multiple                  0.33
2023-11-15 16:39:20,995 [INFO] network.width_multiple                  0.25
2023-11-15 16:39:20,995 [INFO] network.max_channels                    1024
2023-11-15 16:39:20,995 [INFO] config                                  ./configs/yolov8/yolov8n1.yaml
2023-11-15 16:39:20,995 [INFO] rank                                    0
2023-11-15 16:39:20,995 [INFO] rank_size                               2
2023-11-15 16:39:20,995 [INFO] total_batch_size                        32
2023-11-15 16:39:20,995 [INFO] callback                                []
2023-11-15 16:39:20,995 [INFO] 
2023-11-15 16:39:20,998 [INFO] Please check the above information for the configurations
2023-11-15 16:39:21,000 [INFO] Parse model with Sync BN.
2023-11-15 16:39:28,676 [WARNING] Parse Model, args: nearest, keep str type
2023-11-15 16:39:29,865 [WARNING] Parse Model, args: nearest, keep str type
2023-11-15 16:39:38,316 [INFO] number of network params, total: 3.021836M, trainable: 3.011417M
2023-11-15 16:40:00,928 [WARNING] Parse Model, args: nearest, keep str type
2023-11-15 16:40:02,104 [WARNING] Parse Model, args: nearest, keep str type
2023-11-15 16:40:10,514 [INFO] number of network params, total: 3.021836M, trainable: 3.011417M
2023-11-15 16:40:26,499 [INFO] ema_weight not exist, default pretrain weight is currently used.
2023-11-15 16:40:26,550 [INFO] Dataset Cache file hash/version check success.
2023-11-15 16:40:26,551 [INFO] Load dataset cache from [/home/ma-user/work/loong/yolo/Rock_paper_scissor_test/train/labels.cache.npy] success.
Scanning '/home/ma-user/work/loong/yolo/Rock_paper_scissor_test/train/labels.cache.npy' images and labels... 1318 found, 0 missing, 
2023-11-15 16:40:26,555 [INFO] Dataloader num parallel workers: [4]
2023-11-15 16:40:26,615 [INFO] Dataset Cache file hash/version check success.
2023-11-15 16:40:26,615 [INFO] Load dataset cache from [/home/ma-user/work/loong/yolo/Rock_paper_scissor_test/train/labels.cache.npy] success.
Scanning '/home/ma-user/work/loong/yolo/Rock_paper_scissor_test/train/labels.cache.npy' images and labels... 1318 found, 0 missing, 
2023-11-15 16:40:26,618 [INFO] Dataloader num parallel workers: [4]
Scanning '/home/ma-user/work/loong/yolo/Rock_paper_scissor_test/train/labels.cache.npy' images and labels... 1318 found, 0 missing, 
Scanning '/home/ma-user/work/loong/yolo/Rock_paper_scissor_test/train/labels.cache.npy' images and labels... 1318 found, 0 missing, 
2023-11-15 16:40:32,300 [INFO] Registry(name=callback, total=4)
2023-11-15 16:40:32,300 [INFO]   (0): YoloxSwitchTrain in mindyolo/utils/callback.py
2023-11-15 16:40:32,300 [INFO]   (1): EvalWhileTrain in mindyolo/utils/callback.py
2023-11-15 16:40:32,300 [INFO]   (2): SummaryCallback in mindyolo/utils/callback.py
2023-11-15 16:40:32,300 [INFO]   (3): ProfilerCallback in mindyolo/utils/callback.py
2023-11-15 16:40:32,300 [INFO] 
2023-11-15 16:40:33,442 [INFO] got 1 active callback as follows:
2023-11-15 16:40:33,443 [INFO] SummaryCallback()
2023-11-15 16:40:33,443 [WARNING] log interval should be less than total steps of one epoch, but got 100 > 41, set log_interval as steps_per_epoch 41
2023-11-15 16:40:33,443 [WARNING] The first epoch will be compiled for the graph, which may take a long time; You can come back later :).
albumentations: Blur(p=0.01, blur_limit=(3, 7)), MedianBlur(p=0.01, blur_limit=(3, 7)), ToGray(p=0.01), CLAHE(p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
[INFO] albumentations load success
albumentations: Blur(p=0.01, blur_limit=(3, 7)), MedianBlur(p=0.01, blur_limit=(3, 7)), ToGray(p=0.01), CLAHE(p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
[INFO] albumentations load success
albumentations: Blur(p=0.01, blur_limit=(3, 7)), MedianBlur(p=0.01, blur_limit=(3, 7)), ToGray(p=0.01), CLAHE(p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
[INFO] albumentations load success
albumentations: Blur(p=0.01, blur_limit=(3, 7)), MedianBlur(p=0.01, blur_limit=(3, 7)), ToGray(p=0.01), CLAHE(p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
[INFO] albumentations load success
albumentations: Blur(p=0.01, blur_limit=(3, 7)), MedianBlur(p=0.01, blur_limit=(3, 7)), ToGray(p=0.01), CLAHE(p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
[INFO] albumentations load success
albumentations: Blur(p=0.01, blur_limit=(3, 7)), MedianBlur(p=0.01, blur_limit=(3, 7)), ToGray(p=0.01), CLAHE(p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
[INFO] albumentations load success
albumentations: Blur(p=0.01, blur_limit=(3, 7)), MedianBlur(p=0.01, blur_limit=(3, 7)), ToGray(p=0.01), CLAHE(p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
[INFO] albumentations load success
albumentations: Blur(p=0.01, blur_limit=(3, 7)), MedianBlur(p=0.01, blur_limit=(3, 7)), ToGray(p=0.01), CLAHE(p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
[INFO] albumentations load success
albumentations: Blur(p=0.01, blur_limit=(3, 7)), MedianBlur(p=0.01, blur_limit=(3, 7)), ToGray(p=0.01), CLAHE(p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
[INFO] albumentations load success
albumentations: Blur(p=0.01, blur_limit=(3, 7)), MedianBlur(p=0.01, blur_limit=(3, 7)), ToGray(p=0.01), CLAHE(p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
[INFO] albumentations load success
albumentations: Blur(p=0.01, blur_limit=(3, 7)), MedianBlur(p=0.01, blur_limit=(3, 7)), ToGray(p=0.01), CLAHE(p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
[INFO] albumentations load success
albumentations: Blur(p=0.01, blur_limit=(3, 7)), MedianBlur(p=0.01, blur_limit=(3, 7)), ToGray(p=0.01), CLAHE(p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
[INFO] albumentations load success
albumentations: Blur(p=0.01, blur_limit=(3, 7)), MedianBlur(p=0.01, blur_limit=(3, 7)), ToGray(p=0.01), CLAHE(p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
[INFO] albumentations load success
albumentations: Blur(p=0.01, blur_limit=(3, 7)), MedianBlur(p=0.01, blur_limit=(3, 7)), ToGray(p=0.01), CLAHE(p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
[INFO] albumentations load success
albumentations: Blur(p=0.01, blur_limit=(3, 7)), MedianBlur(p=0.01, blur_limit=(3, 7)), ToGray(p=0.01), CLAHE(p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
[INFO] albumentations load success
albumentations: Blur(p=0.01, blur_limit=(3, 7)), MedianBlur(p=0.01, blur_limit=(3, 7)), ToGray(p=0.01), CLAHE(p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
[INFO] albumentations load success

[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3627893] *** Process received signal ***
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3627893] Signal: Segmentation fault (11)
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3627893] Signal code: Address not mapped (1)
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3627893] Failing at address: 0xb8
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3627893] [ 0] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xffff982a77c0]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3627893] [ 1] /usr/local/Ascend/ascend-toolkit/latest/lib64/libhcom_graph_adaptor.so(_ZN4hccl22HcomOpsKernelInfoStore19GetCommFromTaskInfoERKN2ge10GETaskInfoERl+0x40)[0xffff555fc5b4]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3627893] [ 2] /usr/local/Ascend/ascend-toolkit/latest/lib64/libhcom_graph_adaptor.so(_ZN4hccl22HcomOpsKernelInfoStore10UnloadTaskERN2ge10GETaskInfoE+0x474)[0xffff55648fa8]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3627893] [ 3] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/lib/plugin/libmindspore_ascend.so.1(+0x30fde90)[0xffff7b011e90]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3627893] [ 4] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/lib/plugin/libmindspore_ascend.so.1(_ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv+0xb4)[0xffff7aef1c04]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3627893] [ 5] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/lib/plugin/libmindspore_ascend.so.1(+0x30f466c)[0xffff7b00866c]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3627893] [ 6] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/lib/plugin/libmindspore_ascend.so.1(_ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv+0xb4)[0xffff7aef1c04]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3627893] [ 7] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/lib/plugin/libmindspore_ascend.so.1(+0x30f2d84)[0xffff7b006d84]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3627893] [ 8] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/lib/plugin/libmindspore_ascend.so.1(+0x306d0a0)[0xffff7af810a0]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3627893] [ 9] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/lib/libmindspore_backend.so(_ZN9mindspore6device20KernelRuntimeManager18ClearGraphResourceEj+0xa0)[0xffff8e362310]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3627893] [10] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/lib/libmindspore_backend.so(_ZN9mindspore7session11KernelGraphD1Ev+0xb4)[0xffff8e02caf4]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3627893] [11] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/_c_expression.cpython-39-aarch64-linux-gnu.so(_ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv+0xb4)[0xffff93647a74]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3627893] [12] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/lib/libmindspore_backend.so(_ZN9mindspore7session14KernelGraphMgrD2Ev+0x39c)[0xffff8e05fa4c]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3627893] [13] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/_c_expression.cpython-39-aarch64-linux-gnu.so(_ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv+0xb4)[0xffff93647a74]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3627893] [14] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/_c_expression.cpython-39-aarch64-linux-gnu.so(_ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv+0xb4)[0xffff93647a74]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3627893] [15] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/_c_expression.cpython-39-aarch64-linux-gnu.so(_ZN9mindspore7compile17MindRTBackendBaseD1Ev+0x64)[0xffff93f484ec]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3627893] [16] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/_c_expression.cpython-39-aarch64-linux-gnu.so(_ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv+0xb4)[0xffff93647a74]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3627893] [17] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/_c_expression.cpython-39-aarch64-linux-gnu.so(+0x25347b8)[0xffff93e367b8]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3627893] [18] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/_c_expression.cpython-39-aarch64-linux-gnu.so(_ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv+0xb4)[0xffff93647a74]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3627893] [19] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/_c_expression.cpython-39-aarch64-linux-gnu.so(+0x250a78c)[0xffff93e0c78c]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3627893] [20] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/_c_expression.cpython-39-aarch64-linux-gnu.so(+0x250aca0)[0xffff93e0cca0]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3627893] [21] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/_c_expression.cpython-39-aarch64-linux-gnu.so(+0x2520dc4)[0xffff93e22dc4]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3627893] [22] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/_c_expression.cpython-39-aarch64-linux-gnu.so(+0x2500b48)[0xffff93e02b48]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3627893] [23] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/_c_expression.cpython-39-aarch64-linux-gnu.so(+0x1d5f340)[0xffff93661340]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3627893] [24] /home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/_c_expression.cpython-39-aarch64-linux-gnu.so(+0x1d5bb5c)[0xffff9365db5c]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3627893] [25] python(+0x20e3bc)[0xaaaae50f73bc]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3627893] [26] python(_PyObject_MakeTpCall+0xa0)[0xaaaae4f66cb0]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3627893] [27] python(+0x1f8e18)[0xaaaae50e1e18]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3627893] [28] python(_PyEval_EvalFrameDefault+0x5ec0)[0xaaaae4f55f30]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3627893] [29] python(+0x1134c0)[0xaaaae4ffc4c0]
[notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60:3627893] *** End of error message ***

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
[WARNING] ME(3633368:281473346101424,WriterPool-31):2023-11-15-17:02:57.691.745 [mindspore/train/summary/_writer_pool.py:192] The training process 3627893 has exited, summary process will exit.
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/utils/multiprocess_util.py", line 60, in run
    key, func, args, kwargs = self.task_q.get(timeout=TIMEOUT)
  File "<string>", line 2, in get
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/managers.py", line 810, in _callmethod
    kind, result = conn.recv()
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/connection.py", line 255, in recv
    buf = self._recv_bytes()
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes
    buf = self._recv(4)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/multiprocessing/connection.py", line 388, in _recv
    raise EOFError
EOFError
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node notebook-2a7fcf5e-9744-41c7-9c1c-5e37eeafdf60 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
(MindSpore) [ma-user mindyolo]$ echo $DEVICE_ID
0,1

'''
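
The EOFError at the end of the log is raised by `_recv` in multiprocessing/connection.py, which happens when the other end of the connection is no longer there. That fits the WARNING just above it saying training process 3627893 had exited, so the EOFError is probably a side effect of the segmentation fault (signal 11) rather than the root cause. A minimal sketch of that behaviour using only the Python standard library (the function name is made up for illustration, and this is not MindSpore or TBE code):

```python
from multiprocessing import Pipe


def recv_after_peer_closed():
    """Close one end of a pipe, then try to read from the other end."""
    reader, writer = Pipe()
    # Stand-in for the peer process dying: its end of the connection goes away.
    writer.close()
    try:
        reader.recv()
    except EOFError:
        # Same exception type as the last frame of the traceback in the log.
        return "EOFError"
    finally:
        reader.close()
    return "no error"


if __name__ == "__main__":
    print(recv_after_peer_closed())
```

If this reading is right, the frames worth looking at are the native ones before "End of error message", not the Python traceback.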
lonngxiang commented 10 months ago

@zhanghuiyao Also, I'm running MindSpore 2.0.0 on an Ascend 910B. Could this be caused by a compatibility problem?

datalee commented 5 months ago

Any update? What was the problem?

Ash-Lee233 commented 1 month ago

I suggest first running python -c "import mindspore;mindspore.set_context(device_target='Ascend');mindspore.run_check()" to check whether MindSpore is installed correctly.
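
The same check can be kept as a small script, which is easier to rerun on each node. This is only a sketch: `check_mindspore` is a hypothetical helper name, and it wraps nothing beyond the `set_context` and `run_check` calls from the one-liner. It assumes `run_check()` prints its own verdict, so the return value only says whether the calls completed.

```python
import importlib


def check_mindspore(device_target="Ascend"):
    """Run MindSpore's built-in install check; return (completed, message)."""
    try:
        ms = importlib.import_module("mindspore")
    except Exception as exc:  # missing package or broken native libraries
        return False, f"mindspore could not be imported: {exc}"
    try:
        ms.set_context(device_target=device_target)
        # run_check() prints the MindSpore version and whether a small
        # computation succeeded on the selected device.
        ms.run_check()
    except Exception as exc:
        return False, f"check did not complete on {device_target}: {exc}"
    return True, f"run_check() completed on {device_target}; see its output above"


if __name__ == "__main__":
    completed, message = check_mindspore()
    print(message)
```

If the script reports an import or device error here, the problem is in the environment and not in the mpirun launch command.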

Ash-Lee233 commented 1 month ago

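Closing this issue for now. If you still run into the problem, please open a new issue, or reopen this one and provide the relevant information.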