mindspore-lab / mindcv

A toolbox of vision models and algorithms based on MindSpore
https://mindspore-lab.github.io/mindcv/
Apache License 2.0
235 stars, 143 forks

[nasnet_a_4x1056] [Ascend910] [GRAPH] Distributed train failed #717

Closed. 787918582 closed this issue 3 months ago

787918582 commented 1 year ago

If this is your first time, please read our contributor guidelines: https://github.com/mindspore-lab/mindcv/blob/main/CONTRIBUTING.md

Describe the bug (Mandatory): Distributed training fails with an error.

To Reproduce (Mandatory) Steps to reproduce the behavior:

  1. mpirun --allow-run-as-root -n 8 python train.py --config configs/nasnet/nasnet_a_4x1056_ascend.yaml --distribute True --data_dir /ImageNet_Origin/

Expected behavior (Mandatory): Distributed training starts normally.

Screenshots / Logs (Mandatory): [image]

Additional context (Optional): Add any other context about the problem here.

tacyi commented 9 months ago

With MindSpore_v2.2.10.B180, the training accuracy is abnormal.