mindspore-lab / mindcv

A toolbox of vision models and algorithms based on MindSpore
https://mindspore-lab.github.io/mindcv/
Apache License 2.0
231 stars 140 forks

[rexnet] [Ascend910] [GRAPH] Distributed train failed #721

Closed 787918582 closed 1 month ago

787918582 commented 1 year ago

If this is your first time, please read our contributor guidelines: https://github.com/mindspore-lab/mindcv/blob/main/CONTRIBUTING.md

Describe the bug (Mandatory) Distributed training fails with an error for every variant of the rexnet model.

To Reproduce (Mandatory) Steps to reproduce the behavior:

  1. mpirun --allow-run-as-root -n 8 python train.py --config configs/rexnet/rexnet_x09_ascend.yaml --distribute True --data_dir /ImageNet_Origin/
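
Since the report says every rexnet variant fails, the same command can be generated for each config file rather than typed by hand. The following is a minimal sketch, not part of the original report: the helper names `build_command` and `reproduction_commands` are hypothetical, and the glob pattern `configs/rexnet/*_ascend.yaml` is an assumption about how the other rexnet configs are named. It only prints the commands; it does not launch training.

```python
import glob
import shlex


def build_command(config, data_dir, n_devices=8):
    """Return the mpirun argument list for one config, mirroring the reported command."""
    return [
        "mpirun", "--allow-run-as-root", "-n", str(n_devices),
        "python", "train.py",
        "--config", config,
        "--distribute", "True",
        "--data_dir", data_dir,
    ]


def reproduction_commands(config_glob="configs/rexnet/*_ascend.yaml",
                          data_dir="/ImageNet_Origin/"):
    """Build one shell command string per config file matched by the glob."""
    # Assumption: all rexnet Ascend configs live under configs/rexnet/ and
    # share the *_ascend.yaml suffix seen in the reported command.
    configs = sorted(glob.glob(config_glob))
    return [shlex.join(build_command(c, data_dir)) for c in configs]


if __name__ == "__main__":
    # Print the commands so they can be reviewed before running them.
    for cmd in reproduction_commands():
        print(cmd)
```

Run from the repository root, this lists one command per matching config; if the pattern matches nothing, it prints nothing.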

Expected behavior (Mandatory) Distributed training runs through normally without errors.

Screenshots / Logs (Mandatory) [two images attached]

Additional context (Optional) Add any other context about the problem here.