mindspore-lab / mindcv

A toolbox of vision models and algorithms based on MindSpore
https://mindspore-lab.github.io/mindcv/
Apache License 2.0
231 stars 140 forks

[shufflenetv1] [Ascend910] [GRAPH] Distributed train failed #699

Closed 787918582 closed 1 month ago

787918582 commented 1 year ago

If this is your first time, please read our contributor guidelines: https://github.com/mindspore-lab/mindcv/blob/main/CONTRIBUTING.md

Describe the bug (Mandatory)

Distributed training of shufflenet_v1_0_5 and shufflenet_v1_1_0 fails with an error.

To Reproduce (Mandatory)

Steps to reproduce the behavior:

  1. mpirun --allow-run-as-root -n 8 python train.py --config configs/shufflenetv1/shufflenet_v1_1.0_ascend.yaml --distribute True --data_dir /ImageNet_Origin/

Expected behavior (Mandatory)

Distributed training runs to completion.

Screenshots / Logs (Mandatory)

[Screenshot: shufflenetv1]

Additional context (Optional)

The error reproduces on v2.1.0, v2.2.0, and v2.2.1.

tacyi commented 8 months ago

The error also reproduces on ms2.2.10.B180.

tacyi commented 8 months ago

Training also fails on MindSpore_v2.2.10.B180:

RuntimeError: Found inconsistent format or data type! Op: Mul[@kernel_graph_2:207{[0]: ValueNode Mul, [1]: equiv_207, [2]: ValueNode Tensor(shape=[], dtype=Float32, value=0.04096)}],ame: Default/network-TrainOneStepCell/optimizer-Momentum/Mul-op1711
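For triage, it may help to pull the key fields out of the error line. The following is a minimal sketch, not part of MindCV or MindSpore: the `summarize` helper and its regular expressions are hypothetical and only assume the message layout shown in the log above. It suggests that the failing node is a `Mul` with a Float32 scalar constant inside the `optimizer-Momentum` scope, that is, in the optimizer step rather than in the ShuffleNet layers themselves.

```python
import re

# Error text copied verbatim from the log above.
ERROR = (
    "RuntimeError: Found inconsistent format or data type! "
    "Op: Mul[@kernel_graph_2:207{[0]: ValueNode Mul, [1]: equiv_207, "
    "[2]: ValueNode Tensor(shape=[], dtype=Float32, value=0.04096)}],"
    "ame: Default/network-TrainOneStepCell/optimizer-Momentum/Mul-op1711"
)


def _first(pattern: str, text: str):
    """Return the first capture group of pattern in text, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


def summarize(error: str) -> dict:
    """Hypothetical helper: extract op type, constant, and scope from the message."""
    value = _first(r"value=([0-9.eE+-]+)", error)
    scope = _first(r"(Default/\S+)", error)
    return {
        "op": _first(r"Op: (\w+)\[", error),       # operator type, e.g. Mul
        "dtype": _first(r"dtype=(\w+)", error),    # dtype of the constant input
        "value": float(value) if value is not None else None,
        "scope": scope,                            # full scope name of the node
        "in_optimizer": scope is not None and "/optimizer-" in scope,
    }


if __name__ == "__main__":
    for key, val in summarize(ERROR).items():
        print(f"{key}: {val}")
```

The sketch only reads the message; it does not explain why the format or data type mismatch occurs.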