mindspore-lab / mindaudio

A toolbox of audio models and algorithms based on MindSpore
Apache License 2.0

[LibriSpeech-deepspeech2] [Ascend] [PYNATIVE] Distributed train failed #104

Closed 787918582 closed 1 year ago

787918582 commented 1 year ago

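When running PyNative (dynamic graph) distributed training of deepspeech2 on the LibriSpeech dataset on Ascend, error messages appear at startup, but training then proceeds normally and log output is printed.

Version info: mindspore: 1.9.0, mindaudio: 2023.03.16

Screenshots: deepspeech2_1, deepspeech2_2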

787918582 commented 1 year ago

Regression test date: 2023.03.29
Regression test version: mindspore-1.9.0, mindaudio-20230329
Regression test steps:
mpirun --allow-run-as-root -n 8 python train.py --config ./hparams/DeepSpeech2.yaml --device_target Ascend --is_distributed --mode 1
Observed result:
epoch: 1 step: 18, loss is 869.990478515625
epoch: 1 step: 18, loss is 911.465576171875
epoch: 1 step: 18, loss is 1010.1331176757812
epoch: 1 step: 18, loss is 1261.926025390625
epoch: 1 step: 18, loss is 847.7034912109375
epoch: 1 step: 18, loss is 1045.37158203125
epoch: 1 step: 18, loss is 870.4517211914062
Conclusion: regression test passed
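For anyone repeating this check, the per-process loss lines can be verified with a small script instead of by eye. The following is only a minimal sketch and was not part of the regression procedure above: the helper names `parse_loss_records` and `summarize` are hypothetical, and it assumes the log format is exactly `epoch: N step: M, loss is X` as in the pasted output.

```python
import math
import re

# Matches lines such as "epoch: 1 step: 18, loss is 869.990478515625".
# The format is assumed from the output pasted in this issue.
LOG_PATTERN = re.compile(r"epoch:\s*(\d+)\s+step:\s*(\d+),\s*loss is\s*(\S+)")


def parse_loss_records(log_text):
    """Return a list of (epoch, step, loss) tuples found in the log text."""
    return [
        (int(epoch), int(step), float(loss))
        for epoch, step, loss in LOG_PATTERN.findall(log_text)
    ]


def summarize(records):
    """Group losses by (epoch, step); return {(epoch, step): (count, mean_loss)}.

    Raises ValueError if any reported loss is NaN or infinite.
    """
    grouped = {}
    for epoch, step, loss in records:
        if not math.isfinite(loss):
            raise ValueError(f"non-finite loss at epoch {epoch} step {step}")
        grouped.setdefault((epoch, step), []).append(loss)
    return {
        key: (len(values), sum(values) / len(values))
        for key, values in grouped.items()
    }


# The loss lines reported in the regression comment of this issue.
SAMPLE_LOG = """
epoch: 1 step: 18, loss is 869.990478515625
epoch: 1 step: 18, loss is 911.465576171875
epoch: 1 step: 18, loss is 1010.1331176757812
epoch: 1 step: 18, loss is 1261.926025390625
epoch: 1 step: 18, loss is 847.7034912109375
epoch: 1 step: 18, loss is 1045.37158203125
epoch: 1 step: 18, loss is 870.4517211914062
"""

if __name__ == "__main__":
    for (epoch, step), (count, mean_loss) in summarize(parse_loss_records(SAMPLE_LOG)).items():
        print(f"epoch {epoch} step {step}: {count} lines, mean loss {mean_loss:.3f}")
```

The count per step shows how many processes reported that step. The pasted output contains seven lines for a run launched with `-n 8`, so the count alone should not be read as a pass or fail criterion.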