mindspore-lab / mindaudio

A toolbox of audio models and algorithms based on MindSpore
Apache License 2.0
35 stars 9 forks

[LibriSpeech-deepspeech2] [Ascend] [GRAPH] Distributed train failed #105

Closed. 787918582 closed this issue 1 year ago

787918582 commented 1 year ago

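When running static-graph (GRAPH mode) distributed training of deepspeech2 on Ascend with the librispeech dataset, error messages appear at startup. Training then proceeds normally and log output is printed.

Version info: mindspore: 1.9.0, mindaudio: 2023.03.16

deepspeech2_3 deepspeech2_4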

787918582 commented 1 year ago

Regression test date: 2023.03.29
Regression test version: mindspore-1.9.0, mindaudio-20230329
Regression test steps: mpirun --allow-run-as-root -n 8 python train.py --config ./hparams/DeepSpeech2.yaml --device_target Ascend --is_distributed --mode 0
Observed output:

epoch: 1 step: 402, loss is 482.6315612792969
epoch: 1 step: 402, loss is 418.2529296875
epoch: 1 step: 402, loss is 304.5843505859375
epoch: 1 step: 402, loss is 345.1536560058594
epoch: 1 step: 402, loss is 432.37457275390625
epoch: 1 step: 402, loss is 412.51690673828125
epoch: 1 step: 402, loss is 331.03680419921875
epoch: 1 step: 402, loss is 435.46893310546875
Train epoch time: 4353877.791 ms, per step time: 10830.542 ms [ModelZoo-compile_time:0:00:00.000135]
Train epoch time: 4354303.890 ms, per step time: 10831.602 ms [ModelZoo-compile_time:0:00:00.000128]
Train epoch time: 4354963.109 ms, per step time: 10833.242 ms [ModelZoo-compile_time:0:00:00.000121]
Train epoch time: 4355232.162 ms, per step time: 10833.911 ms [ModelZoo-compile_time:0:00:00.000123]
Train epoch time: 4355354.919 ms, per step time: 10834.216 ms [ModelZoo-compile_time:0:00:00.000129]
Train epoch time: 4355321.794 ms, per step time: 10834.134 ms [ModelZoo-compile_time:0:00:00.000111]
Train epoch time: 4355389.336 ms, per step time: 10834.302 ms [ModelZoo-compile_time:0:00:00.000138]
Train epoch time: 4355423.491 ms, per step time: 10834.387 ms [ModelZoo-compile_time:0:00:00.000116]

Regression test conclusion: passed
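
For anyone repeating this check, here is a minimal sketch of how the log lines from the regression run could be sanity-checked: it counts how many ranks reported and confirms that epoch time divided by per-step time matches the reported step count. The names `LOSS_RE`, `TIME_RE`, and `summarize` are hypothetical helpers written for this comment and are not part of mindaudio. The regular expressions assume the log format is exactly as pasted in this thread.

```python
import re
from statistics import mean

# Log lines copied from the regression run in this thread.
LOG = """\
epoch: 1 step: 402, loss is 482.6315612792969
epoch: 1 step: 402, loss is 418.2529296875
epoch: 1 step: 402, loss is 304.5843505859375
epoch: 1 step: 402, loss is 345.1536560058594
epoch: 1 step: 402, loss is 432.37457275390625
epoch: 1 step: 402, loss is 412.51690673828125
epoch: 1 step: 402, loss is 331.03680419921875
epoch: 1 step: 402, loss is 435.46893310546875
Train epoch time: 4353877.791 ms, per step time: 10830.542 ms [ModelZoo-compile_time:0:00:00.000135]
Train epoch time: 4354303.890 ms, per step time: 10831.602 ms [ModelZoo-compile_time:0:00:00.000128]
Train epoch time: 4354963.109 ms, per step time: 10833.242 ms [ModelZoo-compile_time:0:00:00.000121]
Train epoch time: 4355232.162 ms, per step time: 10833.911 ms [ModelZoo-compile_time:0:00:00.000123]
Train epoch time: 4355354.919 ms, per step time: 10834.216 ms [ModelZoo-compile_time:0:00:00.000129]
Train epoch time: 4355321.794 ms, per step time: 10834.134 ms [ModelZoo-compile_time:0:00:00.000111]
Train epoch time: 4355389.336 ms, per step time: 10834.302 ms [ModelZoo-compile_time:0:00:00.000138]
Train epoch time: 4355423.491 ms, per step time: 10834.387 ms [ModelZoo-compile_time:0:00:00.000116]
"""

# Hypothetical patterns; they assume the exact log format shown above.
LOSS_RE = re.compile(r"epoch: (\d+) step: (\d+), loss is ([\d.]+)")
TIME_RE = re.compile(r"Train epoch time: ([\d.]+) ms, per step time: ([\d.]+) ms")


def summarize(log_text):
    """Summarize per-rank loss and timing lines from a training log."""
    losses = [float(loss) for _, _, loss in LOSS_RE.findall(log_text)]
    reported_steps = {int(step) for _, step, _ in LOSS_RE.findall(log_text)}
    times = [(float(t), float(p)) for t, p in TIME_RE.findall(log_text)]
    # Steps implied by the timing lines: epoch time / per-step time.
    implied_steps = {round(t / p) for t, p in times}
    return {
        "ranks": len(losses),
        "reported_steps": reported_steps,
        "steps_per_epoch": implied_steps,
        "mean_loss": mean(losses),
        "mean_step_ms": mean(p for _, p in times),
    }


if __name__ == "__main__":
    for key, value in summarize(LOG).items():
        print(f"{key}: {value}")
```

On the log above, all eight ranks (matching `-n 8` in the `mpirun` command) report the same step, and the timing lines imply the same step count per epoch, at roughly 10.8 seconds per step.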