mindspore-lab / mindaudio

A toolbox of audio models and algorithms based on MindSpore
Apache License 2.0

[LibriSpeech-deepspeech2] [GPU] [GRAPH] [PYNATIVE] Training fails because host memory is exhausted during training. #92

Closed: 787918582 closed this issue 1 year ago

787918582 commented 1 year ago

When training DeepSpeech2 on GPU, memory consumption grows until host memory is exhausted and the process is killed. Reducing the batch size to 1 does not resolve the issue.

Steps to reproduce:
python train.py --config ./hparams/DeepSpeech2.yaml --device_target GPU --device_id 0
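If the host-memory growth comes from the mindspore.dataset input pipeline rather than the model itself, one thing worth trying while debugging is lowering the pipeline's global prefetch depth and worker count before the dataset is built. This is only a minimal sketch of a possible mitigation, not a verified fix for this issue; the specific values are assumptions.

```python
# Minimal sketch: reduce host-memory pressure from the mindspore.dataset
# input pipeline. The values below are illustrative assumptions.
import mindspore.dataset as ds

# Buffer fewer rows per connector queue (default prefetch size is 16).
ds.config.set_prefetch_size(2)

# Use fewer parallel workers for audio loading/decoding (default is 8).
ds.config.set_num_parallel_workers(2)

# Training would then be launched as usual, e.g.
# python train.py --config ./hparams/DeepSpeech2.yaml --device_target GPU --device_id 0
```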

787918582 commented 1 year ago

Verified: the failure was caused by a problem with the dataset. After updating the dataset, training now runs normally on GPU.

Regression date: 2023.03.29
Regression versions: mindspore 1.9.0, mindaudio 20230328

Regression steps:
1. python train.py --config ./hparams/DeepSpeech2.yaml --device_target GPU --mode 0
2. python train.py --config ./hparams/DeepSpeech2.yaml --device_target GPU --mode 1
3. mpirun --allow-run-as-root -n 8 python train.py --config ./hparams/DeepSpeech2.yaml --device_target GPU --is_distributed --mode 0
4. mpirun --allow-run-as-root -n 8 python train.py --config ./hparams/DeepSpeech2.yaml --device_target GPU --is_distributed --mode 1

Regression output:
epoch: 1 step: 35, loss is 514.43310546875
epoch: 1 step: 35, loss is 541.4406127929688
epoch: 1 step: 35, loss is 557.6650390625
epoch: 1 step: 35, loss is 292.83624267578125
epoch: 1 step: 35, loss is 522.20654296875
epoch: 1 step: 35, loss is 521.9188842773438
epoch: 1 step: 36, loss is 438.53326416015625
epoch: 1 step: 36, loss is 378.19488525390625
epoch: 1 step: 36, loss is 435.86273193359375
epoch: 1 step: 36, loss is 570.6893310546875
epoch: 1 step: 36, loss is 491.56768798828125
epoch: 1 step: 36, loss is 501.5151672363281
epoch: 1 step: 36, loss is 462.94720458984375
epoch: 1 step: 36, loss is 473.602294921875

Regression conclusion: passed
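Since the root cause was a bad local copy of the dataset, a quick sanity check that every prepared LibriSpeech audio file is readable can help before re-running training. This is a minimal sketch, assuming the data sits under a local LibriSpeech directory in .flac format and that the soundfile package is installed; the path and extension are assumptions, not taken from this issue.

```python
# Minimal sketch: verify every audio file in a local LibriSpeech copy opens
# and has a nonzero length before launching training.
# The root path and the .flac extension are assumptions.
from pathlib import Path
import soundfile as sf

root = Path("/data/LibriSpeech")  # hypothetical dataset location
total = 0
bad = []
for path in root.rglob("*.flac"):
    total += 1
    try:
        info = sf.info(str(path))
        if info.frames == 0:
            bad.append((path, "zero-length"))
    except RuntimeError as err:  # soundfile raises RuntimeError for unreadable files
        bad.append((path, str(err)))

print(f"checked {total} files under {root}, {len(bad)} problematic")
for path, reason in bad:
    print(path, reason)
```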