mindspore-lab / mindaudio

A toolbox of audio models and algorithms based on MindSpore
Apache License 2.0

[ecapa-tdnn] [Ascend] The code of distributed script need to modify #175

Closed 787918582 closed 3 months ago

787918582 commented 1 year ago

If this is your first time, please read our contributor guidelines: https://github.com/mindspore-lab/mindcv/blob/main/CONTRIBUTING.md

Describe the bug / 问题描述 (Mandatory / 必填)
In the current run_distribute_train_ascend.sh script, the log of device 0 (rank 0) is not saved.

To Reproduce / 重现步骤 (Mandatory / 必填) Steps to reproduce the behavior:

  1. bash run_distribute_train_ascend.sh /data3/zl/Mindlab_data/dataset/hccl_8p.json

Expected behavior / 预期结果 (Mandatory / 必填)
The log of device 0 should be saved during distributed training.

Screenshots / 日志 / 截图 (Mandatory / 必填)

```shell
if [ $# != 1 ]
then
    echo "Usage: bash run_distribute_train.sh [RANK_TABLE_FILE]"
    exit 1
fi

export RANK_TABLE_FILE=$1
export DEVICE_NUM=8
export RANK_SIZE=8

if [ ! -f $1 ]
then
    echo "RANK_TABLE_FILE Does Not Exist!"
    exit 1
fi

for((i=1; i<${DEVICE_NUM}; i++))
do
    export DEVICE_ID=$i
    export RANK_ID=$i
    rm -rf ./train_parallel$i
    mkdir ./train_parallel$i
    cp ./*.py ./train_parallel$i
    cp ./*.yaml ./train_parallel$i
    cd ./train_parallel$i || exit
    echo "start training for rank $RANK_ID, device $DEVICE_ID"
    env > env.log
    python train_speaker_embeddings.py --need_generate_data=False --run_distribute=1 > train.log 2>&1 &
    cd ..
done

export DEVICE_ID=0
export RANK_ID=0
rm -rf ./train_parallel0
mkdir ./train_parallel0
cp ./*.py ./train_parallel0
cp ./*.yaml ./train_parallel0
cd ./train_parallel0 || exit
echo "start training for rank $RANK_ID, device $DEVICE_ID"
env > env.log
python train_speaker_embeddings.py --need_generate_data=False --run_distribute=1 2>&1
cd ..
```

Note that the final `python` invocation for rank 0 merges stderr into stdout (`2>&1`) but never redirects either into `train.log`, so rank 0's log is not saved, unlike ranks 1 to 7.
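A likely fix (an assumption, not a confirmed patch from the maintainers) is to give rank 0 the same `> train.log 2>&1` redirection the other ranks already use. The sketch below demonstrates the redirection pattern with a stand-in function, since `train_speaker_embeddings.py` is not available outside the repository:

```shell
#!/bin/bash
# Sketch of the missing redirection for rank 0.
# run_training is a hypothetical stand-in for:
#   python train_speaker_embeddings.py --need_generate_data=False --run_distribute=1
run_training() {
    echo "start training for rank 0, device 0"   # goes to stdout
    echo "warning: sample stderr message" 1>&2   # goes to stderr
}

# "> train.log 2>&1" sends stdout to train.log, then duplicates
# stderr onto the same file descriptor, so both streams are saved.
run_training > train.log 2>&1
```

With this change, rank 0 writes its output to `train_parallel0/train.log`, matching the per-rank logs produced inside the `for` loop.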

Additional context / 备注 (Optional / 选填) Add any other context about the problem here.

vigo999 commented 1 year ago

please check @LiTingyu1997