Describe the bug/ 问题描述 (Mandatory / 必填)
In the current run_distribute_train_ascend.sh script, the log for device 0 (rank 0) is not saved.
Hardware Environment(Ascend/GPU/CPU) / 硬件环境:
/device ascend
Software Environment / 软件环境 (Mandatory / 必填):
-- MindSpore version (e.g., 1.7.0.Bxxx): __commit_id__ = '[sha1]:8a30fd67, [branch]:(HEAD, origin/master, origin/HEAD, master)'
-- Python version (e.g., Python 3.7.5): 3.7.5
-- OS platform and distribution (e.g., Linux Ubuntu 16.04): Ubuntu
-- GCC/Compiler version (if compiled from source): 7.3.0
Execute Mode / 执行模式 (Mandatory / 必填) (PyNative/Graph):

To Reproduce / 重现步骤 (Mandatory / 必填)
Steps to reproduce the behavior:
1. Launch distributed training with bash run_distribute_train_ascend.sh [RANK_TABLE_FILE] (script attached below).
2. Check ./train_parallel0: no train.log is generated for rank 0, because its output is not redirected to a file.
Expected behavior / 预期结果 (Mandatory / 必填)
The rank 0 (device 0) log should be saved during distributed training, like the logs of the other ranks.
Screenshots / 日志 / 截图 (Mandatory / 必填)

```bash
if [ $# != 1 ]
then
    echo "Usage: bash run_distribute_train.sh [RANK_TABLE_FILE]"
    exit 1
fi

export RANK_TABLE_FILE=$1
export DEVICE_NUM=8
export RANK_SIZE=8

if [ ! -f $1 ]
then
    echo "RANK_TABLE_FILE Does Not Exist!"
    exit 1
fi

for((i=1; i<${DEVICE_NUM}; i++))
do
    export DEVICE_ID=$i
    export RANK_ID=$i
    rm -rf ./train_parallel$i
    mkdir ./train_parallel$i
    cp ./*.py ./train_parallel$i
    cp ./*.yaml ./train_parallel$i
    cd ./train_parallel$i || exit
    echo "start training for rank $RANK_ID, device $DEVICE_ID"
    env > env.log
    python train_speaker_embeddings.py --need_generate_data=False --run_distribute=1 > train.log 2>&1 &
    cd ..
done

export DEVICE_ID=0
export RANK_ID=0
rm -rf ./train_parallel0
mkdir ./train_parallel0
cp ./*.py ./train_parallel0
cp ./*.yaml ./train_parallel0
cd ./train_parallel0 || exit
echo "start training for rank $RANK_ID, device $DEVICE_ID"
env > env.log
python train_speaker_embeddings.py --need_generate_data=False --run_distribute=1 2>&1
cd ..
```
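For reference: ranks 1-7 redirect their output with `> train.log 2>&1 &`, but the final python command for rank 0 only has `2>&1` with no file redirection, so its log is lost. A minimal sketch of a possible fix (assuming rank 0 should follow the same train.log naming convention as the other ranks) would be:

```bash
# Sketch of a possible fix: redirect rank 0 output to train.log as well,
# mirroring the redirection already used for ranks 1-7.
python train_speaker_embeddings.py --need_generate_data=False --run_distribute=1 > train.log 2>&1
cd ..
```

Alternatively, `python train_speaker_embeddings.py ... 2>&1 | tee train.log` would keep rank 0 output visible on the console while still saving it to a file.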
Additional context / 备注 (Optional / 选填) Add any other context about the problem here.