microsoft / DeepSpeedExamples

Example models using DeepSpeed
Apache License 2.0

Program stops when I run run_chinese.sh on 4x A100 80GB GPUs #487

Open liyuyuan6969 opened 1 year ago

liyuyuan6969 commented 1 year ago

questions

Hi all, when I run the run_chinese.sh example on A100 GPUs, the program stops partway through without reporting any error. The English example runs to completion. Is this a bug? Do you have any advice?

train.log: [screenshot not preserved]

My run_chinese.sh:

#!/bin/bash
# Copyright (c) Microsoft Corporation.
# SPDX-License-Identifier: Apache-2.0
# DeepSpeed Team
OUTPUT=$1
ZERO_STAGE=$2
if [ "$OUTPUT" == "" ]; then
    OUTPUT=./output
fi
if [ "$ZERO_STAGE" == "" ]; then
    ZERO_STAGE=3
fi
mkdir -p $OUTPUT

# The Chinese data we found mostly only contain one response without another
# "rejected" response. Thus we only test the step 1 finetuning and use
# a data_split of 10,0,0 (keep all data for step 1).
deepspeed main.py \
   --data_path wangrui6/Zhihu-KOL \
   --data_split 10,0,0 \
   --model_name_or_path facebook/opt-1.3b \
   --per_device_train_batch_size 16 \
   --per_device_eval_batch_size 16 \
   --max_seq_len 512 \
   --learning_rate 9.65e-6 \
   --weight_decay 0. \
   --num_train_epochs 16 \
   --gradient_accumulation_steps 1 \
   --lr_scheduler_type cosine \
   --num_warmup_steps 0 \
   --seed 1234 \
   --zero_stage $ZERO_STAGE \
   --deepspeed \
   --output_dir $OUTPUT \
   &> $OUTPUT/training.log
liyuyuan6969 commented 1 year ago

OK, I ran it again today and found an error message in train.log. Could it be a network problem on my server?

Message: [screenshot not preserved]

liyuyuan6969 commented 1 year ago

OK, after I fixed the connection error according to this, the program still stops at the same point as above, with no error message. Nobody has responded yet. I suspect my driver and CUDA versions may be too old, so I have decided to upgrade them.

Message: [screenshot not preserved]

nvidia-smi: [screenshot not preserved]

nvcc -V: [screenshot not preserved]
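
To rule out a version mismatch, it can help to print the driver version, the CUDA toolkit version, and the CUDA version PyTorch was built against side by side. The following is only a minimal sketch and not something from the DeepSpeedExamples repo: `parse_nvcc_release` is a hypothetical helper, and the script assumes `nvcc`, `nvidia-smi`, and a `python` with PyTorch are on the PATH (each check is skipped or reported as unavailable otherwise).

```shell
#!/bin/bash
# Sketch only: parse_nvcc_release is a hypothetical helper, not a DeepSpeed tool.

# Read `nvcc -V` output on stdin and print the toolkit release, e.g. "11.7",
# taken from the line "Cuda compilation tools, release 11.7, V11.7.64".
parse_nvcc_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# CUDA toolkit version as seen by the compiler.
if command -v nvcc >/dev/null 2>&1; then
    echo "CUDA toolkit (nvcc): $(nvcc -V | parse_nvcc_release)"
fi

# Driver version reported for the first GPU.
if command -v nvidia-smi >/dev/null 2>&1; then
    echo "Driver: $(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
fi

# CUDA version the installed PyTorch build was compiled against.
if command -v python >/dev/null 2>&1; then
    python -c "import torch; print('PyTorch built with CUDA:', torch.version.cuda)" 2>/dev/null \
        || echo "PyTorch not importable from this python"
fi
```

If the three versions look consistent, the hang is probably not caused by the driver or toolkit alone.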

liyuyuan6969 commented 1 year ago

I upgraded CUDA (to 11.7) and the driver (to 515) last night, but I still cannot run the Chinese example (run_chinese.sh), and there is still no error output, which is strange. Maybe I could debug it remotely, but that could take a lot of time.
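
Since the script redirects all output to `$OUTPUT/training.log`, a silent hang shows up only as a log file that stops growing. Below is a rough sketch of how one might detect that and then dump the Python stacks to see where the processes are stuck. It is an assumption-laden example, not a confirmed fix: `log_is_stale` and `dump_stacks_if_stale` are hypothetical helpers, `py-spy` is a third-party tool that has to be installed separately (and may need ptrace permissions), and `stat -c` assumes GNU coreutils on Linux.

```shell
#!/bin/bash
# Sketch only: helper names are hypothetical, not part of DeepSpeedExamples.

# Succeed (exit 0) when FILE has not been modified for at least MAX_AGE seconds.
log_is_stale() {
    local file=$1
    local max_age=$2
    local now mtime
    now=$(date +%s)
    mtime=$(stat -c %Y "$file") || return 2   # GNU stat; assumes Linux
    [ $(( now - mtime )) -ge "$max_age" ]
}

# If the log is stale, print the Python stack of every process whose command
# line contains "main.py" (this also matches the launcher process).
# Assumes py-spy is installed; it is not a DeepSpeed dependency.
dump_stacks_if_stale() {
    local log=$1
    local max_age=$2
    local pid
    if log_is_stale "$log" "$max_age"; then
        for pid in $(pgrep -f main.py); do
            py-spy dump --pid "$pid"
        done
    fi
}
```

For example, one could start training with `NCCL_DEBUG=INFO bash run_chinese.sh ./output 3 &` so that NCCL's own diagnostics also land in the log, and later call `dump_stacks_if_stale ./output/training.log 600`. Whether the hang is actually in NCCL communication is not established by anything in this thread.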