TommyTunaToro closed this issue 1 year ago.
"exits with return code = -9" probably indicates out of memory problem and this seems happened during data preparation. Could you try reducing the number of datasets: change the https://github.com/microsoft/DeepSpeedExamples/blob/9a586b1b9852fcc6005652170e12cace8c914fbb/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/other_language/run_chinese.sh#LL20C4-L20C107 to just only one dataset, for example "--data_path wangrui6/Zhihu-KOL \ ".
Thanks conglongli, I tried that, and now it is killing the program at subprocess 17883. The process seems to get further with only one dataset compared to the previous subprocess 15240. I have also tried decreasing the batch size to 4, and now it shows killed at subprocess 20026. Does that mean I moved further along in the process? Should I keep reducing the batch size? Kindly advise, thanks! By the way, both methods returned with code -9.
-9 means you are probably still out of memory. Can you check how much CPU memory you have? On the other hand, wangrui6/Zhihu-KOL seems to be quite a large dataset, so could you try "--data_path Cohere/miracl-zh-queries-22-12 \ "?
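One quick way to check this on the host (these are standard Linux utilities, not part of DeepSpeed; dmesg may require root on some systems):

# Show total/available CPU (host) RAM
free -h
# Confirm whether the kernel OOM killer terminated the process (which produces the -9)
dmesg -T | grep -i -E "killed process|out of memory"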
By using the dataset you suggested, I was able to train the SFT model. However, can you suggest any method so I can train on larger datasets? BTW, could you take a look at my sh file settings? :) Thanks! I have added lora_dim, gradient_checkpointing, and num_gpus.
# DeepSpeed Team
OUTPUT=$1
ZERO_STAGE=$2
if [ "$OUTPUT" == "" ]; then
OUTPUT=./output
fi
if [ "$ZERO_STAGE" == "" ]; then
ZERO_STAGE=2
fi
mkdir -p $OUTPUT
# The Chinese data we found mostly only contain one response without another
# "rejected" response. Thus we only test the step 1 finetuning and use
# a data_split of 10,0,0 (keep all data for step 1).
deepspeed --num_gpus 1 main.py \
--data_path ./MedData \
--data_split 10,0,0 \
--model_name_or_path bigscience/bloom-1b1 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--max_seq_len 512 \
--learning_rate 9.65e-6 \
--weight_decay 0. \
--num_train_epochs 16 \
--gradient_accumulation_steps 1 \
--lora_dim 128 \
--gradient_checkpointing \
--lr_scheduler_type cosine \
--num_warmup_steps 0 \
--seed 1234 \
--zero_stage $ZERO_STAGE \
--deepspeed \
--output_dir $OUTPUT \
&> $OUTPUT/training.log
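For context, the script takes the output directory and the ZeRO stage as positional arguments, so an invocation along these lines would match the defaults above (paths are illustrative):

# First argument = output dir (defaults to ./output), second = ZeRO stage (defaults to 2)
bash training_scripts/other_language/run_chinese.sh ./output 2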
By using the dataset you suggested, I was able to train the SFT model. However, can you suggest any method so I can train on larger datasets?
Right now we don't have any other solution, but there is a plan to add a new feature to reduce data CPU memory consumption (https://github.com/microsoft/DeepSpeedExamples/issues/450); this feature will take some time.
BTW, could you take a look at my sh file settings? :) Thanks! I have added lora_dim, gradient_checkpointing, and num_gpus.
Sorry, but our team won't be able to provide hyperparameter tuning guidance to every individual user due to bandwidth limitations.
Hey, I ran into trouble when running
bash training_scripts/other_language/run_chinese.sh
I have uploaded the logs below. I'm renting an A40 on the cloud, using CUDA 11.8 in a conda env. The training got stuck at the following step and does not seem to be going anywhere.