microsoft / DeepSpeedExamples

Example models using DeepSpeed
Apache License 2.0

Running Chinese example encountered problem #465

Closed. TommyTunaToro closed this issue 1 year ago.

TommyTunaToro commented 1 year ago

Hey, I ran into trouble when running bash training_scripts/other_language/run_chinese.sh; I have attached the logs below. I'm renting an A40 in the cloud, using CUDA 11.8 in a conda env. The training got stuck at the following step and does not seem to be going anywhere.

[2023-05-01 14:29:21,743] [WARNING] [runner.py:190:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-05-01 14:29:21,758] [INFO] [runner.py:540:main] cmd = /root/miniconda3/envs/DS/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --data_path wangrui6/Zhihu-KOL Cohere/miracl-zh-queries-22-12 Hello-SimpleAI/HC3-Chinese mkqa-Chinese --data_split 10,0,0 --model_name_or_path bigscience/bloom-1b1 --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --max_seq_len 512 --learning_rate 9.65e-6 --weight_decay 0. --num_train_epochs 16 --gradient_accumulation_steps 1 --lr_scheduler_type cosine --num_warmup_steps 0 --seed 1234 --zero_stage 2 --deepspeed --output_dir ./output
[2023-05-01 14:29:24,225] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0]}
[2023-05-01 14:29:24,225] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-05-01 14:29:24,225] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-05-01 14:29:24,225] [INFO] [launch.py:247:main] dist_world_size=1
[2023-05-01 14:29:24,225] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-05-01 14:29:27,681] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Found cached dataset parquet (/root/.cache/huggingface/datasets/wangrui6___parquet/wangrui6--Zhihu-KOL-38afbff90366253f/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 42.74it/s]
[2023-05-01 14:45:44,264] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 15240
[2023-05-01 14:45:44,265] [ERROR] [launch.py:434:sigkill_handler] ['/root/miniconda3/envs/DS/bin/python', '-u', 'main.py', '--local_rank=0', '--data_path', 'wangrui6/Zhihu-KOL', 'Cohere/miracl-zh-queries-22-12', 'Hello-SimpleAI/HC3-Chinese', 'mkqa-Chinese', '--data_split', '10,0,0', '--model_name_or_path', 'bigscience/bloom-1b1', '--per_device_train_batch_size', '8', '--per_device_eval_batch_size', '8', '--max_seq_len', '512', '--learning_rate', '9.65e-6', '--weight_decay', '0.', '--num_train_epochs', '16', '--gradient_accumulation_steps', '1', '--lr_scheduler_type', 'cosine', '--num_warmup_steps', '0', '--seed', '1234', '--zero_stage', '2', '--deepspeed', '--output_dir', './output'] exits with return code = -9
conglongli commented 1 year ago

"exits with return code = -9" probably indicates out of memory problem and this seems happened during data preparation. Could you try reducing the number of datasets: change the https://github.com/microsoft/DeepSpeedExamples/blob/9a586b1b9852fcc6005652170e12cace8c914fbb/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/other_language/run_chinese.sh#LL20C4-L20C107 to just only one dataset, for example "--data_path wangrui6/Zhihu-KOL \ ".

TommyTunaToro commented 1 year ago

Thanks conglongli, I tried that; now it is killing the program at subprocess 17883. With only one dataset it seems to get further than the previous run (subprocess 15240). I have also tried decreasing the batch size to 4, and now it shows killed at subprocess 20026. Does that mean I got further through the process? Should I keep reducing the batch size? Kindly advise, thanks! By the way, both attempts exited with code -9.

conglongli commented 1 year ago

-9 means you are probably still out of memory. Can you check how much CPU memory you have? On the other hand, wangrui6/Zhihu-KOL seems to be quite a large dataset; could you try "--data_path Cohere/miracl-zh-queries-22-12 \ "?
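A quick way to check is with generic Linux commands (these are not part of DeepSpeed-Chat, just ordinary system tools):

# show total/used/available CPU memory
free -h
# refresh the same view every 5 seconds while data preparation runs
watch -n 5 free -h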

TommyTunaToro commented 1 year ago

> -9 means you are probably still out of memory. Can you check how much CPU memory you have? On the other hand, wangrui6/Zhihu-KOL seems to be quite a large dataset; could you try "--data_path Cohere/miracl-zh-queries-22-12 \ "?

Using the dataset you suggested, I was able to train the SFT model. However, can you suggest any method that would let me train on larger datasets? BTW, could you take a look at my sh file settings? :) Thanks! I have added lora_dim, gradient_checkpointing, and num_gpus.

# DeepSpeed Team
OUTPUT=$1
ZERO_STAGE=$2
if [ "$OUTPUT" == "" ]; then
    OUTPUT=./output
fi
if [ "$ZERO_STAGE" == "" ]; then
    ZERO_STAGE=2
fi
mkdir -p $OUTPUT

# The Chinese data we found mostly only contain one response without another
# "rejected" response. Thus we only test the step 1 finetuning and use
# a data_split of 10,0,0 (keep all data for step 1).
deepspeed --num_gpus 1 main.py \
   --data_path ./MedData \
   --data_split 10,0,0 \
   --model_name_or_path bigscience/bloom-1b1 \
   --per_device_train_batch_size 2 \
   --per_device_eval_batch_size 2 \
   --max_seq_len 512 \
   --learning_rate 9.65e-6 \
   --weight_decay 0. \
   --num_train_epochs 16 \
   --gradient_accumulation_steps 1 \
   --lora_dim 128 \
   --gradient_checkpointing \
   --lr_scheduler_type cosine \
   --num_warmup_steps 0 \
   --seed 1234 \
   --zero_stage $ZERO_STAGE \
   --deepspeed \
   --output_dir $OUTPUT \
   &> $OUTPUT/training.log
conglongli commented 1 year ago

> Using the dataset you suggested, I was able to train the SFT model. However, can you suggest any method that would let me train on larger datasets?

Right now we don't have another solution, but there is a plan to add a new feature to reduce data CPU memory consumption (https://github.com/microsoft/DeepSpeedExamples/issues/450); this feature will take some time.

> BTW, could you take a look at my sh file settings? :) Thanks! I have added lora_dim, gradient_checkpointing, and num_gpus.

Sorry, but our team won't be able to provide a hyperparameter tuning guide to every individual user due to bandwidth limitations.