[Open] xiaolhu1224 opened this issue 1 year ago
@xiaolhu1224 could you share which model you are using? A minimal script to reproduce the hang would also help us tremendously in the debugging process. Thanks!
Thanks @mrwyattii! I'm fine-tuning LLaMA 7B with the following command:
deepspeed step1_supervised_finetuning/main.py --data_path Dahoas/rm-static Dahoas/full-hh-rlhf yitingxie/rlhf-reward-datasets openai/webgpt_comparisons stanfordnlp/SHP --data_split 10,0,0 --model_name_or_path <llama_path> --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --max_seq_len 512 --learning_rate 1e-5 --weight_decay 0.1 --num_train_epochs 2 --gradient_accumulation_steps 1 --lr_scheduler_type cosine --num_warmup_steps 0 --seed 1234 --gradient_checkpointing --zero_stage 3 --deepspeed --output_dir <output_path>
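For context, here is a rough sketch of what these flags typically amount to on the DeepSpeed side. This is not the actual `step1_supervised_finetuning/main.py` code; the config keys, the fp16 setting, and the model-loading call are assumptions inferred from the command line above.

```python
import deepspeed
from transformers import AutoModelForCausalLM

# Hypothetical path; the real run passes --model_name_or_path <llama_path>.
model = AutoModelForCausalLM.from_pretrained("<llama_path>")

# Hypothetical DeepSpeed config approximating the flags above; the real
# script builds its own config internally.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,   # --per_device_train_batch_size 4
    "gradient_accumulation_steps": 1,      # --gradient_accumulation_steps 1
    "zero_optimization": {"stage": 3},     # --zero_stage 3
    "fp16": {"enabled": True},             # assumed; depends on the script's defaults
}

# Wrap the model in a DeepSpeed engine (ZeRO-3 shards parameters,
# gradients, and optimizer states across ranks).
engine, optimizer, _, lr_scheduler = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```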
@xiaolhu1224 we do not officially support LLaMA models. However, this is on our roadmap and we are actively developing this support.
I want to save an intermediate checkpoint during training after a specific number of steps, but I keep hitting a job hang. How can I get this fixed? Environment: Torch 1.14 + CUDA 12.0, Transformer Engine 0.6.
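For what it's worth, here is a minimal sketch (not the DeepSpeed-Chat code itself) of saving an intermediate checkpoint every N steps; `save_interval`, `output_dir`, and the loop structure are illustrative assumptions. One thing worth checking for the hang: `engine.save_checkpoint()` involves collective communication (especially under ZeRO-3), so every rank has to reach the call; guarding it behind a rank-0 check is a common cause of hangs.

```python
# Illustrative sketch only; variable names (save_interval, output_dir) and the
# loop are assumptions, not the actual step1_supervised_finetuning/main.py code.
save_interval = 200            # hypothetical: save every 200 steps
output_dir = "./checkpoints"   # hypothetical output path

for step, batch in enumerate(train_dataloader):
    outputs = engine(**batch)          # engine returned by deepspeed.initialize()
    engine.backward(outputs.loss)
    engine.step()

    if (step + 1) % save_interval == 0:
        # save_checkpoint must be called on ALL ranks (it is a collective
        # operation under ZeRO-3); calling it only on rank 0 will hang.
        engine.save_checkpoint(output_dir, tag=f"step_{step + 1}")
```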