dev2021-ctrl opened 1 year ago
would love to help -- rajkhare@andrew.cmu.edu
Hey @raj-swype, I got the model to train, but the weights aren't fully saved during checkpointing. According to the HF DeepSpeed docs, the optimizer/model states are supposed to be saved as *_optim_states.pt and *_model_states.pt shards under a global_step*/ directory, but these are missing. I'm using deepspeed==0.8.3, transformers==4.27.0.dev0, accelerate==0.18.0, and torch==2.0.0. My DeepSpeed config is:
# ZeRO-3.json
{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}
with runtime arguments similar to:
torchrun \
    --nnodes=$HOST_NUM \
    --nproc_per_node=$HOST_GPU_NUM \
    --rdzv_id=$TJ_INSTANCE_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$CHIEF_IP \
    --master_port=12345 \
    train.py \
    --model_name_or_path $MODEL_PATH \
    --train_data_path $DATA \
    --bf16 True \
    --output_dir $OUTPUT_DIR \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "steps" \
    --eval_steps 2000 \
    --save_strategy "steps" \
    --save_steps 2000 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --deepspeed ./deepspeed-cfg/ZeRO-3.json
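For reference, a quick way to check what each save actually wrote: under ZeRO-3, a checkpoint directory should contain per-rank state shards inside a global_step*/ folder plus the zero_to_fp32.py converter script. A minimal sketch (the directory name below is an assumption based on --save_steps 2000; adjust to your output_dir):

# sanity check of a ZeRO-3 checkpoint directory; "output/checkpoint-2000"
# is an assumed path derived from --save_steps 2000
import glob
import os

ckpt = os.path.join("output", "checkpoint-2000")
print(glob.glob(os.path.join(ckpt, "global_step*", "*optim_states.pt")))  # optimizer shards, one per rank
print(glob.glob(os.path.join(ckpt, "global_step*", "*model_states.pt")))  # model state shards
print(os.path.exists(os.path.join(ckpt, "zero_to_fp32.py")))              # consolidation script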
If you are using https://github.com/tatsu-lab/stanford_alpaca/blob/main/train.py, then on line 218 replace

trainer.save_model(output_dir=training_args.output_dir)

with:
# add `import os` at the top of train.py if it isn't there already
checkpoint_dir = os.path.join(training_args.output_dir, "checkpoint-final")
trainer.deepspeed.save_checkpoint(checkpoint_dir)
Then, once training is done, checkpoint-final will contain zero_to_fp32.py. From inside that directory, just run:

python zero_to_fp32.py . pytorch_model.bin
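One caveat: save_checkpoint only writes the DeepSpeed shard files, so the converted pytorch_model.bin won't have a config.json or tokenizer files next to it for from_pretrained(). A hedged sketch of also saving those, assuming the `tokenizer` variable created earlier in train.py's train():

# also write the HF config and tokenizer next to the converted weights so
# from_pretrained() can load the directory (sketch; `tokenizer` is the
# tokenizer built earlier in train.py's train())
trainer.model.config.save_pretrained(checkpoint_dir)
tokenizer.save_pretrained(checkpoint_dir)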
For more information, see https://huggingface.co/transformers/v4.10.1/main_classes/deepspeed.html#getting-the-model-weights-out
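The same docs also describe a programmatic route, if you'd rather not run the script by hand. A minimal sketch (the path is assumed to match the checkpoint-final snippet above; note the consolidation runs on CPU and needs roughly twice the model size in RAM):

# programmatic alternative to running zero_to_fp32.py manually
import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

# consolidates the per-rank ZeRO shards into a single fp32 state dict on CPU
state_dict = get_fp32_state_dict_from_zero_checkpoint("output/checkpoint-final")
torch.save(state_dict, "output/checkpoint-final/pytorch_model.bin")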
Works! Thanks @luffycodes 🙏!
This repo is awesome! Please let me know the steps to use LLaMA 13B to train on JSON data similar to alpaca_data.json.
I have my own custom data and want to train on it; please let me know the steps to do the same. Also, can I use Colab or Paperspace? The JSON data file is no more than 100 MB, so let me know how much GPU is required for training. It's a bit urgent.
Thanks
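For reference, alpaca_data.json is a flat JSON list of instruction/input/output records, so custom data in the same shape should plug into train.py unchanged. A minimal sketch of producing such a file (contents and filename are illustrative):

# custom training data in the same shape as alpaca_data.json:
# a JSON list of {"instruction", "input", "output"} records
import json

records = [
    {
        "instruction": "Summarize the paragraph below.",
        "input": "DeepSpeed ZeRO-3 partitions parameters, gradients, and optimizer state across GPUs.",
        "output": "ZeRO-3 shards all training state across GPUs to reduce per-GPU memory.",
    },
]
with open("my_data.json", "w") as f:
    json.dump(records, f, indent=2)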