tatsu-lab / stanford_alpaca

Code and documentation to train Stanford's Alpaca models, and generate the data.
https://crfm.stanford.edu/2023/03/13/alpaca.html
Apache License 2.0

Train using 13b llama model #96

Open dev2021-ctrl opened 1 year ago

dev2021-ctrl commented 1 year ago

This repo is awesome! Please let me know the steps to use the LLaMA 13B model to train on JSON data formatted like alpaca_data.json.

I have my own custom data and want to train on it; please let me know the steps to do so. Also, can I use Colab or Paperspace? The JSON data file is no more than 100 MB, so let me know how much GPU is required for training. It's a bit urgent.

Thanks
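
For reference, alpaca_data.json is a JSON list of records with "instruction", "input", and "output" fields, and train.py expects custom data in the same layout. Below is a minimal sketch for checking a custom file against that schema (my_data.json is a hypothetical file name):

import json

# Hypothetical file name; substitute your own dataset.
with open("my_data.json") as f:
    records = json.load(f)

# alpaca_data.json is a list of dicts with these three keys;
# "input" may be an empty string.
required = {"instruction", "input", "output"}
for i, rec in enumerate(records):
    missing = required - rec.keys()
    if missing:
        raise ValueError(f"record {i} is missing fields: {missing}")

print(f"{len(records)} records match the alpaca_data.json layout")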

raj-swype commented 1 year ago

would love to help -- rajkhare@andrew.cmu.edu

sabetAI commented 1 year ago

Hey @raj-swype, I got the model to train, but the weights aren't fully saved during checkpointing. According to the HF DeepSpeed docs, the model state is supposed to be saved under global_step*/ as optim_states.pt files, but these are missing. I'm using deepspeed==0.8.3, transformers==4.27.0.dev0, accelerate==0.18.0, and torch==2.0.0; my DeepSpeed config is:

# ZeRO-3.json
{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}

with runtime arguments similar to:

torchrun \
    --nnodes=$HOST_NUM \
    --nproc_per_node=$HOST_GPU_NUM \
    --rdzv_id=$TJ_INSTANCE_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$CHIEF_IP \
    --master_port=12345 \
train.py \
    --model_name_or_path $MODEL_PATH \
    --train_data_path $DATA \
    --bf16 True \
    --output_dir $OUTPUT_DIR \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "steps" \
    --eval_steps 2000 \
    --save_strategy "steps" \
    --save_steps 2000 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --deepspeed ./deepspeed-cfg/ZeRO-3.json
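
For reference, a quick way to see whether the optimizer and parameter shards were actually written is to list the checkpoint directories (a sketch assuming the default Trainer checkpoint-<step> layout and the $OUTPUT_DIR from the command above):

import glob
import os

# List everything DeepSpeed wrote under each Trainer checkpoint.
output_dir = os.environ.get("OUTPUT_DIR", ".")
for ckpt_dir in sorted(glob.glob(os.path.join(output_dir, "checkpoint-*"))):
    print(ckpt_dir)
    for path in sorted(glob.glob(os.path.join(ckpt_dir, "**"), recursive=True)):
        print("  ", os.path.relpath(path, ckpt_dir))
# A complete ZeRO-3 checkpoint should contain a global_step*/ folder holding
# *_optim_states.pt and *_model_states.pt shards, plus zero_to_fp32.py.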

luffycodes commented 1 year ago

If you are using https://github.com/tatsu-lab/stanford_alpaca/blob/main/train.py, then at line 218 replace trainer.save_model(output_dir=training_args.output_dir) with:

    # needs `import os` at the top of train.py if it isn't imported already
    # save a full DeepSpeed checkpoint (this also writes zero_to_fp32.py)
    checkpoint_dir = os.path.join(training_args.output_dir, "checkpoint-final")
    trainer.deepspeed.save_checkpoint(checkpoint_dir)

Then, after training finishes, checkpoint-final will contain zero_to_fp32.py. From inside that directory, just run:

    python zero_to_fp32.py . pytorch_model.bin

For more information, see: https://huggingface.co/transformers/v4.10.1/main_classes/deepspeed.html#getting-the-model-weights-out
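
DeepSpeed also ships helpers that do the same consolidation from Python instead of invoking the script by hand; a sketch, assuming the checkpoint-final directory created above:

import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

# Consolidate the ZeRO-3 shards in checkpoint-final into a single fp32 state dict,
# then save it in the usual pytorch_model.bin format.
state_dict = get_fp32_state_dict_from_zero_checkpoint("checkpoint-final")
torch.save(state_dict, "pytorch_model.bin")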

sabetAI commented 1 year ago

Works! Thanks luffycodes 🙏!