tatsu-lab / stanford_alpaca

Code and documentation to train Stanford's Alpaca models, and generate the data.
https://crfm.stanford.edu/2023/03/13/alpaca.html
Apache License 2.0

Unstable inference of DeepSpeed-fine-tuned Alpaca model #153

Open XinliYu opened 1 year ago

XinliYu commented 1 year ago

We fine-tuned Alpaca on a single node with torchrun, and on multiple nodes with DeepSpeed. We are following the "demo" inference parameters:

temperature=0.7 top_p=0.9 do_sample=True num_beams=1 max_new_tokens=600

We evaluate the fine-tuned models with the above inference setup multiple times on the same prompts. The single-node torchrun fine-tuned model is relatively stable. However, the multi-node DeepSpeed fine-tuned model is much less stable: its output can differ on every inference run.
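
For context, here is roughly how these parameters map onto transformers' generate() in our evaluation (a simplified sketch: the checkpoint path is a placeholder and the Alpaca prompt template is omitted):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "./train_ouput_02"  # placeholder: path to the fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16).cuda()
model.eval()

prompt = "Explain how algorithms can be used in educational institutions."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        num_beams=1,
        max_new_tokens=600,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))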

The typical erroneous behavior we observe for the DeepSpeed-fine-tuned model is that it simply repeats the prompt and then stops.

For example, the prompt is "Explain how algorithms can be used in educational institutions." Three responses from the torchrun fine-tuned model:

Algorithms are mathematical processes that can be used to solve problems and make decisions. In educational institutions, algorithms can be used in a variety of ways. For example, algorithms can be used to grade student work, to personalize learning experiences, to generate recommendation systems, and to detect cheating. Algorithms can also be used to analyze large amounts of data to identify patterns and trends in student performance.

Algorithms can be used in educational institutions to automate certain processes, such as grading tests and homework, providing personalized learning recommendations, and helping students find resources related to their coursework. Algorithms can also be used to track student progress, identify areas of difficulty, and provide feedback to students and teachers.

Algorithms can be used in educational institutions to help with the tracking and management of student records, providing automated feedback and assessment, personalizing learning experiences, and automating administrative tasks.

Three responses from the DeepSpeed-fine-tuned model. In the first and third responses, the output just repeats the prompt.

Explain how algorithms can be used in educational institutions.

Algorithms can be used in educational institutions to streamline processes and make them more efficient. For example, algorithms can be used to grade tests and assignments quickly and accur, accurately. Algorithms can also be used to match students with appropriate tutors and to match students with suitable learning materials.

Explain how algorithms can be used in educational institutions.

We have tried adjusting the inference temperature, but that does not solve the issue.
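
To rule out sampling noise entirely, one check we also run (a sketch, reusing the model/tokenizer/inputs from the snippet above) is greedy decoding, which should be deterministic:

runs = []
for _ in range(3):
    with torch.no_grad():
        ids = model.generate(**inputs, do_sample=False, num_beams=1, max_new_tokens=600)
    runs.append(tokenizer.decode(ids[0], skip_special_tokens=True))
# Greedy decoding is deterministic, so the three outputs should be identical;
# if the prompt repetition still shows up here, sampling is not the cause.
print("identical across runs:", len(set(runs)) == 1)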

Looking forward to any helpful discussion on how to make inference with the DeepSpeed fine-tuned model stable.

The following is the DeepSpeed config:

{
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "steps_per_print": 100,
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 2e-5,
      "weight_decay": 0.0
    }
  },
  "bf16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 1,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true
  },
  "wall_clock_breakdown": false,
  "zero_allow_untested_optimizer": true
}
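
For completeness, this config is consumed through the Hugging Face Trainer's DeepSpeed integration. The programmatic equivalent of the command line below is roughly the following (a sketch only, with model and data setup omitted):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./train_ouput_02",
    num_train_epochs=7,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=16,
    evaluation_strategy="no",
    save_strategy="steps",
    save_steps=2000,
    save_total_limit=1,
    learning_rate=2e-5,
    weight_decay=0.0,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=50,
    tf32=True,
    deepspeed="ds_config.json",  # hands the ZeRO stage-1 config above to the Trainer
)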

Here is the command line. We still use torchrun, but simply add a --deepspeed argument referencing the configuration above, and remove the conflicting fsdp arguments from the command in https://github.com/tatsu-lab/stanford_alpaca.

python -m torch.distributed.run --nproc_per_node=8 --nnode=2 --node_rank=0 --master_addr=xxx --master_port=9901 train.py \
    --data_path ./alpaca_data.json \
    --output_dir ./train_ouput_02 \
    --num_train_epochs 7 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 50 \
    --tf32 True \
    --deepspeed ds_config.json
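
For reference (not part of the original script), the second node is launched with the same command, changing only the rank:

python -m torch.distributed.run --nproc_per_node=8 --nnode=2 --node_rank=1 --master_addr=xxx --master_port=9901 train.py ...

with the remaining arguments identical to the node-0 command above.
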
HaishuoFang commented 1 year ago

Have you found the reason?