microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] DeepSpeed ZeRO-3 error when loading a pretrained model with transformers.LlamaForCausalLM.from_pretrained #5250

Open Yang-bug-star opened 8 months ago

Yang-bug-star commented 8 months ago

Describe the bug

When I try to pretrain LLaVA, which uses Llama-2-7b-chat as its language model, with DeepSpeed ZeRO-3 on 2 V100 GPUs on a single node, the rank 0 and rank 1 processes both try to load '../llama/hug_llama_2_7b_chat/pytorch_model-00001-of-00003.bin' simultaneously, and the run eventually fails with:

"OSError: Unable to load weights from pytorch checkpoint file for '../llama/hug_llama_2_7b_chat/pytorch_model-00001-of-00003.bin' at '../llama/hug_llama_2_7b_chat/pytorch_model-00001-of-00003.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True."

However, training with DeepSpeed ZeRO-2 works successfully. How can I fix this error with ZeRO-3?

To Reproduce
Steps to reproduce the behavior:

  1. ZeRO-2 config file:

     ```json
     {
       "fp16": {
         "enabled": "auto",
         "loss_scale": 0,
         "loss_scale_window": 1000,
         "initial_scale_power": 16,
         "hysteresis": 2,
         "min_loss_scale": 1
       },
       "bf16": {
         "enabled": "auto"
       },
       "train_micro_batch_size_per_gpu": "auto",
       "train_batch_size": "auto",
       "gradient_accumulation_steps": "auto",
       "zero_optimization": {
         "stage": 2,
         "overlap_comm": true,
         "contiguous_gradients": true,
         "sub_group_size": 1e9,
         "reduce_bucket_size": "auto"
       }
     }
     ```
  2. ZeRO-3 config file:

     ```json
     {
       "fp16": {
         "enabled": "auto",
         "loss_scale": 0,
         "loss_scale_window": 1000,
         "initial_scale_power": 16,
         "hysteresis": 2,
         "min_loss_scale": 1
       },
       "bf16": {
         "enabled": "auto"
       },
       "train_micro_batch_size_per_gpu": "auto",
       "train_batch_size": "auto",
       "gradient_accumulation_steps": "auto",
       "zero_optimization": {
         "stage": 3,
         "overlap_comm": true,
         "contiguous_gradients": true,
         "sub_group_size": 1e9,
         "reduce_bucket_size": "auto",
         "stage3_prefetch_bucket_size": "auto",
         "stage3_param_persistence_threshold": "auto",
         "stage3_max_live_parameters": 1e9,
         "stage3_max_reuse_distance": 1e9,
         "stage3_gather_16bit_weights_on_model_save": true
       }
     }
     ```
  3. Call transformers.LlamaForCausalLM.from_pretrained to load the sharded Hugging Face-format weights of Llama-2-7b-chat (a minimal sketch of this step follows the list).
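For context, here is a minimal sketch of step 3 under ZeRO-3 using transformers' DeepSpeed integration. The checkpoint path is the one from the traceback above; the concrete config values are illustrative assumptions, since the "auto" placeholders are only resolved automatically inside the HF Trainer:

```python
# Minimal sketch, assuming the checkpoint directory from the traceback above.
import transformers
from transformers.integrations import HfDeepSpeedConfig

ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # "auto" must be replaced by hand here
    "zero_optimization": {"stage": 3},
    # ... remaining ZeRO-3 settings from the config above
}

# Keeping this object alive *before* calling from_pretrained makes transformers
# build the model inside deepspeed.zero.Init, so each rank holds only its
# partition of the parameters instead of materializing the full model.
dschf = HfDeepSpeedConfig(ds_config)  # must stay referenced

model = transformers.LlamaForCausalLM.from_pretrained(
    "../llama/hug_llama_2_7b_chat"
)
```

In an actual run this would be launched with the deepspeed launcher so that the distributed process group is initialized before loading.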

Screenshots: [two screenshots of the error traceback]

System info (please complete the following information):

njucckevin commented 8 months ago

I have the same problem. Have you solved it?

njucckevin commented 8 months ago

I solved this problem by updating torch to 2.0.1 with CUDA 11.8. It seems that ZeRO-3 requires a newer torch version.
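A quick way to check whether a local environment matches the combination reported above (torch 2.0.1 / CUDA 11.8 is just the setup reported to work here, not a documented minimum):

```python
# Print the installed torch build to compare against the reported fix.
import torch

print(torch.__version__)          # reported working: 2.0.1
print(torch.version.cuda)         # reported working: 11.8
print(torch.cuda.is_available())  # should be True on the V100 node
```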