microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] DeepSpeed ZeRO-3 error when loading a pretrained model with transformers.LlamaForCausalLM.from_pretrained #5250

Open Yang-bug-star opened 8 months ago

Yang-bug-star commented 8 months ago

Describe the bug

When I try to pretrain LLaVA, which uses Llama-2-7b-chat as its language model, with DeepSpeed ZeRO-3 on 2 V100 GPUs on a single node, the rank 0 and rank 1 processes both try to load '../llama/hug_llama_2_7b_chat/pytorch_model-00001-of-00003.bin' simultaneously, and the run eventually fails with:

"OSError: Unable to load weights from pytorch checkpoint file for '../llama/hug_llama_2_7b_chat/pytorch_model-00001-of-00003.bin' at '../llama/hug_llama_2_7b_chat/pytorch_model-00001-of-00003.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True."

However, training with DeepSpeed ZeRO-2 works successfully. How can I fix this error with ZeRO-3?

To Reproduce
Steps to reproduce the behavior:

  1. ZeRO-2 config file:

     ```json
     {
       "fp16": {
         "enabled": "auto",
         "loss_scale": 0,
         "loss_scale_window": 1000,
         "initial_scale_power": 16,
         "hysteresis": 2,
         "min_loss_scale": 1
       },
       "bf16": {
         "enabled": "auto"
       },
       "train_micro_batch_size_per_gpu": "auto",
       "train_batch_size": "auto",
       "gradient_accumulation_steps": "auto",
       "zero_optimization": {
         "stage": 2,
         "overlap_comm": true,
         "contiguous_gradients": true,
         "sub_group_size": 1e9,
         "reduce_bucket_size": "auto"
       }
     }
     ```
  2. ZeRO-3 config file:

     ```json
     {
       "fp16": {
         "enabled": "auto",
         "loss_scale": 0,
         "loss_scale_window": 1000,
         "initial_scale_power": 16,
         "hysteresis": 2,
         "min_loss_scale": 1
       },
       "bf16": {
         "enabled": "auto"
       },
       "train_micro_batch_size_per_gpu": "auto",
       "train_batch_size": "auto",
       "gradient_accumulation_steps": "auto",
       "zero_optimization": {
         "stage": 3,
         "overlap_comm": true,
         "contiguous_gradients": true,
         "sub_group_size": 1e9,
         "reduce_bucket_size": "auto",
         "stage3_prefetch_bucket_size": "auto",
         "stage3_param_persistence_threshold": "auto",
         "stage3_max_live_parameters": 1e9,
         "stage3_max_reuse_distance": 1e9,
         "stage3_gather_16bit_weights_on_model_save": true
       }
     }
     ```
  3. Call transformers.LlamaForCausalLM.from_pretrained to load the sharded Hugging Face-format weights of Llama-2-7b-chat (a minimal sketch of this step follows the list).
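For context, here is a minimal sketch of step 3 under ZeRO-3 using transformers' DeepSpeed integration. The checkpoint path is the one from the traceback above; the concrete config values are illustrative assumptions, since the "auto" placeholders are only resolved automatically inside the HF Trainer:

```python
# Minimal sketch, assuming the checkpoint directory from the traceback above.
import transformers
from transformers.integrations import HfDeepSpeedConfig

ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # "auto" must be replaced by hand here
    "zero_optimization": {"stage": 3},
    # ... remaining ZeRO-3 settings from the config above
}

# Keeping this object alive *before* calling from_pretrained makes transformers
# build the model inside deepspeed.zero.Init, so each rank holds only its
# partition of the parameters instead of materializing the full model.
dschf = HfDeepSpeedConfig(ds_config)  # must stay referenced

model = transformers.LlamaForCausalLM.from_pretrained(
    "../llama/hug_llama_2_7b_chat"
)
```

In an actual run this would be launched with the deepspeed launcher so that the distributed process group is initialized before loading.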

Screenshots: [two screenshots of the error traceback]

System info (please complete the following information):

njucckevin commented 8 months ago

I have the same problem. Have you solved it?

njucckevin commented 8 months ago

I solved this problem by updating torch to 2.0.1 with CUDA 11.8. It seems that ZeRO-3 requires a newer torch version.
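A quick way to check whether a local environment matches the combination reported above (torch 2.0.1 / CUDA 11.8 is just the setup reported to work here, not a documented minimum):

```python
# Print the installed torch build to compare against the reported fix.
import torch

print(torch.__version__)          # reported working: 2.0.1
print(torch.version.cuda)         # reported working: 11.8
print(torch.cuda.is_available())  # should be True on the V100 node
```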