Describe the bug
"Unable to load weights from pytorch checkpoint file for '../llama/hug_llama_2_7b_chat/pytorch_model-00001-of-00003.bin"
When I try to pretrain llava that uses Llama-2-7b-chat as language model using deepspeed zero3 on 2 v100 gpus on 1 node, it shows that rank 0 process and rank 1 process both try to load '../llama/hug_llama_2_7b_chat/pytorch_model-00001-of-00003.bin" simultaneously and finally an error was reported as "OSError: Unable to load weights from pytorch checkpoint file for '../llama/hug_llama_2_7b_chat/pytorch_model-00001-of-00003.bin' at '../llama/hug_llama_2_7b_chat/pytorch_model-00001-of-00003.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.".
However, when I trained with deepspeed zero2, it works successfully. How can I fix the error with zero3?
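For context, the language model weights are loaded through the standard Hugging Face from_pretrained path; the sketch below is a simplified, assumed version of that step (placeholder path and dtype, not the exact LLaVA training code) and marks where each rank hits the OSError.

```python
# Simplified sketch of the loading step each rank executes under ZeRO-3
# (assumed path/dtype for illustration; not the exact LLaVA training code).
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "../llama/hug_llama_2_7b_chat",  # local dir with the 3 sharded .bin files
    torch_dtype=torch.float16,       # fp16 assumed for V100 training
)
# With ZeRO-3 on 2 GPUs, rank 0 and rank 1 both open
# pytorch_model-00001-of-00003.bin here, and this call is where the
# "Unable to load weights from pytorch checkpoint file" OSError is raised.
```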
To Reproduce
Steps to reproduce the behavior:
"overlap_comm": true,
"contiguous_gradients": true, "sub_group_size": 1e9, "reduce_bucket_size": "auto" } }
Screenshots
System info (please complete the following information):