huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

The saved model with deepspeed zero3 can not be correctly loaded #2885

Closed: rubickkcibur closed this issue 3 months ago

rubickkcibur commented 3 months ago

System Info

- `Accelerate` version: 0.30.1
- Platform: Linux-5.4.0-177-generic-x86_64-with-glibc2.31
- `accelerate` bash location: /home/rubickjiang/anaconda3/envs/deepspeed/bin/accelerate
- Python version: 3.10.14     
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.3.1 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 503.59 GB
- GPU type: NVIDIA GeForce RTX 4090
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: DEEPSPEED
        - use_cpu: False
        - debug: False
        - num_processes: 8
        - machine_rank: 0
        - num_machines: 1
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - deepspeed_config: {'deepspeed_config_file': '/home/rubickjiang/test/ds_config.json', 'zero3_init_flag': True}
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []


Reproduction

First, I use a trainer to train my model (Llama3-8B) and save it:

```python
trainer = MyTrainer(
    modelL,
    training_args,
    train_dataset=train_dataset,
)
trainer.train()
trainer.save_model()
```

The DeepSpeed config is: ds_config.json
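(The attached file is not reproduced here. For illustration only, not the author's actual file, a ZeRO-3 config used with the HF Trainer typically looks like the sketch below. The `stage3_gather_16bit_weights_on_model_save` flag is relevant to this workflow: when `true`, `trainer.save_model()` writes a consolidated model file directly; when `false`, only sharded ZeRO states are saved and `zero_to_fp32.py` must be run afterwards, as done in this report.)

```json
{
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "stage3_gather_16bit_weights_on_model_save": false
  },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "train_batch_size": "auto"
}
```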

Then, the file tree of the saved checkpoint is:

```
checkpoint-2772
|- config.json
|- generation_config.json
|- training_args.bin
|- global_step2772
|- zero_to_fp32.py
|- latest
|- trainer_state.json
```

I run zero_to_fp32.py to recover the weights:

```
python zero_to_fp32.py . pytorch_model.bin
```
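(Side note: the same conversion can also be done in-process with DeepSpeed's helper instead of the script; a minimal sketch, assuming the directory layout above:)

```python
import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

# Consolidate the ZeRO shards under global_step2772 (resolved via the
# "latest" file) into a single fp32 state dict on CPU, then save it.
state_dict = get_fp32_state_dict_from_zero_checkpoint("checkpoint-2772")
torch.save(state_dict, "checkpoint-2772/pytorch_model.bin")
```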

After the conversion finishes, the file tree of the saved checkpoint is:

```
checkpoint-2772
|- config.json
|- generation_config.json
|- training_args.bin
|- global_step2772
|- zero_to_fp32.py
|- latest
|- trainer_state.json
|- pytorch_model.bin
```

Finally, I load this checkpoint with transformers.AutoModelForCausalLM.from_pretrained and run inference:

```python
modelL = transformers.AutoModelForCausalLM.from_pretrained(
    "checkpoint-2772",
    torch_dtype=torch.bfloat16,
)
modelL.generate(
    **prompts_tokenized,
    max_length=training_args.model_max_length,
    num_return_sequences=1,
    temperature=0.7,
    pad_token_id=tokenizerL.eos_token_id,
)
```

And I get this error:

```
Traceback (most recent call last):
rank2:   File "/home/rubickjiang/adadf/evaluation.py", line 293, in <module>
rank2:   File "/home/rubickjiang/adadf/evaluation.py", line 221, in evaluation_main
rank2:   File "/home/rubickjiang/anaconda3/envs/deepspeed/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
rank2:     return func(*args, **kwargs)
rank2:   File "/home/rubickjiang/anaconda3/envs/deepspeed/lib/python3.10/site-packages/transformers/generation/utils.py", line 1758, in generate
rank2:     result = self._sample(
rank2:   File "/home/rubickjiang/anaconda3/envs/deepspeed/lib/python3.10/site-packages/transformers/generation/utils.py", line 2397, in _sample
rank2:     outputs = self(
rank2:   File "/home/rubickjiang/anaconda3/envs/deepspeed/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
rank2:     return self._call_impl(*args, **kwargs)
rank2:   File "/home/rubickjiang/anaconda3/envs/deepspeed/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
rank2:     return forward_call(*args, **kwargs)
rank2:   File "/home/rubickjiang/anaconda3/envs/deepspeed/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1164, in forward
rank2:     outputs = self.model(
rank2:   File "/home/rubickjiang/anaconda3/envs/deepspeed/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
rank2:     return self._call_impl(*args, **kwargs)
rank2:   File "/home/rubickjiang/anaconda3/envs/deepspeed/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
rank2:     return forward_call(*args, **kwargs)
rank2:   File "/home/rubickjiang/anaconda3/envs/deepspeed/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 925, in forward
rank2:     inputs_embeds = self.embed_tokens(input_ids)
rank2:   File "/home/rubickjiang/anaconda3/envs/deepspeed/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
rank2:     return self._call_impl(*args, **kwargs)
rank2:   File "/home/rubickjiang/anaconda3/envs/deepspeed/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
rank2:     return forward_call(*args, **kwargs)
rank2:   File "/home/rubickjiang/anaconda3/envs/deepspeed/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 163, in forward
rank2:     return F.embedding(
rank2:   File "/home/rubickjiang/anaconda3/envs/deepspeed/lib/python3.10/site-packages/torch/nn/functional.py", line 2264, in embedding
rank2:     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
rank2: RuntimeError: 'weight' must be 2-D
```

My deepspeed version is "0.14.3+69adeab6", accelerate version is "0.30.1", and transformers version is "4.41.2".

Expected behavior

I'd expect no errors, with inference running normally.

BenjaminBossan commented 3 months ago

What is the size of pytorch_model.bin? Is it approximately the size you would expect? Can you load it with torch.load?
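Something like this should work as a quick check (the state-dict key is an assumption based on the Llama architecture):

```python
import torch

# Load the consolidated checkpoint on CPU and inspect it.
state_dict = torch.load("checkpoint-2772/pytorch_model.bin", map_location="cpu")
print(len(state_dict), "tensors")

# For Llama3-8B the token embedding should be 2-D, roughly [128256, 4096].
print(state_dict["model.embed_tokens.weight"].shape)
```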

rubickkcibur commented 3 months ago

I think I found the problem. During inference I was using the same accelerate config as for training, i.e. the one with the DeepSpeed config. When I switch to a config without DeepSpeed, the error doesn't happen. (Presumably, with `zero3_init_flag: True` the model is ZeRO-3-partitioned at load time, so each rank holds only a flattened shard of every parameter and the embedding weight seen by `F.embedding` is no longer 2-D.)
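(For reference, a minimal multi-GPU accelerate config without DeepSpeed might look like the sketch below; the values mirror the config printed above and are illustrative rather than the exact file used.)

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_processes: 8
num_machines: 1
machine_rank: 0
mixed_precision: bf16
main_training_function: main
use_cpu: false
```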