microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

Deepspeed fails with frozen weights (e.g. only train llama2 embedding layer) #4776

Open · rucnyz opened this issue 11 months ago

rucnyz commented 11 months ago

Describe the bug
This bug is similar to #4055; I provide a repro here.

To Reproduce
Please put these three files in the same directory (remember to rename the first two from .txt to .py, and deepspeed_config.txt to deepspeed_config.yaml), then reproduce the result with:

accelerate launch --config_file "deepspeed_config.yaml" train_test.py --model_name "NousResearch/Llama-2-7b-hf" \
--dataset_name "smangrul/code-chat-assistant-v1" --max_seq_len 512 --max_steps 1000 --logging_steps 25 --eval_steps 100 \
--save_steps 500 --bf16 True --packing True --output_dir "full-finetune-llama-chat-asst" --per_device_train_batch_size 1 \
--gradient_accumulation_steps 1 --dataset_text_field "content" --use_gradient_checkpointing --learning_rate 5e-5  \
--lr_scheduler_type "cosine" --weight_decay 0.01 --warmup_ratio 0.03 --use_flash_attn True

Attachments: train_test.txt, utils.txt, deepspeed_config.txt

As provided, the code runs fine, but if I uncomment these three lines (lines 147 to 149 in train_test.py), it throws the error shown below:

# for param in model.parameters():
#     param.requires_grad = False
# model.get_input_embeddings().requires_grad = True
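
(Note: in PyTorch, requires_grad is an attribute of Parameter objects, not of Modules, so the third line above likely leaves the embedding weights frozen as well. A sketch of the presumably intended pattern, assuming the goal is to train only the input embeddings:

# Freeze everything, then re-enable gradients on the embedding parameters themselves.
for param in model.parameters():
    param.requires_grad = False
for param in model.get_input_embeddings().parameters():
    param.requires_grad = True

Either way, the DeepSpeed failure below is reproduced with the three lines exactly as written above.)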

Error:

Traceback (most recent call last):
  File "/home/yuzhounie/projects/DHS-LLM-Workshop/chat_assistant/training/train_test.py", line 190, in <module>
    main(args)
  File "/home/yuzhounie/projects/DHS-LLM-Workshop/chat_assistant/training/train_test.py", line 184, in main
    trainer.train()
  File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 1555, in train
    return inner_training_loop(
  File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 1689, in _inner_training_loop
    model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
  File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/accelerate/accelerator.py", line 1284, in prepare
    result = self._prepare_deepspeed(*args)
  File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/accelerate/accelerator.py", line 1666, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/__init__.py", line 171, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 304, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1225, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1552, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer_Stage3(
  File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 146, in __init__
    self.dtype = self.optimizer.param_groups[0]['params'][0].dtype
IndexError: list index out of range
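
For context, the failing line in stage3.py reads param_groups[0]['params'][0].dtype, which assumes the first optimizer parameter group is non-empty. A minimal check (a sketch, assuming the same model object as in train_test.py) showing why that group ends up empty with the three lines uncommented:

for param in model.parameters():
    param.requires_grad = False
model.get_input_embeddings().requires_grad = True  # sets a Module attribute only, not Parameter.requires_grad

# The HF Trainer only hands parameters with requires_grad=True to the optimizer,
# so every optimizer param group is empty and params[0] raises IndexError.
trainable = [p for p in model.parameters() if p.requires_grad]
print(len(trainable))  # 0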

System info (please complete the following information):

Launcher context
accelerate launch

freckletonj commented 7 months ago

I was freezing my input embeddings the same way as you, using DeepSpeed (stage 2), and the resulting weights can't be read back in; maybe this is related?

for param in emb.parameters():
    param.requires_grad = False

I'm hitting the same problem, where the weights can't be reloaded because emb.weight is missing.

I've dropped a breakpoint() here: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/utils/zero_to_fp32.py#L105

and observed this:

(Pdb) [x for x in state_dict['module'] if 'emb' in x]
['_forward_module.emb.weight']

(Pdb) [x for x in state_dict[PARAM_SHAPES] if 'emb' in x]
[]

(Pdb) state_dict[FROZEN_PARAM_SHAPES]
None

So the frozen embedding weight is present in state_dict['module'], but not in state_dict[PARAM_SHAPES], and state_dict[FROZEN_PARAM_SHAPES] is None.

This is as far as I've been able to debug; hopefully it helps with further debugging.

edit: I've also confirmed that the only place in the entire state_dict where my emb weight shows up is under 'module':

{'module': OrderedDict([('_forward_module.emb.weight', tensor([[ ...]]))]), ...}
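
One possible workaround, sketched under the assumption that only frozen parameters are missing from the converted checkpoint and that the original pretrained weights are still available (file names are illustrative): since frozen parameters never change during training, they can be copied back in after running zero_to_fp32.py.

import torch

full_sd = torch.load("pytorch_model_fp32.bin", map_location="cpu")  # output of zero_to_fp32.py
base_sd = torch.load("base_model.bin", map_location="cpu")          # original, untrained weights

# Frozen parameters were never updated, so the original values are still valid.
for key, value in base_sd.items():
    if key not in full_sd:
        full_sd[key] = value

torch.save(full_sd, "pytorch_model_fp32_fixed.bin")
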
jomayeri commented 1 month ago

The repro script exits with the error shown in the attached image.