KeremTurgutlu opened this issue 1 year ago
@KeremTurgutlu, thanks for reporting this issue. Can you please provide an easy repro? Thanks!
@tjruwase, I'm running into a similar issue. My guess is that if you run this test with a ZeRO stage-2 optimizer, the same problem should occur.
Edit: I just tested TestZeroFrozenWeights::test
with the ZeRO stage set to 2, and it fails... but it emits a different log from the one OP posted.
@insoochung, I am not sure the unit test failure is the same. I have created #4140 to fix the unit test problem and generalize it to ZeRO stages 1 and 2. If you are using zero.Init()
in your code, then perhaps you are hitting the same API issue as the unit test. Otherwise, could you share a repro to help us investigate?
@KeremTurgutlu, FYI
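(In case it helps others reading along: zero.Init() is DeepSpeed's context manager for constructing a model with its parameters partitioned across ranks at creation time, which is where the zero.Init()-related API issue mentioned above would show up. A minimal sketch of that usage, assuming a distributed launch such as deepspeed or accelerate launch; the tiny Llama config is just a placeholder:)

import deepspeed
from transformers import AutoModelForCausalLM, LlamaConfig

# Parameters created inside this context are partitioned across ranks as they
# are allocated (ZeRO-3 style) instead of being fully materialized on each GPU.
with deepspeed.zero.Init():
    model = AutoModelForCausalLM.from_config(LlamaConfig(num_hidden_layers=2))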
@tjruwase I can run ZeRO-2 with frozen weights in the middle layers. However, I'm currently experiencing another issue:
with the same hyperparameters, ZeRO-2 shows very unstable losses, while ZeRO-3 shows a smooth loss over training.
Do you have any idea how to solve this? Thanks.
@superaha, could you please open a separate ticket and share steps to repro the issue with zero2 and zero3? Thanks!
Hi @tjruwase, I'm running into the same issue. I will provide a repro here (simplified as much as possible).
Please put these three files in the same directory (remember to rename the first two from .txt to .py, and deepspeed_config.txt to deepspeed_config.yaml), and reproduce the result with:
accelerate launch --config_file "deepspeed_config.yaml" train_test.py --model_name "NousResearch/Llama-2-7b-hf" \
--dataset_name "smangrul/code-chat-assistant-v1" --max_seq_len 512 --max_steps 1000 --logging_steps 25 --eval_steps 100 \
--save_steps 500 --bf16 True --packing True --output_dir "full-finetune-llama-chat-asst" --per_device_train_batch_size 1 \
--gradient_accumulation_steps 1 --dataset_text_field "content" --use_gradient_checkpointing --learning_rate 5e-5 \
--lr_scheduler_type "cosine" --weight_decay 0.01 --warmup_ratio 0.03 --use_flash_attn True
train_test.txt utils.txt deepspeed_config.txt
To save time, you only need to review lines 147 to 149 of train_test.py.
Currently the code runs fine, but if you uncomment these three lines, the code throws an error as follows:
# for param in model.parameters():
# param.requires_grad = False
# model.get_input_embeddings().requires_grad = True
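(A side note, not part of the original repro: model.get_input_embeddings() returns an nn.Module, so assigning .requires_grad = True on it only sets a plain attribute and does not unfreeze the embedding weights, meaning every parameter stays frozen here. Unfreezing the embedding parameters themselves would look like this sketch:)

# Sketch only: unfreeze the embedding parameters rather than setting an
# attribute on the embedding module.
model.get_input_embeddings().weight.requires_grad = True
# or, equivalently, for all parameters of that submodule:
model.get_input_embeddings().requires_grad_(True)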
errors:
Traceback (most recent call last):
File "/home/yuzhounie/projects/DHS-LLM-Workshop/chat_assistant/training/train_test.py", line 190, in <module>
main(args)
File "/home/yuzhounie/projects/DHS-LLM-Workshop/chat_assistant/training/train_test.py", line 184, in main
trainer.train()
File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 1555, in train
return inner_training_loop(
File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 1689, in _inner_training_loop
model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/accelerate/accelerator.py", line 1284, in prepare
result = self._prepare_deepspeed(*args)
File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/accelerate/accelerator.py", line 1666, in _prepare_deepspeed
engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/__init__.py", line 171, in initialize
engine = DeepSpeedEngine(args=args,
File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 304, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1225, in _configure_optimizer
self.optimizer = self._configure_zero_optimizer(basic_optimizer)
File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1552, in _configure_zero_optimizer
optimizer = DeepSpeedZeroOptimizer_Stage3(
File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 146, in __init__
self.dtype = self.optimizer.param_groups[0]['params'][0].dtype
IndexError: list index out of range
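(For what it's worth, a minimal stand-alone sketch of what that last frame trips over, assuming the optimizer's parameter groups were built by keeping only trainable parameters, as the HF Trainer does; the toy Linear model is just a placeholder:)

import torch

# Toy stand-in for a model whose parameters are all frozen.
model = torch.nn.Linear(8, 8)
model.requires_grad_(False)

# Trainer-style grouping keeps only trainable parameters, so the group is empty.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW([{"params": trainable, "weight_decay": 0.0}], lr=1e-4)

# DeepSpeed's stage-3 wrapper reads the dtype of the first parameter of the
# first group, which fails when that list is empty.
print(optimizer.param_groups[0]["params"][0].dtype)  # IndexError: list index out of range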
Hello, I had the same error while training with PyTorch Lightning. While making a simple reproduction, I found the issue and a workaround :)
Reproduction:
from deepspeed.ops.adam import FusedAdam
from pytorch_lightning import Trainer, LightningModule
from transformers import AutoModelForCausalLM, LlamaConfig
import torch
from torch.utils.data import DataLoader, TensorDataset
class Module(LightningModule):
    def training_step(self, batch: torch.Tensor, batch_idx: int) -> torch.Tensor:
        raise Exception("Shouldn't reach here")

    def configure_model(self) -> None:
        # Build a tiny Llama model and freeze everything except the last parameter.
        self.model = AutoModelForCausalLM.from_config(
            LlamaConfig(num_hidden_layers=2, num_attention_heads=6, hidden_size=192))
        self.model.requires_grad_(False)
        list(self.model.parameters())[-1].requires_grad = True  # unfreeze the head

    def configure_optimizers(self):
        optim_groups = [
            {"params": [p for n, p in self.model.named_parameters() if 'norm' not in n], "weight_decay": 0.1},
            {"params": [p for n, p in self.model.named_parameters() if 'norm' in n], "weight_decay": 0.0},
        ]
        return FusedAdam(optim_groups, lr=0.1)

trainer = Trainer(
    precision='bf16',
    strategy='deepspeed_stage_2',
)
trainer.fit(Module(), DataLoader(TensorDataset(torch.arange(0, 100).unsqueeze(1))))
The problem is that the optimizer receives an entire parameter group whose parameters all have requires_grad=False,
hence the fix is to simply filter parameter groups by requires_grad.
This doesn't happen if you just pass Optimizer(self.model.parameters()),
since there is one unfrozen parameter in that single group.
It's probably a good idea to improve the error messaging in the DeepSpeed code, though.
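(For anyone landing here, a minimal sketch of the filtering workaround described above, adapted to the Lightning repro; this is a user-side workaround, not an official DeepSpeed fix:)

def configure_optimizers(self):
    # Keep only trainable parameters so that no group consists entirely of
    # frozen (requires_grad=False) parameters.
    decay, no_decay = [], []
    for name, param in self.model.named_parameters():
        if not param.requires_grad:
            continue
        (no_decay if 'norm' in name else decay).append(param)
    optim_groups = [
        {"params": decay, "weight_decay": 0.1},
        {"params": no_decay, "weight_decay": 0.0},
    ]
    # Drop any group that became empty after filtering.
    optim_groups = [g for g in optim_groups if g["params"]]
    return FusedAdam(optim_groups, lr=0.1)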
@olegsinavski thanks for sharing, what was the fix exactly?
Describe the bug
I saw that this issue had been resolved: https://github.com/microsoft/DeepSpeed/issues/2615, and tried freezing some layers of the model. The same training script works fine without the following.
This is a workaround, but it is not favorable because of the unnecessary resource usage: