Zero Stage-2 Frozen Layers[BUG] #4055

Open KeremTurgutlu opened 1 year ago

KeremTurgutlu commented 1 year ago

Describe the bug

I saw that this issue has been resolved and tried freezing some layers of the model. The same training script works fine without the following:

for name, param in model.named_parameters():
    if any(ln in name for ln in ["embed", "lm_head"]):
        param.requires_grad = True
    else:
        param.requires_grad = False

Traceback (most recent call last):
  File "", line 947, in <module>
  File "", line 456, in main
    model, optimizer, train_dl, valid_dl, lr_scheduler = accelerator.prepare(
  File "/usr/local/lib/python3.8/dist-packages/accelerate/", line 1198, in prepare
    result = self._prepare_deepspeed(*args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/", line 1537, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/", line 171, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/", line 310, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/", line 1209, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/", line 1444, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/", line 312, in __init__
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/", line 834, in flatten_dense_tensors_aligned
    return self.flatten(align_dense_tensors(tensor_list, alignment))
  File "/usr/local/lib/python3.8/dist-packages/torch/", line 451, in _flatten_dense_tensors
    return torch._C._nn.flatten_dense_tensors(tensors)
RuntimeError: expected a non-empty list of Tensors

{
    "bf16": {
        "enabled": true
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "none",
            "pin_memory": true
        },
        "offload_param": {
            "device": "none",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true
    },
    "gradient_accumulation_steps": 21,
    "gradient_clipping": 1.0,
    "steps_per_print": 1000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

This is a workaround, but it is not ideal because of the unnecessary resource usage:

    if args.freeze_all_but_embed:
        no_decay = ["bias", "norm.weight"]
        freeze_but = ["embed", "lm_head"]
        optimizer_grouped_parameters = [
            # embedding and lm_head layers.
            {
                "params": [p for n, p in model.named_parameters() if any(nd in n for nd in freeze_but)],
                "weight_decay": args.weight_decay,
                "lr": args.learning_rate
            },
            # all layers excluding embedding, lm_head and layer/rmsnorm.
            {
                "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay+freeze_but)],
                "weight_decay": 0.0, # args.weight_decay
                "lr": 0.0
            },
            # layer/rmsnorm.
            {
                "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
                "weight_decay": 0.0,
                "lr": 0.0
            },
        ]

tjruwase commented 1 year ago

@KeremTurgutlu, thanks for reporting this issue. Can you please provide an easy repro? Thanks!

insoochung commented 1 year ago

@tjruwase, I'm running into a similar issue. My guess is that if you run this test with zero stage-2 optimizer the same problem should occur.

Edit: just tested TestZeroFrozenWeights::test with zero stage set to 2, which fails... but emits a different log compared to what OP posted.

tjruwase commented 1 year ago

@insoochung, I am not sure the unit test failure is the same. I have created #4140 to fix the unit test problem and generalize zero stages 1 and 2. If you are using zero.Init() in your code, then perhaps you are experiencing the same API issue as the unit test. Otherwise, could you share a repro to help us investigate?

@KeremTurgutlu, FYI

superaha commented 1 year ago

@tjruwase I can run zero2 with frozen weights in the middle layers. However, I am now seeing another issue:

With the same parameters, zero2 shows very unstable losses, while zero3 shows a smooth loss throughout training.

Do you have any idea how to solve this issue? Thx.

tjruwase commented 1 year ago

@superaha, could you please open a separate ticket and share steps to repro the issue with zero2 and zero3? Thanks!

rucnyz commented 9 months ago

Hi @tjruwase, I'm running into the same issue. I will provide a repro here (I have simplified it as much as possible).

Please put these three files in the same directory (remember to change the first two .txt -> .py and deepspeed_config.txt -> deepspeed_config.yaml), and reproduce the result with:

accelerate launch --config_file "deepspeed_config.yaml" --model_name "NousResearch/Llama-2-7b-hf" \
--dataset_name "smangrul/code-chat-assistant-v1" --max_seq_len 512 --max_steps 1000 --logging_steps 25 --eval_steps 100 \
--save_steps 500 --bf16 True --packing True --output_dir "full-finetune-llama-chat-asst" --per_device_train_batch_size 1 \
--gradient_accumulation_steps 1 --dataset_text_field "content" --use_gradient_checkpointing --learning_rate 5e-5  \
--lr_scheduler_type "cosine" --weight_decay 0.01 --warmup_ratio 0.03 --use_flash_attn True

To save you time, you only need to review lines 147 to 149 in the file. Currently, the code runs fine, but if you uncomment these three lines, it throws the following error:

# for param in model.parameters():
#     param.requires_grad = False
# model.get_input_embeddings().requires_grad = True


Traceback (most recent call last):
  File "/home/yuzhounie/projects/DHS-LLM-Workshop/chat_assistant/training/", line 190, in <module>
  File "/home/yuzhounie/projects/DHS-LLM-Workshop/chat_assistant/training/", line 184, in main
  File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/transformers/", line 1555, in train
    return inner_training_loop(
  File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/transformers/", line 1689, in _inner_training_loop
    model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
  File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/accelerate/", line 1284, in prepare
    result = self._prepare_deepspeed(*args)
  File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/accelerate/", line 1666, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/", line 171, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/runtime/", line 304, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/runtime/", line 1225, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/runtime/", line 1552, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer_Stage3(
  File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/runtime/zero/", line 146, in __init__
    self.dtype = self.optimizer.param_groups[0]['params'][0].dtype
IndexError: list index out of range

olegsinavski commented 8 months ago

Hello, I had the same error while training with pytorch lightning. While making a simple reproduction, I found the issue and a workaround:)


from deepspeed.ops.adam import FusedAdam
from pytorch_lightning import Trainer, LightningModule
from transformers import AutoModelForCausalLM, LlamaConfig
import torch
from torch.utils.data import DataLoader, TensorDataset

class Module(LightningModule):
    def training_step(self, batch: torch.Tensor, batch_idx: int) -> torch.Tensor:
        raise Exception("Shouldn't reach here")

    def configure_model(self) -> None:
        self.model = AutoModelForCausalLM.from_config(
            LlamaConfig(n_layer=2, n_head=6, n_embd=192))
        self.model.requires_grad_(False)  # freeze all parameters first
        list(self.model.parameters())[-1].requires_grad = True  # unfreeze the head

    def configure_optimizers(self):
        # The second group contains only frozen 'norm' parameters.
        optim_groups = [
            {"params": [p for n, p in self.model.named_parameters() if 'norm' not in n], "weight_decay": 0.1},
            {"params": [p for n, p in self.model.named_parameters() if 'norm' in n], "weight_decay": 0.0},
        ]
        return FusedAdam(optim_groups, lr=0.1)

trainer = Trainer(
    # arguments not shown
)
trainer.fit(Module(), DataLoader(TensorDataset(torch.arange(0, 100).unsqueeze(1))))

The problem is that the optimizer receives a parameter group in which every parameter has requires_grad=False. The fix is simply to filter the parameter groups by requires_grad. This doesn't happen if you just pass Optimizer(self.model.parameters()), since the single group then contains at least one unfrozen parameter.
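
A minimal sketch of that filtering, not the exact code from this thread: `trainable_param_groups` is a hypothetical helper that keeps only the parameters with `requires_grad=True` in each group and drops any group that ends up empty. It only reads the `requires_grad` attribute, so the demo uses stand-in objects instead of real tensors.

```python
from types import SimpleNamespace


def trainable_param_groups(param_groups):
    """Keep only trainable params; drop groups that end up empty."""
    filtered = []
    for group in param_groups:
        params = [p for p in group["params"] if p.requires_grad]
        if params:  # skip groups with no trainable parameters
            # Copy the group so its other options (weight_decay, lr, ...) are preserved.
            filtered.append({**group, "params": params})
    return filtered


# Stand-ins for parameters: only the `requires_grad` attribute matters here.
head = SimpleNamespace(requires_grad=True)
frozen_linear = SimpleNamespace(requires_grad=False)
frozen_norm = SimpleNamespace(requires_grad=False)

optim_groups = [
    {"params": [frozen_linear, head], "weight_decay": 0.1},
    {"params": [frozen_norm], "weight_decay": 0.0},  # entirely frozen
]

filtered_groups = trainable_param_groups(optim_groups)
print(len(filtered_groups))  # number of groups that still have trainable params
```

With a real model, the filtered list would presumably be passed to the optimizer in place of the original groups, for example `FusedAdam(trainable_param_groups(optim_groups), lr=0.1)`.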

It's probably a good idea to improve the error messaging in the deepspeed code, though.
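
As an illustration of the kind of message that would help, here is a sketch of a user-side pre-flight check. It is not DeepSpeed code, and `check_param_groups` is a hypothetical name. It raises a descriptive error when a group has no trainable parameters, instead of letting the failure surface later as `expected a non-empty list of Tensors`.

```python
from types import SimpleNamespace


def check_param_groups(param_groups):
    """Raise early if any parameter group has no trainable parameters."""
    for index, group in enumerate(param_groups):
        if not any(p.requires_grad for p in group["params"]):
            raise ValueError(
                f"Parameter group {index} has no parameters with requires_grad=True; "
                "remove the group or unfreeze at least one of its parameters."
            )


# Stand-in parameters: only `requires_grad` is inspected.
groups = [
    {"params": [SimpleNamespace(requires_grad=True)]},
    {"params": [SimpleNamespace(requires_grad=False)]},  # entirely frozen
]

try:
    check_param_groups(groups)
    message = None
except ValueError as err:
    message = str(err)
print(message)
```

Calling such a check on the optimizer's groups before handing them to the training framework would point directly at the offending group.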

pharaouk commented 6 months ago

@olegsinavski thanks for sharing, what was the fix exactly?