microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] Zero Stage-2 Frozen Layers #4055

Open KeremTurgutlu opened 1 year ago

KeremTurgutlu commented 1 year ago

Describe the bug

I saw that this issue has been resolved: https://github.com/microsoft/DeepSpeed/issues/2615, so I tried freezing some layers of the model. The same training script works fine without the following code.

for name, param in model.named_parameters():
    if any(ln in name for ln in ["embed", "lm_head"]):
        param.requires_grad = True
    else:
        param.requires_grad = False
Traceback (most recent call last):
  File "pretrain_deepspeed.py", line 947, in <module>
    main()
  File "pretrain_deepspeed.py", line 456, in main
    model, optimizer, train_dl, valid_dl, lr_scheduler = accelerator.prepare(
  File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1198, in prepare
    result = self._prepare_deepspeed(*args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1537, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py", line 171, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 310, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1209, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1444, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 312, in __init__
    self.flatten_dense_tensors_aligned(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 834, in flatten_dense_tensors_aligned
    return self.flatten(align_dense_tensors(tensor_list, alignment))
  File "/usr/local/lib/python3.8/dist-packages/torch/_utils.py", line 451, in _flatten_dense_tensors
    return torch._C._nn.flatten_dense_tensors(tensors)
RuntimeError: torch.cat(): expected a non-empty list of Tensors
{
    "bf16": {
        "enabled": true
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "none",
            "pin_memory": true
        },
        "offload_param": {
            "device": "none",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true
    },
    "gradient_accumulation_steps": 21,
    "gradient_clipping": 1.0,
    "steps_per_print": 1000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

This is a workaround, but it is not favorable because of the unnecessary resource usage:

    if args.freeze_all_but_embed:        
        no_decay = ["bias", "norm.weight"]
        freeze_but = ["embed", "lm_head"]
        optimizer_grouped_parameters = [
            # embedding and lm_head layers.
            {
                "params": [p for n, p in model.named_parameters() if any(nd in n for nd in freeze_but)],
                "weight_decay": args.weight_decay,
                "lr": args.learning_rate
            },
            # all layers excluding embedding, lm_head and layer/rmsnorm.
            {
                "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay+freeze_but)],
                "weight_decay": 0.0, # args.weight_decay
                "lr": 0.0
            },
            # layer/rmsnorm.
            {
                "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
                "weight_decay": 0.0,
                "lr": 0.0
            }
        ]
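
For context, a minimal sketch of how these groups would then be consumed, assuming the script builds a standard torch.optim.AdamW (the actual optimizer construction is not shown above). Because the frozen layers are merely given lr=0.0 rather than being excluded, ZeRO-2 still partitions and stores optimizer state for them, which is where the unnecessary resource usage comes from.

        # Hypothetical continuation of the snippet above: every parameter,
        # including those in the lr=0.0 groups, is handed to the optimizer,
        # so ZeRO-2 keeps optimizer state for frozen weights as well.
        optimizer = torch.optim.AdamW(optimizer_grouped_parameters, lr=args.learning_rate)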
tjruwase commented 1 year ago

@KeremTurgutlu, thanks for reporting this issue. Can you please provide an easy repro? Thanks!

insoochung commented 1 year ago

@tjruwase, I'm running into a similar issue. My guess is that if you run this test with the ZeRO stage-2 optimizer, the same problem should occur.

Edit: I just tested TestZeroFrozenWeights::test with the ZeRO stage set to 2, and it fails... but it emits a different log than what the OP posted.

tjruwase commented 1 year ago

@insoochung, I am not sure the unit test failure is the same issue. I have created #4140 to fix the unit test problem and generalize it to ZeRO stages 1 and 2. If you are using zero.Init() in your code, then perhaps you are experiencing the same API issue as the unit test. Otherwise, could you share a repro to help us investigate?

@KeremTurgutlu, FYI
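
For reference, zero.Init() above refers to DeepSpeed's context manager that allocates and partitions model parameters across ranks while the model is being constructed. A minimal, hypothetical sketch of that pattern (MyModel and ds_config are placeholder names, not part of this issue):

import deepspeed

# ds_config and MyModel are placeholders for illustration only.
# Parameters created inside this context are partitioned at construction time.
with deepspeed.zero.Init(config_dict_or_path=ds_config):
    model = MyModel()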

superaha commented 1 year ago

@tjruwase I can run ZeRO-2 with frozen weights in the middle layers. However, I am currently experiencing another issue:

With the same parameters, ZeRO-2 shows very unstable losses, while ZeRO-3 shows a smooth loss curve over training.

Do you have any idea how to solve this? Thanks.

tjruwase commented 1 year ago

@superaha, could you please open a separate ticket and share steps to repro the issue with zero2 and zero3? Thanks!

rucnyz commented 9 months ago

Hi @tjruwase, I'm running into the same issue. I will provide a repro here (I have simplified it as much as possible).

Please put these three files in the same directory (remember to rename the first two from .txt to .py and deepspeed_config.txt to deepspeed_config.yaml), then reproduce the result with:

accelerate launch --config_file "deepspeed_config.yaml" train_test.py --model_name "NousResearch/Llama-2-7b-hf" \
--dataset_name "smangrul/code-chat-assistant-v1" --max_seq_len 512 --max_steps 1000 --logging_steps 25 --eval_steps 100 \
--save_steps 500 --bf16 True --packing True --output_dir "full-finetune-llama-chat-asst" --per_device_train_batch_size 1 \
--gradient_accumulation_steps 1 --dataset_text_field "content" --use_gradient_checkpointing --learning_rate 5e-5  \
--lr_scheduler_type "cosine" --weight_decay 0.01 --warmup_ratio 0.03 --use_flash_attn True

train_test.txt utils.txt deepspeed_config.txt

To save you time, you only need to review lines 147 to 149 in the file train_test.py. Currently, the code runs fine, but if you uncomment these three lines, the code will throw an error as follows:

# for param in model.parameters():
#     param.requires_grad = False
# model.get_input_embeddings().requires_grad = True

errors:

Traceback (most recent call last):
  File "/home/yuzhounie/projects/DHS-LLM-Workshop/chat_assistant/training/train_test.py", line 190, in <module>
    main(args)
  File "/home/yuzhounie/projects/DHS-LLM-Workshop/chat_assistant/training/train_test.py", line 184, in main
    trainer.train()
  File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 1555, in train
    return inner_training_loop(
  File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 1689, in _inner_training_loop
    model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
  File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/accelerate/accelerator.py", line 1284, in prepare
    result = self._prepare_deepspeed(*args)
  File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/accelerate/accelerator.py", line 1666, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/__init__.py", line 171, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 304, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1225, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1552, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer_Stage3(
  File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 146, in __init__
    self.dtype = self.optimizer.param_groups[0]['params'][0].dtype
IndexError: list index out of range
olegsinavski commented 8 months ago

Hello, I had the same error while training with PyTorch Lightning. While putting together a simple reproduction, I found the cause and a workaround. :)

Reproduction:

from deepspeed.ops.adam import FusedAdam
from pytorch_lightning import Trainer, LightningModule
from transformers import AutoModelForCausalLM, LlamaConfig
import torch
from torch.utils.data import DataLoader, TensorDataset

class Module(LightningModule):
    def training_step(self, batch: torch.Tensor, batch_idx: int) -> torch.Tensor:
        raise Exception("Shouldn't reach here")

    def configure_model(self) -> None:
        # Build a tiny Llama model (LlamaConfig uses num_hidden_layers /
        # num_attention_heads / hidden_size rather than the GPT-2 style names).
        self.model = AutoModelForCausalLM.from_config(
            LlamaConfig(num_hidden_layers=2, num_attention_heads=6, hidden_size=192))
        self.model.requires_grad_(False)
        list(self.model.parameters())[-1].requires_grad = True  # unfreeze the head

    def configure_optimizers(self):
        optim_groups = [
            {"params": [p for n, p in self.model.named_parameters() if 'norm' not in n], "weight_decay": 0.1},
            {"params": [p for n, p in self.model.named_parameters() if 'norm' in n], "weight_decay": 0.0},
        ]
        return FusedAdam(optim_groups, lr=0.1)

trainer = Trainer(
    precision='bf16',
    strategy='deepspeed_stage_2'
)
trainer.fit(Module(), DataLoader(TensorDataset(torch.arange(0, 100).unsqueeze(1))))

The problem is that the optimizer receives an entire parameter group in which every parameter has requires_grad=False; hence the fix is simply to filter the parameter groups by requires_grad. This doesn't happen if you just pass Optimizer(self.model.parameters()), since the single group then contains one unfrozen parameter.
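
A minimal sketch of that filtering, applied to the configure_optimizers above (this is the workaround described here, not an official DeepSpeed fix):

    def configure_optimizers(self):
        # Keep only trainable parameters so that no group consists entirely
        # of requires_grad=False tensors when ZeRO-2 flattens it.
        trainable = [(n, p) for n, p in self.model.named_parameters() if p.requires_grad]
        optim_groups = [
            {"params": [p for n, p in trainable if 'norm' not in n], "weight_decay": 0.1},
            {"params": [p for n, p in trainable if 'norm' in n], "weight_decay": 0.0},
        ]
        # Drop any group that ended up empty after filtering.
        optim_groups = [g for g in optim_groups if g["params"]]
        return FusedAdam(optim_groups, lr=0.1)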

It's probably a good idea to improve the error messaging in the DeepSpeed code, though.

pharaouk commented 6 months ago

@olegsinavski thanks for sharing, what was the fix exactly?