microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] Zero3 Checkpointing doesn't include HF T5's token embeddings #1893

Open m3rlin45 opened 2 years ago

m3rlin45 commented 2 years ago

Describe the bug

When HuggingFace T5 models are checkpointed under ZeRO stage 3, the embed_tokens modules in both the encoder and the decoder are missing from the checkpoint.

This is also captured in https://github.com/PyTorchLightning/pytorch-lightning/issues/10964

Expected behavior

I expect all model parameters to be included in checkpoints.

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... /*removed*/
torch version .................... 1.11.0+cu113
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.1
deepspeed install path ........... /* removed *//deepspeed']
deepspeed info ................... 0.6.1, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.11, cuda 11.3, hip 0.0

System info (please complete the following information)

Single DGX-1 (V100) with Ubuntu 20.04.2 LTS, Python 3.7

Launcher context

Launching with accelerate

To Reproduce

I have a simple script that reproduces this issue on a single DGX-1 node. I'm using HF accelerate to run it on all 8 GPUs, but any similar launcher should work.

from deepspeed.ops.adam import FusedAdam
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
)
from transformers.deepspeed import HfDeepSpeedConfig
from accelerate import Accelerator
from torch.utils.data import DataLoader
import torch

from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint, get_fp32_state_dict_from_zero_checkpoint

MODEL_ID = "google/t5-large-lm-adapt"
BATCH_SIZE = 1
WEIGHT_DECAY = 0.01
OUTPUT_DIR = "/tmp/checkpoint_tests_dump_dir/2/"

def main():
    accelerator = Accelerator()
    backup_model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

    ds_config = accelerator.state.deepspeed_plugin.deepspeed_config
    ds_config["train_batch_size"] = accelerator.num_processes * BATCH_SIZE
    hf_deepspeed_config = HfDeepSpeedConfig(ds_config)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)
    model.gradient_checkpointing_enable()

    no_decay = ["bias", "LayerNorm.weight"]
    optimizer_grouped_parameters = [
        {
            "params": [
                p
                for n, p in model.named_parameters()
                if not any(nd in n for nd in no_decay)
            ],
            "weight_decay": WEIGHT_DECAY,
        },
        {
            "params": [
                p
                for n, p in model.named_parameters()
                if any(nd in n for nd in no_decay)
            ],
            "weight_decay": 0.0,
        },
    ]

    optimizer = FusedAdam(optimizer_grouped_parameters, lr=1e-4)

    dataset = ["hi"]*16
    dataloader = DataLoader(dataset, batch_size=BATCH_SIZE)

    (model, optimizer, dataloader) = accelerator.prepare(model, optimizer, dataloader)

    model.train()
    model.eval()
    accelerator.wait_for_everyone()
    accelerator.deepspeed_engine.save_checkpoint(OUTPUT_DIR)
    accelerator.wait_for_everyone()

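    loaded_model = backup_model
    state_dict = get_fp32_state_dict_from_zero_checkpoint(OUTPUT_DIR)

    # `is_same` was undefined in the original script. Assumption: it means
    # "the checkpoint contains every key the reference model expects".
    missing_keys = sorted(set(loaded_model.state_dict()) - set(state_dict))
    is_same = not missing_keys
    accelerator.print(f"They are the same: {is_same}")
    accelerator.print(f"Keys missing from checkpoint: {missing_keys}")

    # Strict loading raises if any expected key is missing (the reported bug).
    loaded_model.load_state_dict(state_dict)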

if __name__ == "__main__":
    main()

m3rlin45 commented 2 years ago

I've been doing some debugging on my own. The proximate cause seems to be that HuggingFace's T5 implementation does not include embed_tokens in the return values of module.named_parameters(), which means that DeepSpeed is oblivious to the embeddings.

I'm not sure why this works at all in that case, with part of the model not being sharded. Maybe the problem only shows up at save time?
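
For anyone who wants to see the named_parameters() behaviour in isolation, here is a minimal sketch in plain PyTorch. TinyTiedModel is a hypothetical toy, not the real T5 code; it only assumes that the embedding module is shared between a top-level attribute and the encoder/decoder, which loosely mirrors how T5 ties its embeddings. PyTorch's named_parameters() skips duplicate parameter objects by default, while state_dict() emits every name, so the aliases appear only in the latter.

```python
import torch.nn as nn


class TinyTiedModel(nn.Module):
    """Hypothetical toy model that ties one embedding into two submodules."""

    def __init__(self, vocab_size=8, hidden=4):
        super().__init__()
        self.shared = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.Module()
        self.decoder = nn.Module()
        # Assign the same module object, so all three names alias one weight.
        self.encoder.embed_tokens = self.shared
        self.decoder.embed_tokens = self.shared


model = TinyTiedModel()

# named_parameters() deduplicates: each parameter object is yielded once.
param_names = [name for name, _ in model.named_parameters()]

# state_dict() does not deduplicate: every alias gets its own key.
state_keys = list(model.state_dict().keys())

# Names a consumer of named_parameters() would never see.
missing = sorted(set(state_keys) - set(param_names))

print("named_parameters:", param_names)
print("only in state_dict:", missing)
```

If the real model behaves like this toy, anything that builds a checkpoint from named_parameters() would write only the first name of each tied weight, and a strict load_state_dict() would then report the aliases as missing.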

tjruwase commented 2 years ago

@m3rlin45, thanks for sharing this issue and your analysis. A couple of thoughts.

  1. If embed_tokens are shared parameters then it is possible that the parameters are still optimized and sharded, see #1896.
  2. If embed_tokens are not trainable then they should not be shared anyway.

Do any of the above apply in this case? Thanks!
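
One possible way to check both points is sketched below. It is only an illustration: tied_parameter_groups is a hypothetical helper, not part of DeepSpeed or transformers, and the toy model stands in for T5. It assumes it runs on the plain model before DeepSpeed wraps it, and it groups state_dict names by parameter identity so that shared weights and their requires_grad flags show up together.

```python
from collections import defaultdict

import torch.nn as nn


def tied_parameter_groups(model):
    """Return groups of (name, requires_grad) pairs that alias one parameter.

    Hypothetical helper: keep_vars=True makes state_dict() return the
    parameter objects themselves, so tied names share the same id().
    """
    groups = defaultdict(list)
    for name, tensor in model.state_dict(keep_vars=True).items():
        groups[id(tensor)].append((name, tensor.requires_grad))
    # Only groups with more than one name are actually shared.
    return [group for group in groups.values() if len(group) > 1]


# Toy stand-in for a model with a tied embedding (not the real T5).
model = nn.Module()
model.shared = nn.Embedding(8, 4)
model.encoder = nn.Module()
model.encoder.embed_tokens = model.shared
model.head = nn.Linear(4, 8)

groups = tied_parameter_groups(model)
for group in groups:
    print(group)
```

A non-empty result with requires_grad set to True would correspond to case 1 (shared and trainable); an empty result would suggest the embeddings are not shared at all.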