microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] 'Invalidate trace cache' with Seq2SeqTrainer+predict_with_generate+Zero3 #5662

Open Osterlohe opened 2 weeks ago


Describe the bug
Evaluating a transformers Seq2SeqTrainer with predict_with_generate=True results in 'Invalidate trace cache' warnings. The warnings appear inside Seq2SeqTrainer.prediction_step, twice per prediction step: once at generated_tokens = self.model.generate(**generation_inputs, **gen_kwargs) and once at outputs = model(**inputs). The warning messages:

Invalidate trace cache @ step 0: expected module 2, but got module 0
Invalidate trace cache @ step 1: expected module 464, but got module 2

Call Stack: MyTrainer.evaluate->Trainer.evaluate->Trainer.evaluation_loop->Seq2SeqTrainer.prediction_step->'Invalidate Trace Cache'
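
To locate the exact call sites in the installed transformers version, the source of the method can be printed directly (a small helper snippet, separate from the repro script below):

import inspect
from transformers import Seq2SeqTrainer

# Print the implementation of prediction_step to see the generate() call
# and the loss forward pass where the two warnings appear.
print(inspect.getsource(Seq2SeqTrainer.prediction_step))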

To Reproduce
I built a simple script to reproduce the error. A little bit of background first: Seq2SeqTrainer.prediction_step has a small check at the beginning:

if not self.args.predict_with_generate or prediction_loss_only:
    return super().prediction_step(
        model, inputs, prediction_loss_only=prediction_loss_only, ignore_keys=ignore_keys
    )

This means that (1) predict_with_generate=True needs to be set in the training args, and (2) prediction_loss_only needs to be None or False; otherwise we wouldn't actually predict with generate. prediction_loss_only is automatically set to True by trainer.evaluate when no compute_metrics is provided, which is why compute_metrics is included in the script. Note: we could also subclass Seq2SeqTrainer and pass prediction_loss_only=False on to the superclass for testing purposes, as in the sketch below.
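
A minimal sketch of that subclassing variant (hypothetical and for testing only; MyTrainer is my own name, matching the class in the call stack above):

from transformers import Seq2SeqTrainer

class MyTrainer(Seq2SeqTrainer):
    # For testing only: force generation during evaluation even when
    # compute_metrics is not provided (in which case trainer.evaluate
    # would otherwise set prediction_loss_only=True).
    def prediction_step(self, model, inputs, prediction_loss_only=None, ignore_keys=None, **gen_kwargs):
        return super().prediction_step(
            model, inputs, prediction_loss_only=False, ignore_keys=ignore_keys, **gen_kwargs
        )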

I run the script like this:

deepspeed --include localhost:1 main_trainer_simple.py

main_trainer_simple.py:

import numpy as np
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer
from datasets import load_dataset, load_metric

def main():
    # Defining training arguments
    training_args = Seq2SeqTrainingArguments(
        output_dir="/modelSave",
        per_device_train_batch_size=1,
        per_device_eval_batch_size=1,
        predict_with_generate=True,
        bf16=True,
        fp16=False,
        do_train=False,
        do_eval=True,
        logging_dir='/logging',
        learning_rate=3e-05,
        weight_decay=0.01,
        deepspeed="ds_stage3_simple.json",
        generation_max_length=128,
        generation_num_beams=1
    )
    # Defining the model:
    model = "bigscience/T0_3B"
    #model = "facebook/bart-large-cnn"

    # Initialize the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model)

    #Loading a dataset and creating an eval_dataset
    dataset = load_dataset("cnn_dailymail", "3.0.0")
    def preprocess_function(examples):
        inputs = ["summarize: " + doc for doc in examples["article"]]
        targets = examples["highlights"]
        model_inputs = tokenizer(inputs, max_length=512, truncation=True, padding="max_length")
        labels = tokenizer(targets, max_length=128, truncation=True, padding="max_length").input_ids
        model_inputs["labels"] = labels
        return model_inputs
    dataset = dataset.map(preprocess_function, batched=True)
    dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
    eval_dataset = dataset["validation"].select(range(10))

    # compute_metrics generated by ChatGPT. They will be called after the warnings.
    def compute_metrics(eval_preds):
        preds, labels = eval_preds
        if isinstance(preds, tuple):
            preds = preds[0]

        # Ensure preds are within the valid range of token IDs
        # Remove any values that are -100 or less
        preds = np.where(preds < 0, tokenizer.pad_token_id, preds)
        preds = preds.clip(0, tokenizer.vocab_size - 1)

        # Decode predictions and labels
        decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

        # Ensure labels have valid token IDs
        labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
        decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

        # Compute ROUGE scores
        rouge = load_metric("rouge")
        result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

        # Extract the individual ROUGE scores
        result = {key: value.mid.fmeasure * 100 for key, value in result.items()}

        # Add mean generated length
        prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
        result["gen_len"] = np.mean(prediction_lens)

        return result

    # Initializing model and trainer
    model = AutoModelForSeq2SeqLM.from_pretrained(model)
    trainer = Seq2SeqTrainer(
        model=model,
        args=training_args,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics
    )

    # Evaluating eval_dataset
    results = trainer.evaluate()
    print("Printing results:")
    print(results)

if __name__ == "__main__":
    main()

ds_stage3_simple.json:

{
    "bf16":{
        "enabled":"auto"
    },
    "fp16": {
        "enabled": "auto"
    },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}

Expected behavior
I expected no warnings. The problem also slows down execution; it is currently faster not to use DeepSpeed at all.
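
To make the slowdown measurable, the evaluation call in the script above can be wrapped in a simple timer (a minimal sketch; run once with and once without the deepspeed entry in the training args and compare):

import time

start = time.perf_counter()
results = trainer.evaluate()
elapsed = time.perf_counter() - start
print(f"Evaluated {len(eval_dataset)} samples in {elapsed:.1f}s")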

ds_report output

[2024-06-14 16:46:59,976] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] please install triton==1.0.0 if you want to use sparse attention

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meets the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fp_quantizer ........... [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/XXX/miniconda3/envs/seq2seqnew2/lib/python3.12/site-packages/torch']
torch version .................... 2.3.1+cu121
deepspeed install path ........... ['/XXX/miniconda3/envs/seq2seqnew2/lib/python3.12/site-packages/deepspeed']
deepspeed info ................... 0.14.3, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.5
deepspeed wheel compiled w. ...... torch 2.3, cuda 12.5
shared memory (/dev/shm) size .... 503.86 GB

Screenshots
invalidateTraceCache (screenshot of the repeated 'Invalidate trace cache' warnings)

System info (please complete the following information):

OS: Ubuntu 22.04.4 LTS
GPU: 3x RTX A6000 (no difference between single and multi-GPU)
Python version: 3.12.2 | packaged by conda-forge
Transformers version: 4.41.2
Datasets version: 2.20.0
Numpy version: 1.26.4
DeepSpeed version: 0.14.3
Torch version: 2.3.1+cu121

All packages are installed within a conda env.

Additional context
I also read about DeepSpeed-FastGen/MII, but it does not support T5 (the model I am currently using) yet.