microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

AssertionError: AutoTP not supported for model. Please use kernel injection since container policy for model exists. #3174

Open suri-kunal opened 1 year ago

suri-kunal commented 1 year ago

Describe the bug I am trying to fine-tune a Transformers EncoderDecoder model on a T4, using Longformer as the encoder and GPT2 as the decoder. I am able to train the model successfully, but during inference I get the following error -

AssertionError: AutoTP not supported for model. Please use kernel injection since container policy for model exists.

This is happening because I have set replace_with_kernel_inject=False in my init_inference call. If I set replace_with_kernel_inject=True, I get the error from #1301 instead, which might be because I am running my code on a T4.
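For reference, a minimal sketch of the two configurations in question (model and world_size as in the script below):

# AutoTP path: trips the AssertionError above, because a kernel-injection
# container policy exists for this model family.
model_engine_inference = deepspeed.init_inference(model,
                                                  mp_size=world_size,
                                                  dtype=torch.float,
                                                  replace_with_kernel_inject=False)

# Kernel-injection path: avoids the assertion, but on a T4 it runs into the
# error described in #1301 instead.
model_engine_inference = deepspeed.init_inference(model,
                                                  mp_size=world_size,
                                                  dtype=torch.float,
                                                  replace_with_kernel_inject=True)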

To Reproduce Steps to reproduce the behavior:

  1. Simple inference script to reproduce. Training loop -

def training_loop(model_name, model,
                  model_parameters,
                  ds_config,
                  train_dl, valid_dl):
    # Track the best validation loss across epochs
    best_loss = np.inf
    best_model = None
    best_epoch = 0
    for epoch in tqdm(range(code_config.TASKA_SUMMARY_EPOCHS)):
        avg_train_loss, model = train_summarization(model,
                                                    model_parameters,
                                                    ds_config,
                                                    train_dl,
                                                    epoch)
        if model.training is False:
            raise Exception("Model has to be trainable")
        new_loss = validate_summarization(model, valid_dl, epoch)
        wandb.log({"Epoch/Training Loss": avg_train_loss,
                   "Epoch/Validation Loss": new_loss,
                   "Epoch/Epoch": epoch})
        if new_loss < best_loss:
            if model is None:
                raise Exception("Best Model cannot be none")
            best_loss = new_loss
    wandb.finish()

validate_summarization -

def validate_summarization(model, valid_dl, epoch=0):
    seed_everything(code_config.TASKA_SUMMARY_SEED)

    if model is None:
        raise Exception("Model cannot be None")

    world_size = int(os.getenv('WORLD_SIZE', '4'))

    model_engine_inference = deepspeed.init_inference(model,
                                                      mp_size=world_size,
                                                      dtype=torch.float,
                                                      replace_with_kernel_inject=False)

    model_engine_inference.eval()

    if model_engine_inference.training is True:
        raise Exception("Model should not be trainable")

    total_loss = 0
    for valid_step, valid_batch in enumerate(valid_dl):

        input_ids = valid_batch["input_ids"].to(device)
        attention_mask = valid_batch["attention_mask"].to(device)
        labels = valid_batch["labels"].to(device)
        decoder_input_ids = valid_batch["decoder_input_ids"].to(device)

        with torch.no_grad():
            output = model_engine_inference(input_ids=input_ids,
                                            attention_mask=attention_mask,
                                            decoder_input_ids=decoder_input_ids,
                                            labels=labels,
                                            use_cache=False,
                                            return_dict=True)
            loss = output.loss
            total_loss += loss.item()

            wandb.log({'Batch/Validation Loss': loss.item(),
                       'Batch/Validation Step': valid_step + epoch * len(valid_dl)})

    avg_loss = total_loss / len(valid_dl)

    return avg_loss
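Note that when the script is launched without the deepspeed/torchrun launcher, WORLD_SIZE is unset and the fallback above requests 4-way tensor parallelism even on a single GPU. On a single T4, a fallback of 1 would be the safer default (a suggested tweak, assuming single-GPU runs; not what the script above does):

world_size = int(os.getenv('WORLD_SIZE', '1'))  # 1 rank on a single T4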

Stacktrace -

[2023-04-09 19:46:52,126] [INFO] [logging.py:93:log_dist] [Rank 0] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
  0%|          | 0/2 [05:35<?, ?it/s]
Traceback (most recent call last):
  File "Task A - Summarization - EncoderDecoder.py", line 550, in <module>
    training_loop(model_name, model, \
  File "Task A - Summarization - EncoderDecoder.py", line 482, in training_loop
    validate_summarization(model,valid_dl,epoch)
  File "Task A - Summarization - EncoderDecoder.py", line 287, in validate_summarization
    model_engine_inference = deepspeed.init_inference(model,
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py", line 311, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/inference/engine.py", line 139, in __init__
    parser_dict = AutoTP.tp_parser(model)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/module_inject/auto_tp.py", line 99, in tp_parser
    assert AutoTP.supported(model), "AutoTP not supported for model. Please use kernel injection since container policy for model exists." \
AssertionError: AutoTP not supported for model. Please use kernel injection since container policy for model exists.
  2. Latest versions of DeepSpeed and Huggingface
  3. How to run the script - NA
  4. ...
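For context, the assertion in the trace comes from AutoTP.tp_parser / AutoTP.supported in deepspeed/module_inject/auto_tp.py. A rough paraphrase of the check (not the exact DeepSpeed source; the family list here is an assumption inferred from this error):

def autotp_supported(model) -> bool:
    # Hypothetical stand-in for AutoTP.supported: reject models whose repr
    # mentions a family that already has a kernel-injection container policy.
    policy_families = ('gpt2', 'longformer')  # assumed list, inferred from this error
    repr_str = str(model).lower()  # the module repr names all submodule classes
    return not any(family in repr_str for family in policy_families)

Because the EncoderDecoder model wraps a Longformer encoder and a GPT2 decoder, both of which evidently have container policies, AutoTP refuses the model and asks for kernel injection instead.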

Expected behavior How do I get rid of this error? My target environment is a K80, so resolving it is extremely important to me.

ds_report output Please run ds_report to give us details about your setup.

Screenshots If applicable, add screenshots to help explain your problem.

System info (please complete the following information):

Docker context Dockerfile -

FROM nvcr.io/nvidia/pytorch:23.03-py3

WORKDIR /workspace

COPY requirements.txt .

RUN pip install -r requirements.txt

requirements.txt -

jupyter
protobuf<3.21.0a0,>=3.20.1
tornado<6.2,>=6.0.3
jupyterlab_widgets
ipywidgets
plotly
nltk
transformers
datasets
rouge-score
bert-score
evaluate
gputil
wandb
iterative-stratification
tensorflow
tf-slim
git+https://github.com/google-research/bleurt.git
captum
sentence-transformers
deepdiff
setfit
optuna
torch-lr-finder
openai
fire
tenacity
accelerate
loralib
deepspeed
peft

Additional context Add any other context about the problem here.

satpalsr commented 1 year ago

What if you try a custom injection policy?

For example, for GPT-NeoX it would look like:

from transformers import GPTNeoXLayer

pipe.model = deepspeed.init_inference(
    pipe.model,
    dtype=dtype,
    mp_size=args.world_size,
    replace_with_kernel_inject=False,
    enable_cuda_graph=args.graphs,
    injection_policy={GPTNeoXLayer: ('attention.dense', 'mlp.dense_4h_to_h')}
)
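Adapting the same idea to the Longformer/GPT2 EncoderDecoder model above might look like the sketch below. The layer classes are real transformers classes, but the module names in the tuples (the linear layers whose outputs DeepSpeed should all-reduce) are assumptions; verify them against print(model) for the actual checkpoint:

from transformers.models.longformer.modeling_longformer import LongformerLayer
from transformers.models.gpt2.modeling_gpt2 import GPT2Block

model_engine_inference = deepspeed.init_inference(
    model,
    mp_size=world_size,
    dtype=torch.float,
    replace_with_kernel_inject=False,
    injection_policy={
        # Module names below are assumptions for illustration only.
        LongformerLayer: ('attention.output.dense', 'output.dense'),
        GPT2Block: ('attn.c_proj', 'mlp.c_proj'),
    }
)

Supplying an explicit injection_policy should skip the AutoTP auto-detection path, so the AssertionError above would not fire.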