philschmid / deep-learning-pytorch-huggingface


Sample inference script for FLAN-T5 XXL using DeepSpeed & Hugging Face. #11

Closed irshadbhat closed 1 year ago

irshadbhat commented 1 year ago

Hi there,

I trained a flan-t5-xxl model following the steps from your blog. The training went well without any issues.

I ran inference using a script written for a normal Hugging Face seq2seq model, launching it with deepspeed as:

deepspeed --num_gpus=4 test_flan_t5_xxl.py --model_id /mnt/flan_modelling/flan-t5-xxl-ner-ft/checkpoint-26/

The output of the model is very random. I believe I am doing something wrong; maybe I need to pass the config file for inference as well.

Could you please provide a sample inference script? My apologies if this is quite trivial; I am fairly new to DeepSpeed.

Looking forward to your response.

Best, Irshad

philschmid commented 1 year ago

What do you mean by the output being really random?

irshadbhat commented 1 year ago

Oh sorry, I meant the output of the model was random, not what was expected at all.

irshadbhat commented 1 year ago

I used the below code for inference:

import sys
import time
import torch
import argparse
import numpy as np
import pandas as pd

import deepspeed
import evaluate
import datasets
from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import DataCollatorForSeq2Seq
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

def parse_arge():
    """Parse the arguments."""
    parser = argparse.ArgumentParser()
    # add model id and dataset path argument
    parser.add_argument("--model_id", type=str, default="/mnt/flan-t5-xxl", help="Model id to use for training.")
    parser.add_argument("--per_device_train_batch_size", type=int, default=8, help="Batch size to use for training.")
    parser.add_argument("--per_device_eval_batch_size", type=int, default=8, help="Batch size to use for testing.")
    parser.add_argument("--generation_max_length", type=int, default=140, help="Maximum length to use for generation")
    parser.add_argument("--generation_num_beams", type=int, default=1, help="Number of beams to use for generation.")
    parser.add_argument("--deepspeed", type=str, default=None, help="Path to deepspeed config file.")
    args = parser.parse_known_args()
    return args

def inference_function(args):
    # Load tokenizer and model with plain Transformers; DeepSpeed is imported above but never used here.
    tokenizer = AutoTokenizer.from_pretrained(args.model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(args.model_id)

    while True:
        # Read a prompt from stdin, tokenize it, and time a single greedy generation.
        text = input('Input text:\n')
        t1 = time.time()
        batch = tokenizer.prepare_seq2seq_batch(src_texts=[text], max_length=256, truncation=True, return_tensors="pt")
        output = model.generate(batch["input_ids"], max_length=128, min_length=2, early_stopping=True, num_beams=1)  # temperature=0.8, top_p=0.75, top_k=10, num_beams=5
        print([tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in output])
        print(time.time() - t1)

def main():
    args, _ = parse_arge()
    inference_function(args)

if __name__ == "__main__":
    main()
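
For reference, the script above imports deepspeed but never calls it, so the deepspeed launcher simply starts four independent copies of a plain PyTorch process. A minimal sketch of handing the loaded model to deepspeed.init_inference instead, using the checkpoint path and world size from the command above (an untested illustration, not a verified fix):

import os
import torch
import deepspeed
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# LOCAL_RANK and WORLD_SIZE are set by the deepspeed launcher.
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "4"))

# Checkpoint path taken from the launch command above.
model_id = "/mnt/flan_modelling/flan-t5-xxl-ner-ft/checkpoint-26/"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Let DeepSpeed place (and, where supported, shard) the model across the launched ranks.
model = deepspeed.init_inference(model, mp_size=world_size, dtype=torch.bfloat16)

inputs = tokenizer("Input text goes here.", return_tensors="pt").to(f"cuda:{local_rank}")
outputs = model.generate(inputs["input_ids"], max_length=128, num_beams=1)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))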

philschmid commented 1 year ago

What's your model id and output?

irshadbhat commented 1 year ago

--model_id /mnt/flan_modelling/flan-t5-xxl-ner-ft/checkpoint-26/

output: ['oa*lapcte ole gbo llonm l.it']

irshadbhat commented 1 year ago

I looked into the DeepSpeed inference docs on deepspeed.ai and found the end-to-end inference code for GPT-Neo.

I updated the code to work for flan-t5-xxl as below:

import os
import deepspeed
import torch
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '4'))

# Build a text2text-generation pipeline on this rank's GPU.
generator = pipeline('text2text-generation', model='/mnt/flan_modelling/flan-t5-xxl-ner-ft/checkpoint-26/',
                     device=local_rank)

# Wrap the pipeline's model with the DeepSpeed inference engine.
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.bfloat16,
                                           replace_with_kernel_inject=True)

string = generator("DeepSpeed is", do_sample=True, min_length=2)
# Only print from rank 0 to avoid duplicate output across processes.
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)

I used the ds_flan_t5_z3_offload_bf16.json config file for training, so I guess I have to use dtype=torch.bfloat16.

But now I am getting CUDA OOM.

Please suggest any changes I need to make so I can use the trained model for inference.
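
One hedged guess about the OOM: pipeline(..., device=local_rank) materializes the full fp32 checkpoint on each GPU before deepspeed.init_inference can cast or shard it, which for an ~11B-parameter model is roughly 44 GB of weights per device. A sketch that instead loads the weights on CPU in bf16 and lets DeepSpeed move them (untested; the generator.device reassignment is an assumption, not a documented API):

import os
import torch
import deepspeed
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '4'))

# Load the checkpoint on CPU in bf16 instead of putting fp32 weights on one GPU.
generator = pipeline('text2text-generation',
                     model='/mnt/flan_modelling/flan-t5-xxl-ner-ft/checkpoint-26/',
                     torch_dtype=torch.bfloat16)

# DeepSpeed moves (and, where supported, shards) the model onto the GPUs.
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.bfloat16,
                                           replace_with_kernel_inject=True)

# Assumption: point the pipeline's tensor placement at this rank's GPU,
# since the model now lives there rather than on the CPU it was loaded on.
generator.device = torch.device(f'cuda:{local_rank}')

string = generator("DeepSpeed is", do_sample=True, min_length=2)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)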

irshadbhat commented 1 year ago

I have raised a separate issue #12 with more detail. Please feel free to delete this issue.