philschmid / deep-learning-pytorch-huggingface


FLAN-T5 XXL using DeepSpeed fits well for training but gives OOM error for inference. #12

Open irshadbhat opened 1 year ago

irshadbhat commented 1 year ago

Hi,

I trained a flan-t5-xxl model on a custom NER dataset, following the steps from your blog. The training went well without any issues. I also put a print(labels) line in the postprocess_text function to check the predicted labels during evaluation, and the results were pretty good.
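
For reference, the debug print looks roughly like this; the postprocess_text body here is my paraphrase of the helper used during evaluation, so the exact implementation in the training script may differ:

def postprocess_text(preds, labels):
    # strip whitespace from the decoded predictions and references before scoring
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]
    print(labels)  # temporary debug print to eyeball the decoded labels during evaluation
    return preds, labels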

I used 4x A10G 24GB GPUs for training with the ds_flan_t5_z3_offload_bf16.json DeepSpeed config file.

Now I want to run the model for inference, and I used the DeepSpeed inference code below to generate predictions.

import os
import deepspeed
import torch
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

# same ZeRO-3 CPU-offload settings as in the ds_flan_t5_z3_offload_bf16.json training config
zero_optimization = {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": True
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": True
    },
    "overlap_comm": True,
    "contiguous_gradients": True,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": True
  }

# load the fine-tuned checkpoint into a text2text-generation pipeline
generator = pipeline('text2text-generation', model='/mnt/flan_modelling/flan-t5-xxl-ner-ft/checkpoint-26', device=local_rank, max_length=256)

# wrap the pipeline's model with DeepSpeed inference, sharded across world_size GPUs
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype="bf16",
                                           #replace_with_kernel_inject=False,
                                           zero=zero_optimization)
string = generator("Text: kindly give jack 533 from account ending with 4473 if possible\nIdentify the entities from the following options: global.money * global.time-period * global.city * global.cardinal * global.person-name * global.postal-code * global.region * global.language * global.org * global.email-id * global.phone-number * global.temperature", do_sample=False, max_length=50, num_beams=1, min_length=2)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)

I used the same parameters from the config file to initialize the model for inference, but I am getting a CUDA OutOfMemory error. I don't understand how the model fits in memory during training but needs more memory for inference.

I believe I am doing something wrong. Please suggest any changes I need to make so I can use the trained model for inference.
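
For reference, this is roughly what the alternative ZeRO-Inference path looks like via the transformers HfDeepSpeedConfig integration, sketched from the non-Trainer example in the transformers docs; the config values here are assumptions and I haven't verified that it avoids the OOM:

import os
import torch
import deepspeed
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from transformers.integrations import HfDeepSpeedConfig

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

checkpoint = '/mnt/flan_modelling/flan-t5-xxl-ner-ft/checkpoint-26'

ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": 1,
    "train_batch_size": world_size,
}

# HfDeepSpeedConfig must be created (and kept alive) before from_pretrained
# so the weights are loaded directly into ZeRO-3 partitions instead of one GPU
dschf = HfDeepSpeedConfig(ds_config)

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
ds_engine.module.eval()

inputs = tokenizer("Text: kindly give jack 533 from account ending with 4473 if possible", return_tensors="pt").to(f"cuda:{local_rank}")
with torch.no_grad():
    outputs = ds_engine.module.generate(**inputs, max_length=50, num_beams=1, min_length=2)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))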

Thanks

sverneka commented 1 year ago

I tried to run the training code as-is; at the end of each epoch it runs inference on the test set. I found that inference was taking too long, and GPU utilization was maxed out on a p4dn.24x, which has 8x A100 40GB.
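
For context, the cost of that per-epoch evaluation is driven by the generation settings in Seq2SeqTrainingArguments; a minimal sketch, where the argument values are assumptions rather than what the blog actually uses:

from transformers import Seq2SeqTrainingArguments

# evaluation runs generate() over the whole test set, so generation_max_length,
# generation_num_beams, and the eval batch size largely determine how long it takes
training_args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-xxl-ner-ft",                  # hypothetical path
    per_device_eval_batch_size=8,                     # assumed value
    predict_with_generate=True,
    generation_max_length=128,                        # assumed value
    generation_num_beams=1,                           # assumed value
    bf16=True,
    deepspeed="ds_flan_t5_z3_offload_bf16.json",
    evaluation_strategy="epoch",
)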

sadahanu commented 1 year ago

+1 to the above. Inference takes longer than running the actual training.