microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

Issues with running inference on quantized model #1454

Open HamidShojanazeri opened 2 years ago

HamidShojanazeri commented 2 years ago

I have a question on MoQ usage and would appreciate the help. I ran the example that trains a bert-base model with MoQ, and I am now trying to load the checkpoint back for inference using the code snippet below.

import copy

import torch
import deepspeed
import deepspeed.module_inject as module_inject
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers.models.bert.modeling_bert import BertLayer

device = 'cuda'

# Load the MoQ-trained checkpoint and the matching tokenizer.
model = AutoModelForSequenceClassification.from_pretrained(
    '../DeepSpeedExamples/MoQ/output-8bits/qnli/checkpoint-9000/')
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
input_ids = inputs["input_ids"].to(device)
attention_mask = inputs["attention_mask"].to(device)

# Keep an unmodified copy of the model around for comparison.
test_model = copy.deepcopy(model).to(device)

# Initialize the DeepSpeed inference engine with int8 and the HF BERT layer policy.
injection_policy = {BertLayer: module_inject.HFBertLayerPolicy}
ds_engine = deepspeed.init_inference(model,
                                     mp_size=1,
                                     dtype=torch.int8,
                                     replace_method='auto',
                                     quantization_setting=8,
                                     injection_policy=injection_policy)
model = ds_engine.module
output = model(input_ids, attention_mask=attention_mask)

This runs into an error complaining about a missing input_mask argument; the log shows:

TypeError: compute_attention() missing 1 required positional argument: 'input_mask'

I was wondering whether the inference-engine setup is missing something, and what the best practice is for running inference on a MoQ-quantized model. Thanks.

--- Update: it looks like a bug that compute_attention(qkv_out) is called without input_mask; after adding it, I got past the error.
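For reference, the local workaround looks roughly like this (a sketch only; the exact call site and return values depend on the DeepSpeed version, the names come from the error message above, and attn_out is just a placeholder):

# Inside DeepSpeed's transformer-inference forward path:
# before (raises the TypeError above):
#     attn_out = compute_attention(qkv_out)
# after (workaround that gets past the error):
#     attn_out = compute_attention(qkv_out, input_mask)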

However, I am not seeing any speedup compared to running a bert-base model through the inference engine; latency actually degraded (6.62 ms quantized vs. 5.18 ms bert-base). I wonder whether I should have injected quantized modules somehow. I appreciate your help, and it seems an e2e tutorial for MoQ + inference would be very helpful for the community.
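For context, a simple timing loop along these lines (a sketch; the warmup and iteration counts are arbitrary, and it reuses input_ids, attention_mask, test_model, and ds_engine from the snippet above) can be used to compare the two:

import time
import torch

def time_forward(m, n_warmup=10, n_iters=100):
    # Warm up so one-time CUDA/kernel setup is not included in the timing.
    with torch.no_grad():
        for _ in range(n_warmup):
            m(input_ids, attention_mask=attention_mask)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(n_iters):
            m(input_ids, attention_mask=attention_mask)
        torch.cuda.synchronize()
    return (time.time() - start) / n_iters * 1000  # ms per forward pass

print('quantized engine:', time_forward(ds_engine.module), 'ms')
print('bert-base copy  :', time_forward(test_model), 'ms')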

@RezaYazdaniAminabadi


RezaYazdaniAminabadi commented 2 years ago

Hi Hamid,

Thanks for trying the quantization technique and reporting these issues. Yes, you are right about the bug with the input_mask; I will send a PR to fix that. Regarding the performance: as you say, there is no improvement yet, because the kernels we have released so far do not fuse the dequantization and the GeMMs during inference. We have customized kernels for this, and we will release a binary to help improve the performance. Thanks for mentioning the need for an e2e tutorial; I will work on adding it. Best, Reza
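Edit: to make the fusion point concrete, here is a schematic illustration (my sketch, not the actual DeepSpeed kernels) of the unfused int8 path: the weight is dequantized to fp16 and the GeMM still runs in fp16, so there is nothing to gain over the fp16 baseline.

import torch

# Schematic unfused int8 path (illustration only, not DeepSpeed kernel code).
w_int8 = torch.randint(-128, 127, (1024, 1024), dtype=torch.int8, device='cuda')
scale = torch.full((1024, 1), 0.02, dtype=torch.float16, device='cuda')  # per-row scales
x = torch.randn(16, 1024, dtype=torch.float16, device='cuda')

w_fp16 = w_int8.to(torch.float16) * scale  # explicit dequantization (extra work)
y = x @ w_fp16.t()                         # the GeMM itself still runs in fp16

# A fused kernel dequantizes inside the GeMM instead, avoiding the extra pass
# over the weights and the temporary fp16 copy in memory.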

gsujankumar commented 2 years ago

Hey @RezaYazdaniAminabadi, where can I find the customized kernels to improve the performance?