huggingface / optimum

🚀 Accelerate training and inference of 🤗 Transformers and 🤗 Diffusers with easy-to-use hardware optimization tools
https://huggingface.co/docs/optimum/main/
Apache License 2.0

ORT whisper on CUDAExecutionProvider is slower than PyTorch #869

Open · yufenglee opened this issue 1 year ago

yufenglee commented 1 year ago

System Info

optimum 1.7.1

Who can help?

@lewtun, @michaelbenayoun

Information

Tasks

Reproduction

The exported Whisper ONNX decoder model has the encoder.value tensors as outputs. These encoder.value tensors are actually constant during the decoding stage, and the copies made with Identity nodes are very heavy, making performance much worse.

import onnxruntime
from transformers import WhisperProcessor, pipeline
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

whisper_model_name = "openai/whisper-medium"  # placeholder; any Whisper checkpoint reproduces this
processor = WhisperProcessor.from_pretrained(whisper_model_name)

session_options = onnxruntime.SessionOptions()
session_options.enable_profiling = True
model_ort = ORTModelForSpeechSeq2Seq.from_pretrained(
    whisper_model_name,
    from_transformers=True,
    use_io_binding=True,
    session_options=session_options,
)
generator_ort = pipeline(
    task="automatic-speech-recognition",
    model=model_ort,
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    # batch_size=2,
    # device=0,
    chunk_length_s=30,
    stride_length_s=(1, 1),  # required together with chunk_length_s
    generate_kwargs={"max_new_tokens": 1024},
)

[image]

Expected behavior

Do not output the constant encoder.value tensors in the decoder model.
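
One quick way to confirm this (a minimal sketch, not part of the original report; the export directory is hypothetical and the exact output names may differ across optimum versions) is to load the exported decoder-with-past graph and list its outputs:

import onnx

# Load only the graph structure, not the external weight data.
decoder = onnx.load("whisper_onnx/decoder_with_past_model.onnx", load_external_data=False)
print([output.name for output in decoder.graph.output])
# Before the fix, the list includes present.*.encoder.key / present.*.encoder.value
# entries that are just Identity copies of the constant cross-attention key/values.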

fxmarty commented 1 year ago

Thank you for the hint @yufenglee, will fix shortly!

xenova commented 1 year ago

I think this might be connected to an issue I am currently having with onnxruntime-web: https://discuss.huggingface.co/t/when-exporting-seq2seq-models-with-onnx-why-do-we-need-both-decoder-with-past-model-onnx-and-decoder-model-onnx/33354

For some reason, if I pass an empty tensor of the correct shape (e.g., [batch, heads, 0, dim]) - which is completely valid for the PyTorch implementation - the ONNX model returns an empty tensor for the present key values (whereas the PyTorch implementation produces the correct output).

To preemptively answer the question of why I would pass empty tensors to the decoder: it would let me bypass decoder_model.onnx entirely, meaning one would not need to export both decoder_model.onnx and decoder_with_past_model.onnx (which would make my Transformers.js library much more efficient!)
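
For reference, a minimal sketch of the zero-length past key/values described above, checked against the eager PyTorch model (the checkpoint, the legacy tuple cache format, and the dummy inputs are assumptions, not from the comment):

import torch
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny").eval()
cfg = model.config
heads = cfg.decoder_attention_heads
head_dim = cfg.d_model // heads

# One (self_key, self_value, cross_key, cross_value) tuple per decoder layer,
# all with a zero-length sequence axis: shape (batch, heads, 0, head_dim).
empty = torch.zeros(1, heads, 0, head_dim)
empty_past = tuple((empty, empty, empty, empty) for _ in range(cfg.decoder_layers))

features = torch.zeros(1, cfg.num_mel_bins, 3000)            # dummy log-mel features
decoder_input_ids = torch.tensor([[cfg.decoder_start_token_id]])

with torch.no_grad():
    out = model(input_features=features, decoder_input_ids=decoder_input_ids,
                past_key_values=empty_past, use_cache=True)

# Eager PyTorch returns populated present key/values here, whereas the exported
# decoder_with_past_model.onnx returns empty tensors, as described above.
print(out.past_key_values[0][0].shape)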

yufenglee commented 1 year ago

> Thank you for the hint @yufenglee, will fix shortly!

Thanks!

hannan72 commented 1 year ago

I measured the inference time of the ONNX whisper-medium from HF (converted using Optimum) for a 10-second audio clip on an A100 GPU:

Breakdown of inference time for each of the ONNX models:

--> in total it takes ~2250 ms for a 10-second audio clip; however, this is much slower than the PyTorch deployment

Breakdown of inference time for each part of the PyTorch model:

--> in total it takes ~390 ms for a 10-second audio clip

I'm looking forward to any updates in Optimum.

fxmarty commented 1 year ago

Hi @hannan72, thank you for taking a detailed look. If you are able to share your script here, that would be helpful. Thank you!

hannan72 commented 1 year ago

> Hi @hannan72, thank you for taking a detailed look. If you are able to share your script here, that would be helpful. Thank you!

Hi, I've just put some time-printing statements inside the transformers and optimum source code:

for encoder inference: https://github.com/huggingface/transformers/blob/b90fbc7e0ba41dfd6b343e7e2274443f19087f36/src/transformers/generation/utils.py#L1249-L1252

for decoder inference: https://github.com/huggingface/optimum/blob/bde5115d20bc43285827dcaadecc9fc717f2973d/optimum/onnxruntime/modeling_seq2seq.py#L1000-L1004

for decoder_with_past_model inference: https://github.com/huggingface/optimum/blob/bde5115d20bc43285827dcaadecc9fc717f2973d/optimum/onnxruntime/modeling_seq2seq.py#L1006-L1011

hannan72 commented 1 year ago

ONNX Deployment code:

import onnxruntime as rt
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import WhisperConfig, WhisperProcessor

sess_options = rt.SessionOptions()
sess_options.enable_profiling = True
path = 'openai/whisper-medium'
model = ORTModelForSpeechSeq2Seq.from_pretrained(
        path,
        use_io_binding=True,
        export=True,
        session_options=sess_options,
        provider="CUDAExecutionProvider",
    ).to('cuda')
config = WhisperConfig.from_pretrained(path)
processor = WhisperProcessor.from_pretrained(path)

forced_decoder_ids = processor.get_decoder_prompt_ids(language="de", task="transcribe")
# data_waveform holds the raw audio samples, loaded elsewhere
input_features = processor(data_waveform[0], padding="max_length", return_tensors="pt").input_features.cuda()
whisper_deploy = model.generate(inputs=input_features, max_new_tokens=70, forced_decoder_ids=forced_decoder_ids)
transcription = processor.batch_decode(whisper_deploy, skip_special_tokens=True)
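
(Since enable_profiling=True is set above, the ONNX Runtime profile can also give a per-node breakdown without editing any library source. A sketch, assuming the ORT model parts expose their InferenceSession as a .session attribute, as the modeling_seq2seq code linked in the previous comment suggests:)

import json

# End profiling after generate() has run; this writes a Chrome-trace JSON file.
profile_path = model.decoder_with_past.session.end_profiling()

with open(profile_path) as f:
    events = json.load(f)

# Sum kernel time per operator type to spot heavy Identity / Memcpy nodes.
per_op = {}
for event in events:
    if event.get("cat") == "Node":
        op = event.get("args", {}).get("op_name", "unknown")
        per_op[op] = per_op.get(op, 0) + event.get("dur", 0)  # microseconds

for op, dur in sorted(per_op.items(), key=lambda kv: -kv[1]):
    print(f"{op:>24}: {dur / 1000:.1f} ms")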

Pytorch Deployment code:

from transformers import WhisperConfig, WhisperForConditionalGeneration, WhisperProcessor

path = 'openai/whisper-medium'
model = WhisperForConditionalGeneration.from_pretrained(path).cuda()
config = WhisperConfig.from_pretrained(path)
processor = WhisperProcessor.from_pretrained(path)

forced_decoder_ids = processor.get_decoder_prompt_ids(language="de", task="transcribe")
input_features = processor(data_waveform[0], padding="max_length", return_tensors="pt").input_features.cuda()
whisper_deploy = model.generate(inputs=input_features, max_new_tokens=70, forced_decoder_ids=forced_decoder_ids)
transcription = processor.batch_decode(whisper_deploy, skip_special_tokens=True)

hannan72 commented 1 year ago

I get this warning while converting the Whisper model to ONNX or loading the ONNX Whisper model:

[W:onnxruntime:, session_state.cc:1136 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.

It seems some nodes are executed on the CPU, and the data transfers between CPU and GPU during inference may be the cause of this inefficiency.
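
(One way to see exactly which nodes fall back to the CPU, offered here as a sketch rather than something from the thread: rebuild the session with verbose logging, where ORT prints the node-to-execution-provider assignments. The decoder path below is hypothetical.)

import onnxruntime

opts = onnxruntime.SessionOptions()
opts.log_severity_level = 0  # 0 = VERBOSE; node placement details are logged at this level
sess = onnxruntime.InferenceSession(
    "whisper_onnx/decoder_model.onnx",  # hypothetical path to the exported decoder
    sess_options=opts,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(sess.get_providers())  # confirms which providers were actually registered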

fxmarty commented 1 year ago

Hi @yufenglee, #872 should fix the issue. I would recommend using the CLI (optimum-cli export onnx) to avoid re-exporting the model on every run.
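
(A sketch of that workflow, with a hypothetical output directory: export once with the CLI, then point from_pretrained at the resulting folder so nothing is re-exported at load time.)

# optimum-cli export onnx --model openai/whisper-medium whisper_medium_onnx/

from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

model = ORTModelForSpeechSeq2Seq.from_pretrained(
    "whisper_medium_onnx",              # directory written by the CLI command above
    provider="CUDAExecutionProvider",
    use_io_binding=True,
)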

Thank you @hannan72, this is helpful to investigate.

Results for an input of shape (1, 80, 3000), on openai/whisper-small:

Framework                                          Inference time (s)
PyTorch 1.13.1 (eager), cuda                       0.321
ORT + CUDAExecutionProvider + IOBinding (new)      0.388
ORT + CUDAExecutionProvider + IOBinding (old)      0.455

Framework                                          Inference time (s)
PyTorch 1.13.1 (eager), cpu                        2.405
ORT + CPUExecutionProvider (new)                   2.133
ORT + CPUExecutionProvider (old)                   3.287

GPU: GeForce RTX 3060 Mobile; CPU: i7-1280P

So it seems we are still slower than PyTorch on CUDAExecutionProvider. @yufenglee, don't hesitate to share any suggestions you have for that.

Using the fp16 export (optimum-cli export onnx --device cuda --fp16 --model openai/whisper-small whisper_small_new), I get:

Framework                                            Inference time (s)
PyTorch 1.13.1 (eager), cuda, fp16                   0.204
ORT + CUDAExecutionProvider + IOBinding (new, fp16)  0.282

fxmarty commented 1 year ago

Script:

from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from datasets import load_dataset
import time
import gc
import torch
model_id = "/home/fxmarty/optimum/whisper_small_new"
processor = AutoProcessor.from_pretrained(model_id)

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
inputs = processor.feature_extractor(ds[9]["audio"]["array"], return_tensors="pt").to("cuda")

inputs = inputs.to(torch.float16)

##
ort_model = ORTModelForSpeechSeq2Seq.from_pretrained(model_id, provider="CUDAExecutionProvider")

# warmup
_ = ort_model.generate(inputs=inputs.input_features)

n_batch = 10
start = time.time()
for i in range(n_batch):
    gen_tokens = ort_model.generate(inputs=inputs.input_features)
end = time.time()
ort_time = end - start
print(f"ORT: {ort_time / n_batch:.3f} s")

del ort_model
gc.collect()

##
pt_model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-small", torch_dtype=torch.float16).to("cuda")

# warmup
_ = pt_model.generate(inputs=inputs.input_features)

n_batch = 10
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()
for i in range(n_batch):
    gen_tokens = pt_model.generate(inputs=inputs.input_features)
end_event.record()
torch.cuda.synchronize()
pt_time = start_event.elapsed_time(end_event) * 1e-3
print(f"PT: {pt_time / n_batch:.3f} s")

yufenglee commented 1 year ago

@fxmarty, we are working on operator fusion for the Whisper model. Our internal benchmarks show good performance with the fusions. Will keep you posted.
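
(Until those fusions land in ONNX Runtime, Optimum already exposes graph optimization through ORTOptimizer. A minimal sketch of that API with a hypothetical export directory, not the fusion work mentioned above; depending on the optimum version, this model type may not be supported yet.)

from optimum.onnxruntime import ORTModelForSpeechSeq2Seq, ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig

model = ORTModelForSpeechSeq2Seq.from_pretrained("whisper_medium_onnx")  # hypothetical export dir
optimizer = ORTOptimizer.from_pretrained(model)
optimization_config = OptimizationConfig(optimization_level=2)  # enable basic + extended fusions
optimizer.optimize(save_dir="whisper_medium_onnx_optimized", optimization_config=optimization_config)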

kunal-vaishnavi commented 1 year ago

Thank you for removing the constant outputs from the attention subgraphs in the decoder with past model. Can this also be removed from the decoder model?

[screenshot]

fxmarty commented 1 year ago

Thank you for pointing this out @kunal-vaishnavi, will submit a PR!

fxmarty commented 1 year ago

Hi @kunal-vaishnavi, https://github.com/huggingface/optimum/pull/920 has been merged; hopefully we should get slightly better numbers compared to the benchmark above!

fxmarty commented 1 year ago

There remain a few Identity nodes for the layernorm initializers, and I am not sure why. I asked on PyTorch's Slack why they are there: https://pytorch.slack.com/archives/CPMQGB42K/p1680078193941689. Maybe you know, @yufenglee?