yufenglee opened this issue 1 year ago
Thank you for the hint @yufenglee, will fix shortly!
I think this might be connected to an issue I am currently having with onnxruntime-web: https://discuss.huggingface.co/t/when-exporting-seq2seq-models-with-onnx-why-do-we-need-both-decoder-with-past-model-onnx-and-decoder-model-onnx/33354
For some reason, if I use an empty tensor of the correct shape (e.g., [batch, heads, 0, dim]) - which is completely valid for the PyTorch implementation - it returns an empty tensor for the present key values (as opposed to the PyTorch implementation which produces the correct output).
To preemptively answer the question as to why I would pass empty tensors to the decoder, it should bypass the decoder_model.onnx, meaning one would not need to export both decoder_model.onnx and decoder_model_with_past.onnx (which would make my Transformers.js library much more efficient!)
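For illustration, here is roughly what I mean by passing empty tensors (a minimal sketch using onnxruntime directly; the file path is just an example, and the past input names are read from the session rather than assumed):

import numpy as np
import onnxruntime as ort

# example path to an exported decoder-with-past model
sess = ort.InferenceSession("whisper_onnx/decoder_with_past_model.onnx")

batch, heads, head_dim = 1, 8, 64  # heads/head_dim depend on the model size

feeds = {}
for inp in sess.get_inputs():
    if inp.name.startswith("past_key_values"):
        # empty past of shape [batch, heads, 0, head_dim] -- valid for the PyTorch decoder
        feeds[inp.name] = np.zeros((batch, heads, 0, head_dim), dtype=np.float32)

# after filling the remaining inputs (input_ids, encoder outputs, ...):
# present = sess.run(None, feeds)
# the ONNX model returns empty present key/values here, whereas PyTorch with the
# same empty past produces the correct, non-empty ones.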
Thanks!
I measured the inference time of the ONNX whisper-medium from HF (converted using Optimum) for a 10-second audio clip on an A100 GPU.
Breakdown of the inference time for each of the ONNX models:
encoder: 60 ms
decoder: 130 ms
decoder_with_past: 105 ms (each run)
--> In total it takes ~2250 ms for a 10-second audio, which is much slower than the PyTorch deployment.
Breakdown of the inference time for each of the PyTorch models:
encoder: 12 ms
decoder: 22 ms
decoder_with_past: 20 ms (each run)
--> In total it takes ~390 ms for a 10-second audio.
I'm looking forward to any update in Optimum.
Hi @hannan72 thank you for having a detailed look. If you are able to share your script here, it would be helpful. Thank you!
Hi, I've just put some time printing inside the transformers and optimum source code:
for encoder inference: https://github.com/huggingface/transformers/blob/b90fbc7e0ba41dfd6b343e7e2274443f19087f36/src/transformers/generation/utils.py#L1249-L1252
for decoder inference: https://github.com/huggingface/optimum/blob/bde5115d20bc43285827dcaadecc9fc717f2973d/optimum/onnxruntime/modeling_seq2seq.py#L1000-L1004
for decoder_with_past_model inference: https://github.com/huggingface/optimum/blob/bde5115d20bc43285827dcaadecc9fc717f2973d/optimum/onnxruntime/modeling_seq2seq.py#L1006-L1011
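Roughly, the time printing is just a wrapper like the following around the calls at the linked lines (a sketch of the idea, not the exact edit):

import time
from contextlib import contextmanager

@contextmanager
def timed(name):
    start = time.perf_counter()
    yield
    print(f"{name}: {(time.perf_counter() - start) * 1000:.1f} ms")

# e.g. around the decoder-with-past call in modeling_seq2seq.py:
# with timed("decoder_with_past"):
#     decoder_outputs = self.decoder_with_past(...)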
ONNX deployment code:

import onnxruntime as rt
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import WhisperConfig, WhisperProcessor

sess_options = rt.SessionOptions()
sess_options.enable_profiling = True

path = 'openai/whisper-medium'
model = ORTModelForSpeechSeq2Seq.from_pretrained(
    path,
    use_io_binding=True,
    export=True,
    session_options=sess_options,
    provider="CUDAExecutionProvider",
).to('cuda')
config = WhisperConfig.from_pretrained(path)
processor = WhisperProcessor.from_pretrained(path)

forced_decoder_ids = processor.get_decoder_prompt_ids(language="de", task="transcribe")
# data_waveform is the input audio waveform (loading not shown in this snippet)
input_features = processor(data_waveform[0], padding="max_length", return_tensors="pt").input_features.cuda()
whisper_deploy = model.generate(inputs=input_features, max_new_tokens=70, forced_decoder_ids=forced_decoder_ids)
transcription = processor.batch_decode(whisper_deploy, skip_special_tokens=True)
PyTorch deployment code:

from transformers import WhisperConfig, WhisperForConditionalGeneration, WhisperProcessor

path = 'openai/whisper-medium'
model = WhisperForConditionalGeneration.from_pretrained(path).cuda()
config = WhisperConfig.from_pretrained(path)
processor = WhisperProcessor.from_pretrained(path)

forced_decoder_ids = processor.get_decoder_prompt_ids(language="de", task="transcribe")
# data_waveform is the input audio waveform (loading not shown in this snippet)
input_features = processor(data_waveform[0], padding="max_length", return_tensors="pt").input_features.cuda()
whisper_deploy = model.generate(inputs=input_features, max_new_tokens=70, forced_decoder_ids=forced_decoder_ids)
transcription = processor.batch_decode(whisper_deploy, skip_special_tokens=True)
I get this warning while converting the Whisper model to ONNX or loading the ONNX Whisper model:
[W:onnxruntime:, session_state.cc:1136 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
It seems some nodes are executed on the CPU, and the data transfer between CPU and GPU during inference may be the cause of this inefficiency.
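For reference, one way to see which nodes actually run on the CPU is to read the profiling JSON that onnxruntime writes when enable_profiling is set (a sketch; the model path is just an example, and the "provider" field on node events is assumed to be present, as in recent onnxruntime releases):

import json
import onnxruntime as ort

sess_options = ort.SessionOptions()
sess_options.enable_profiling = True
sess = ort.InferenceSession(
    "whisper_onnx/decoder_model.onnx",  # example path
    sess_options,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# ... run a few inferences with sess.run(...) so that node events get recorded ...

profile_path = sess.end_profiling()
with open(profile_path) as f:
    events = json.load(f)

cpu_nodes = {
    e["name"]
    for e in events
    if e.get("cat") == "Node" and e.get("args", {}).get("provider") == "CPUExecutionProvider"
}
print(f"{len(cpu_nodes)} distinct nodes were executed on CPUExecutionProvider")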
Hi @yufenglee, #872 should fix the issue. I would recommend using the CLI optimum-cli export onnx to avoid exporting at every run.
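For example, once the model has been exported with the CLI, the resulting folder can be loaded directly without re-exporting (the output directory name below is just an example):

# optimum-cli export onnx --model openai/whisper-medium whisper_medium_onnx/
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

model = ORTModelForSpeechSeq2Seq.from_pretrained(
    "whisper_medium_onnx",  # local folder produced by the CLI, instead of export=True
    provider="CUDAExecutionProvider",
)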
Thank you @hannan72, this is helpful to investigate.
Results for an input of shape (1, 80, 3000), on openai/whisper-small:

Framework | Inference time (s)
---|---
PyTorch 1.13.1 (eager), cuda | 0.321
ORT + CUDAExecutionProvider + IOBinding (new) | 0.388
ORT + CUDAExecutionProvider + IOBinding (old) | 0.455
Framework | Inference time (s)
---|---
PyTorch 1.13.1 (eager), cpu | 2.405
ORT + CPUExecutionProvider (new) | 2.133
ORT + CPUExecutionProvider (old) | 3.287
GPU: GeForce RTX 3060 Mobile, CPU: i7-1280P
So it seems we are still slower than PyTorch with CUDAExecutionProvider; @yufenglee, don't hesitate to share any suggestions you may have.
Using the fp16 export (optimum-cli export onnx --device cuda --fp16 --model openai/whisper-small whisper_small_new), I get:
Framework | Inference time (s)
---|---
PyTorch 1.13.1 (eager), cuda, fp16 | 0.204
ORT + CUDAExecutionProvider + IOBinding (new, fp16) | 0.282
Script:
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from datasets import load_dataset
import time
import gc
import torch

model_id = "/home/fxmarty/optimum/whisper_small_new"

processor = AutoProcessor.from_pretrained(model_id)
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")

inputs = processor.feature_extractor(ds[9]["audio"]["array"], return_tensors="pt").to("cuda")
inputs = inputs.to(torch.float16)

##
ort_model = ORTModelForSpeechSeq2Seq.from_pretrained(model_id, provider="CUDAExecutionProvider")

# warmup
_ = ort_model.generate(inputs=inputs.input_features)

n_batch = 10

start = time.time()
for i in range(n_batch):
    gen_tokens = ort_model.generate(inputs=inputs.input_features)
end = time.time()

ort_time = end - start
print(f"ORT: {ort_time / n_batch:.3f} s")

del ort_model
gc.collect()

##
pt_model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-small", torch_dtype=torch.float16).to("cuda")

# warmup
_ = pt_model.generate(inputs=inputs.input_features)

n_batch = 10

start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)

start_event.record()
for i in range(n_batch):
    gen_tokens = pt_model.generate(inputs=inputs.input_features)
end_event.record()
torch.cuda.synchronize()

pt_time = start_event.elapsed_time(end_event) * 1e-3
print(f"PT: {pt_time / n_batch:.3f} s")
@fxmarty, we are working on the fusion of the Whisper model. Our internal benchmark shows we get good performance with the fusion. Will keep you posted.
Thank you for removing the constant outputs from the attention subgraphs in the decoder model. Can they also be removed from the decoder with past model?
Thank you @kunal-vaishnavi for notifying, will submit a PR!
Hi @kunal-vaishnavi, https://github.com/huggingface/optimum/pull/920 has been merged; hopefully we should get slightly better results compared to the benchmark above!
There remain a few Identity nodes for the layernorm initializers, not sure why. I asked on PyTorch's Slack why they are there: https://pytorch.slack.com/archives/CPMQGB42K/p1680078193941689. Maybe you know, @yufenglee?
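For reference, the leftover nodes can be listed directly from the exported graph with the onnx package (the file path below is just an example):

import onnx

model = onnx.load("whisper_small_new/decoder_model.onnx")  # example path
for node in model.graph.node:
    if node.op_type == "Identity":
        print(node.name, "inputs:", list(node.input), "outputs:", list(node.output))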
Who can help?
@lewtun , @michaelbenayoun
Reproduction
The exported Whisper ONNX decoder model has encoder.value tensors as outputs. These encoder.value tensors are actually constant during the decoding stage, and the copies performed with Identity nodes are very heavy and make the performance much worse.
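The extra outputs can be seen by listing the output names of the exported decoder (a sketch; the path is an example and the exact present.* naming depends on the export):

import onnx

decoder = onnx.load("whisper_onnx/decoder_model.onnx")  # example path
for out in decoder.graph.output:
    print(out.name)  # outputs such as present.*.encoder.value stay constant during decoding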
Expected behavior
Don't output the encoder values in the decoder model.