XciciciX opened 1 week ago
@XciciciX, could you share the detailed steps to reproduce the issue?
For example, the command lines used to export and optimize the ONNX model, and your test script. Or share the optimized ONNX model. You can also check the operator spec if you suspect some attention node is not correctly fused: https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md
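For example, you can count the attention node types in the optimized graph to see whether fusion actually happened (a minimal sketch; the model path is a placeholder):

```python
from collections import Counter

import onnx

# Count op types in the optimized graph; after a successful fusion you should
# see MultiHeadAttention (or Attention) nodes here.
model = onnx.load("optimized_model.onnx")  # placeholder path
op_counts = Counter(node.op_type for node in model.graph.node)
for op in ("Attention", "MultiHeadAttention"):
    print(op, op_counts.get(op, 0))
```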
Thank you for your response, @tianleiwu.
Here is the part of my code related to model export:
```python
import torch
import whisper

model = whisper.load_model("medium")
x_mel = compute_features("./data/test.mp3")  # log-mel features; helper defined elsewhere
x_audio = model.encoder(x_mel)

torch.onnx.export(
    model.encoder,
    (x_mel,),
    "./models/encoder.onnx",
    input_names=["x"],
    output_names=["out"],
    dynamic_axes={
        "x": {0: "batch"},
        "out": {0: "batch"},
    },
)

# x_tokens is a batch of prompt token ids; its definition is omitted here
torch.onnx.export(
    model.decoder,
    (x_tokens, x_audio),
    "./models/decoder.onnx",
    input_names=["tokens", "audio"],
    output_names=["out"],
    dynamic_axes={
        "tokens": {0: "batch", 1: "seq"},
        "audio": {0: "batch"},
        "out": {0: "batch", 1: "seq"},
    },
)
```
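Before optimizing, it may help to confirm the raw export matches PyTorch. A minimal sketch, assuming the variables from the export code above are still in scope:

```python
import numpy as np
import onnxruntime

# Compare the exported ONNX encoder against the PyTorch output (x_audio).
sess = onnxruntime.InferenceSession("./models/encoder.onnx", providers=["CPUExecutionProvider"])
onnx_out, = sess.run(["out"], {"x": x_mel.numpy()})
print("max abs diff:", np.abs(onnx_out - x_audio.detach().numpy()).max())
```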
Then they are optimized by:

```
python -m onnxruntime.transformers.optimizer --input ./whisper-medium-onnx/decoder.onnx --output ./whisper-medium-onnx-test/decoder__mha.onnx --float16 --model_type bart --num_heads 16 --hidden_size 1024 --use_multi_head_attention
```
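Since `--float16` converts the model to float16 (and, depending on the optimizer version, may also change the graph input/output types), it is worth checking what dtypes the optimized decoder now expects; feeding float32 tensors to float16 inputs will raise a type error at session run. A small sketch, using the output path from the command above:

```python
import onnxruntime

# Print the expected input/output names, types, and shapes of the optimized decoder.
sess = onnxruntime.InferenceSession(
    "./whisper-medium-onnx-test/decoder__mha.onnx",
    providers=["CUDAExecutionProvider"],
)
for t in sess.get_inputs() + sess.get_outputs():
    print(t.name, t.type, t.shape)
```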
Here are the exported models:
https://drive.google.com/drive/folders/16tbQ46OB91hQtIC4XJJvwNVnl5YaVU60?usp=drive_link
encoder.onnx and decoder.onnx are not optimized; the ones with _mha are the optimized versions.
Here is the test script. The original models run correctly; the optimized models also run, but the results are wrong.
```python
import time

import numpy as np
import onnxruntime
import torch

sess_encoder = onnxruntime.InferenceSession("./models/encoder.onnx", providers=["CUDAExecutionProvider"])
sess_decoder = onnxruntime.InferenceSession("./models/decoder.onnx", providers=["CUDAExecutionProvider"])

start = time.time()
x_mel_fp32 = compute_features("./data/test.mp3")  # helper defined elsewhere
x_mel_fp16 = x_mel_fp32.to(dtype=torch.float16)   # used when running the float16 (optimized) models

out_encoder, = sess_encoder.run(["out"], {"x": x_mel_fp32.numpy()})

tokens = list(tokenizer.sot_sequence_including_notimestamps)  # tokenizer and max_tokens defined elsewhere
next_token = tokenizer.sot
while len(tokens) <= max_tokens and next_token != tokenizer.eot:
    out_decoder, = sess_decoder.run(
        ["out"],
        {
            "tokens": np.asarray([tokens], dtype="int64"),
            "audio": out_encoder,
        },
    )
    next_token = out_decoder[0, -1].argmax()  # greedy decoding
    tokens.append(next_token)

print("took", time.time() - start, "seconds")
print(tokenizer.decode(tokens))
```
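To help localize where the outputs start to diverge, one option is to feed the same inputs to both the original and the optimized decoder and compare the logits. A hedged sketch, assuming `out_encoder` from the script above and the paths from the optimizer command (the float16 cast of the inputs is also an assumption):

```python
import numpy as np
import onnxruntime

sess_fp32 = onnxruntime.InferenceSession("./models/decoder.onnx", providers=["CUDAExecutionProvider"])
sess_fp16 = onnxruntime.InferenceSession("./whisper-medium-onnx-test/decoder__mha.onnx", providers=["CUDAExecutionProvider"])

tokens = np.asarray([[50258]], dtype="int64")  # <|startoftranscript|>; any valid prompt works
logits_fp32, = sess_fp32.run(["out"], {"tokens": tokens, "audio": out_encoder})
logits_fp16, = sess_fp16.run(["out"], {"tokens": tokens, "audio": out_encoder.astype(np.float16)})

# float16 introduces some numerical error, but the argmax should normally agree.
print("max abs diff:", np.abs(logits_fp32 - logits_fp16.astype(np.float32)).max())
print("argmax match:", logits_fp32[0, -1].argmax() == logits_fp16[0, -1].argmax())
```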
Describe the issue
I export the Whisper models directly to ONNX from the whisper module. I wrote an inference script and the results are correct. To reduce the runtime, I ran the bart transformer optimizer on the exported models. The number of heads (16) and the hidden size (1024) are correct; I followed the parameters given in the Whisper paper. After optimization, the same inference script produces different results and decoding cannot terminate correctly. I suspect the attention nodes in the Whisper model are not correctly connected after optimization, so there may be a bug.
To reproduce
Whisper medium model
Urgency
Yes
Platform
Windows
OS Version
11
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
latest version
ONNX Runtime API
Python
Architecture
X64
Execution Provider
DirectML
Execution Provider Library Version
No response
Model File
No response
Is this a quantized model?
Yes