Wrong output from ONNX speculative decoding (assistant_model)

System Info

python              3.10.14
transformers        4.37.2
optimum             1.20.0
onnxruntime         1.18.0

Who can help?

No response

Information

[ ] The official example scripts
[X] My own modified scripts

Tasks

[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[X] My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM
import torch
import time

# device = "cuda" if torch.cuda.is_available() else "cpu"
device = "cpu"

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
inputs = tokenizer("Einstein's theory of relativity states", return_tensors="pt").to(
    device
)

# Export both checkpoints to ONNX on the fly.
# `export=True` replaces the deprecated `from_transformers=True` argument.
model = ORTModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b", export=True
).to(device)
assistant_model = ORTModelForCausalLM.from_pretrained(
    "facebook/opt-125m", export=True
).to(device)

# Baseline: generation with the main model only.
start_time = time.perf_counter()
outputs = model.generate(**inputs)
print(f"Time taken: {time.perf_counter() - start_time}")
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
# ["Einstein's theory of relativity states that the speed of light is constant.    "]

# Assisted generation: same prompt, with opt-125m as the assistant model.
start_time = time.perf_counter()
outputs = model.generate(**inputs, assistant_model=assistant_model)
print(f"Time taken: {time.perf_counter() - start_time}")
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

Time taken: 1.2305651204660535
["Einstein's theory of relativity states that the speed of light is constant.    "]
Time taken: 1.2063294481486082
["Einstein's theory of relativity states that the speed,\n �\n\n\n,,,"]

Expected behavior

The generated output should be identical with and without the assistant model.
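
To narrow down where the assisted run goes wrong, it may help to compare the two results token by token rather than as decoded strings. The sketch below is only an illustration: `first_divergence` is a hypothetical helper (not part of transformers or optimum), and the token ids are placeholders rather than real model output.

```python
from typing import Optional, Sequence


def first_divergence(a: Sequence[int], b: Sequence[int]) -> Optional[int]:
    """Return the first index where a and b differ, or None if they are equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One sequence is a strict prefix of the other: they diverge where the shorter ends.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


# Placeholder token ids for illustration only (not real model output).
baseline = [2, 10, 11, 12, 13]
assisted = [2, 10, 11, 99, 98]
print(first_divergence(baseline, assisted))  # 3
```

In the reproduction above, the two `generate` results would need to be stored under separate names and converted with `.tolist()` before being passed in. The returned index would show how many tokens are generated correctly before the assisted output diverges.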

huggingface / optimum

Wrong output from ONNX speculative decoding (assistant_model) #1924
