huggingface / optimum

🚀 Accelerate training and inference of 🤗 Transformers and 🤗 Diffusers with easy to use hardware optimization tools
https://huggingface.co/docs/optimum/main/
Apache License 2.0

Wrong output from ONNX speculative decoding (assistant_model) #1924

Open BowenBao opened 3 months ago

BowenBao commented 3 months ago

System Info

python              3.10.14
transformers        4.37.2
optimum             1.20.0
onnxruntime         1.18.0

Who can help?

No response

Reproduction (minimal, reproducible, runnable)

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM
import torch
import time

# device = "cuda" if torch.cuda.is_available() else "cpu"
device = "cpu"

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
inputs = tokenizer("Einstein's theory of relativity states", return_tensors="pt").to(
    device
)

# export=True converts the PyTorch checkpoint to ONNX on load
# (it replaces the deprecated from_transformers=True).
model = ORTModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b", export=True
).to(device)
assistant_model = ORTModelForCausalLM.from_pretrained(
    "facebook/opt-125m", export=True
).to(device)

# Baseline: generate with the main model only.
start_time = time.perf_counter()
outputs = model.generate(**inputs)
print(f"Time taken: {time.perf_counter() - start_time}")
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
# ["Einstein's theory of relativity states that the speed of light is constant.    "]

# Assisted (speculative) decoding: opt-125m drafts tokens, opt-1.3b verifies them.
start_time = time.perf_counter()
outputs = model.generate(**inputs, assistant_model=assistant_model)
print(f"Time taken: {time.perf_counter() - start_time}")
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

Output:

Time taken: 1.2305651204660535
["Einstein's theory of relativity states that the speed of light is constant.    "]
Time taken: 1.2063294481486082
["Einstein's theory of relativity states that the speed,\n �\n\n\n,,,"]

Expected behavior

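The output should be identical with and without the assistant model. Both calls use the default greedy decoding, so passing assistant_model should only affect speed, not the generated tokens.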
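
To pin down where the two runs start to disagree, it may help to compare token ids rather than decoded strings. The following is a minimal sketch, not part of the original report: `first_divergence` is a hypothetical helper, and the token ids are made up for illustration rather than taken from the run above.

```python
from typing import Optional, Sequence


def first_divergence(baseline: Sequence[int], assisted: Sequence[int]) -> Optional[int]:
    """Return the index of the first differing token id, or None if identical."""
    for i, (a, b) in enumerate(zip(baseline, assisted)):
        if a != b:
            return i
    # No mismatch in the shared prefix: the sequences differ only if
    # one of them is longer than the other.
    if len(baseline) != len(assisted):
        return min(len(baseline), len(assisted))
    return None


# Illustrative token ids only (hypothetical, not from the actual models).
baseline_ids = [2, 100, 200, 300, 400]
assisted_ids = [2, 100, 200, 999, 999]
print(first_divergence(baseline_ids, assisted_ids))
```

With the reproduction script, the two `generate` results would need to be stored under separate names (for example `baseline` and `assisted`, both hypothetical) and passed in as `baseline[0].tolist()` and `assisted[0].tolist()`. A result equal to the prompt length would suggest that the very first assisted step already goes wrong.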

BowenBao commented 3 months ago

Likely related to #1848.