natke opened this issue 3 weeks ago
So after digging through the C++ source code, the answer is:

`logits = generator.get_output("logits")`

However, for some reason, at the first step the maximum token differs from the output of `get_next_tokens()`. Not sure if this is a bug or a misunderstanding on my part.
```python
import onnxruntime_genai as og
import numpy as np

prompt = '''<|user|>
Please tell me the time.<|end|>
<|assistant|>'''

model = og.Model("/home/ubuntu/models/Phi-3-mini-4k-instruct-onnx/cuda/cuda-fp16/")
tokenizer = og.Tokenizer(model)
tokens = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.input_ids = tokens

generator = og.Generator(model, params)

i = 0
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    logits = generator.get_output("logits").squeeze()
    new_token2 = np.argmax(logits)
    print(new_token, " ", new_token2)
    i += 1
    if i > 10:
        break
print()
```

And the result:
```
306   18
29915 29915
29885 29885
9368  9368
304   304
3867  3867
1855  1855
29899 29899
2230  2230
848   848
29892 29892
```
This is expected. `get_output(output_name)` returns the output tensor named `output_name`. In the prompt case, it returns the logits for the whole prompt, with shape `(batch_size, prompt_length, vocab_size)`.
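To illustrate why the first step looks different, here is a minimal numpy-only sketch (the shapes and values are invented for illustration; no model is involved). Because the prompt-phase logits are three-dimensional, calling `squeeze()` and then `np.argmax` over the whole array mixes together every prompt position; greedy decoding should only look at the last position along the sequence axis:

```python
import numpy as np

# Hypothetical prompt-phase logits: batch 1, prompt length 4, vocab size 10.
rng = np.random.default_rng(0)
logits = rng.normal(size=(1, 4, 10)).astype(np.float32)

# What the snippet in the issue does at the first step: squeeze to (4, 10),
# then argmax over all 40 entries -> a flattened index, not a token id.
flat_argmax = int(np.argmax(logits.squeeze()))

# What greedy decoding actually uses: argmax over the last prompt position only.
next_token = int(np.argmax(logits[0, -1, :]))

print(flat_argmax, next_token)
```

On later steps the tensor has sequence length 1, so both computations happen to agree, which matches the printed pairs above differing only at the first step.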
Hi @anutkk
We added some more documentation for this API here: https://onnxruntime.ai/docs/genai/api/python.html#get-output
Please let us know if that helps you resolve your issue.
Thanks @natke and @yufenglee.

If so, how can I capture the logits of the first generated token? The matrix returned for the prompt (i.e., before calling `compute_logits()`, I guess?) is all zeroes.
I'm not entirely sure https://github.com/microsoft/onnxruntime-genai/pull/611 deals with this issue.
@anutkk, to get the logits, you can do the following:

```python
generator.compute_logits()
logits = generator.get_output('logits')
assert np.allclose(logits[:, :, ::200], expected_sampled_logits_prompt, atol=1e-3)
generator.generate_next_token()
```

as shown in the test file: https://github.com/microsoft/onnxruntime-genai/blob/c622cc11622dfb88b8b43f1e72347119b8728a25/test/python/test_onnxruntime_genai_api.py#L208-L211
@anutkk Basically, the first call to `get_output("logits")` will give you the logits of the prompt, and the second call will give you the logits of the first generated token. Let us know if that solves your issue!
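To make that concrete, here is a small numpy-only sketch (shapes and values are invented; the `og` library is not involved) of how both cases can be handled uniformly. Whether a call returned the full prompt logits or a single-step tensor, the logits for the next token always sit at the last position along the sequence axis, so the same slice works at every step:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size = 50  # hypothetical vocab size

# First call after the prompt: shape (batch, prompt_length, vocab_size).
# Subsequent calls: shape (batch, 1, vocab_size), one decoding step each.
outputs = [
    rng.normal(size=(1, 7, vocab_size)),
    rng.normal(size=(1, 1, vocab_size)),
    rng.normal(size=(1, 1, vocab_size)),
]

# In every case the next-token logits are the last sequence position.
step_logits = [out[0, -1, :] for out in outputs]
greedy_tokens = [int(np.argmax(l)) for l in step_logits]

print(greedy_tokens)
```

With this normalization, the logits of the first generated token are simply the last row of the prompt-phase output, which also explains why the user's comparison only disagreed at step one.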
Discussed in https://github.com/microsoft/onnxruntime-genai/discussions/522