microsoft / onnxruntime-genai

Generative AI extensions for onnxruntime
MIT License

Usage example for `get_output()` #591

Open natke opened 3 weeks ago

natke commented 3 weeks ago

Discussed in https://github.com/microsoft/onnxruntime-genai/discussions/522

Originally posted by **anutkk** May 26, 2024

According to the [documentation](https://onnxruntime.ai/docs/genai/api/python.html#get-output), `generator.get_output()` should return the generated logits. In practice, this is the error message I get:

```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[37], line 1
----> 1 generator.get_output()

TypeError: get_output(): incompatible function arguments. The following argument types are supported:
    1. (self: onnxruntime_genai.onnxruntime_genai.Generator, arg0: str) -> numpy.ndarray

Invoked with:
```

The function expects an input string. However, no matter what I put, the output is `array([], dtype=float64)`. What is the correct way to use this method?
natke commented 3 weeks ago

So after digging through the C++ source code, the answer is:

```python
logits = generator.get_output("logits")
```

However, for some reason, at the first step the maximum token is different from the output of `get_next_tokens()`. Not sure if this is a bug or a misunderstanding.

```python
import onnxruntime_genai as og
import numpy as np

prompt = '''<|user|>
Please tell me the time.<|end|>
<|assistant|>'''

model = og.Model("/home/ubuntu/models/Phi-3-mini-4k-instruct-onnx/cuda/cuda-fp16/")
tokenizer = og.Tokenizer(model)
tokens = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.input_ids = tokens

generator = og.Generator(model, params)
i = 0
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    logits = generator.get_output("logits").squeeze()
    new_token2 = np.argmax(logits)
    print(new_token, " ", new_token2)
    i += 1
    if i > 10:
        break

print()
```

And the result:

```
306 18
29915 29915
29885 29885
9368 9368
304 304
3867 3867
1855 1855
29899 29899
2230 2230
848 848
29892 29892
```

yufenglee commented 3 weeks ago

It is expected. `get_output(output_name)` returns the output tensor named `output_name`. In the prompt case, it returns the logits for the whole prompt, with shape `(batch_size, prompt_length, vocab_size)`.
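
That shape also explains the mismatch above. A sketch (reusing the `generator` from the earlier snippet, and assuming greedy search so `get_next_tokens()` returns the argmax) of indexing the prompt-step logits so the two values line up:

```python
import numpy as np

# At the prompt step, the squeezed logits cover every prompt position:
logits = generator.get_output("logits").squeeze()  # (prompt_length, vocab_size)

# np.argmax with no axis flattens the matrix, so this is not a token id:
wrong = np.argmax(logits)

# The last prompt position holds the distribution the first token is
# sampled from; under greedy search this matches get_next_tokens()[0]:
right = np.argmax(logits[-1])
```

At later steps the logits squeeze to `(vocab_size,)`, so a plain `np.argmax` works, which is why only the first pair in the printout disagreed.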

natke commented 2 weeks ago

Hi @anutkk

We added some more documentation for this API here: https://onnxruntime.ai/docs/genai/api/python.html#get-output

Please let us know if that helps you resolve your issue.

anutkk commented 2 weeks ago

Thanks @natke and @yufenglee. If so, how can I capture the logits of the first generated token? The matrix returned for the "prompt" (i.e. before calling `compute_logits()`, I guess?) is all zeros. I'm not entirely sure https://github.com/microsoft/onnxruntime-genai/pull/611 deals with this issue.

yufenglee commented 1 week ago

> Thanks @natke and @yufenglee. If so, how can I capture the logits of the first generated token? The matrix returned for the "prompt" (i.e. before calling `compute_logits()`, I guess?) is all zeros. I'm not entirely sure #611 deals with this issue.

@anutkk, to get the logits you can do like this:

```python
generator.compute_logits()
logits = generator.get_output('logits')
assert np.allclose(logits[:, :, ::200], expected_sampled_logits_prompt, atol=1e-3)
generator.generate_next_token()
```

as shown in the test file: https://github.com/microsoft/onnxruntime-genai/blob/c622cc11622dfb88b8b43f1e72347119b8728a25/test/python/test_onnxruntime_genai_api.py#L208-L211
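
As a quick sanity check (a sketch reusing the `generator` from above; the exact sizes depend on the model), the output shapes differ between the prompt step and later steps:

```python
generator.compute_logits()
print(generator.get_output("logits").shape)  # prompt step: (batch_size, prompt_length, vocab_size)
generator.generate_next_token()

generator.compute_logits()
print(generator.get_output("logits").shape)  # later steps: (batch_size, 1, vocab_size)
generator.generate_next_token()
```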

natke commented 1 week ago

@anutkk Basically, the first call to `get_output("logits")` will give you the logits of the prompt, and the second call will give you the logits of the first generated token. Let us know if that solves your issue!
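
Putting it together, a minimal sketch of capturing the logits behind the first generated token (the model path is a placeholder, and greedy search is assumed for the final check):

```python
import numpy as np
import onnxruntime_genai as og

model = og.Model("path/to/model")  # placeholder path
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.input_ids = tokenizer.encode("<|user|>\nPlease tell me the time.<|end|>\n<|assistant|>")
generator = og.Generator(model, params)

# First compute_logits(): logits over the whole prompt. The last position
# is the distribution the first generated token is drawn from.
generator.compute_logits()
prompt_logits = generator.get_output("logits")       # (1, prompt_length, vocab_size)
first_token_logits = prompt_logits[0, -1, :].copy()  # copy in case the array aliases an internal buffer
generator.generate_next_token()

first_token = generator.get_next_tokens()[0]
assert first_token == np.argmax(first_token_logits)  # holds under greedy search
```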