After an engine is built with `--gather_all_token_logits` and a call is made through the backend with `return_context_logits: True` and `return_generation_logits: True`, it seems that to piece together the full `text_output` you have to grab the very last `context_logits` entry and then the remaining `generation_logits`.
An example would be prompting a model like Llama-3 to "Say Yes!", with the reply from the model being "YES!".
Imagine calling the `/generate` endpoint of `ensemble` with a request that contains this in the data body:
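Roughly something like this (a sketch, not the exact body; the field names are the usual `tensorrtllm_backend` ensemble inputs, and the host/port are placeholders):

```python
# Rough sketch of the call (placeholder host/port; field names are the usual
# tensorrtllm_backend ensemble inputs).
import requests

payload = {
    "text_input": "Say Yes!",
    "max_tokens": 16,
    "return_context_logits": True,
    "return_generation_logits": True,
}
resp = requests.post("http://localhost:8000/v2/models/ensemble/generate", json=payload)
result = resp.json()

text_output = result["text_output"]
context_logits = result["context_logits"]        # logits over the prompt positions
generation_logits = result["generation_logits"]  # logits over the generated positions
```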
with a request that contains this in the data body:text_output is : "YES!"
Now, when you take the `np.argmax` of the `generation_logits` and feed it back through the tokenizer, all you see is the reply with the "YES" missing. The token associated with "YES" is actually found when you take the `np.argmax` of the last result in the `context_logits` matrix.

Should the "YES" not be in the `generation_logits` as well? Am I not understanding something fundamental?
I am using TensorRT-LLM v0.10.0 and the `nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3` image to host the model.

Thanks for any help!