After an engine is built with `--gather_all_token_logits` and a call is made through the backend with `return_context_logits: True` and `return_generation_logits: True`, it seems that to piece together the full `text_output` you have to grab the very last `context_logits` entry and then the remaining `generation_logits`.
An example would be prompting a model like Llama-3 to "Say Yes!", with the reply from the model being "YES!".
Imagine calling the `/generate` endpoint of `ensemble` with a request that contains this in the data body:
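Roughly something like this (a sketch, not the exact body; the field names are the usual `tensorrtllm_backend` ensemble inputs, and the host/port are placeholders):

```python
# Rough sketch of the call (placeholder host/port; field names are the usual
# tensorrtllm_backend ensemble inputs).
import requests

payload = {
    "text_input": "Say Yes!",
    "max_tokens": 16,
    "return_context_logits": True,
    "return_generation_logits": True,
}
resp = requests.post("http://localhost:8000/v2/models/ensemble/generate", json=payload)
result = resp.json()

text_output = result["text_output"]
context_logits = result["context_logits"]        # logits over the prompt positions
generation_logits = result["generation_logits"]  # logits over the generated positions
```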
with a request that contains this in the data body:text_output is : "YES!"
Now, when you take the `np.argmax` of the `generation_logits` and feed it back through the tokenizer, all you see is the reply with the "YES" missing. The token associated with "YES" is actually found when you take the `np.argmax` of the last result in the `context_logits` matrix.

Should the "YES" not be in the `generation_logits` as well? Am I not understanding something fundamental?
I am using TensorRT-LLM v0.10.0 and the `nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3` image to host the model.

Thanks for any help!