triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

sequence_length output tensor does not reflect the position of end_id token. #634

Closed jxchenus closed 3 weeks ago

jxchenus commented 4 weeks ago

When using the tensorrtllm backend, the value in the sequence_length output tensor is always the sum of input_lengths and request_output_len, and does not reflect the position of the end_id token.

In contrast, when using the python backend, if we set output_sequence_lengths to true, the value in the sequence_lengths output tensor reflects the position of the first end_id token.

jxchenus commented 3 weeks ago

Pasting the discussion thread here:

Regarding the trtllm backend output: we're using the two outputs from the config directly, output_ids and sequence_length. The input request_output_len is required (a value of 0 or -1 is not valid).

Our input has a length of 37:

```
[ 3 108 27173 4435 44698 414 409 3812 423 4 125000 146304 146305 146306 146307 146308 146309 146310 146311 146312 146313 146314 146315 146316 146317 146318 146319 146320 146321 146322 146323 146324 146325 146326 146327 146328 5 ]
```

The expected output is 108 468 109, with 109 being the END_ID. When we use request_output_len=8, we get this output:

```
output_ids: [[ 3 108 27173 4435 44698 414 409 3812 423 4 125000 146304 146305 146306 146307 146308 146309 146310 146311 146312 146313 146314 146315 146316 146317 146318 146319 146320 146321 146322 146323 146324 146325 146326 146327 146328 5 108 468 109 108 1539 109 108 109]]
sequence_length: [45]
```

The biggest problem here is that generation continues after seeing the first END_ID (109) and stops only when it reaches request_output_len. For sequence_length, we would expect it to be 40 (37 + 3) instead of 45 (37 + 8).
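Until the backend returns the length up to the first end_id itself, a client can recompute it from output_ids. Below is a minimal, hypothetical post-processing sketch (not part of the backend); the helper name is my own, and the example reuses the numbers from the thread (37 prompt tokens, END_ID 109, 8 generated tokens), with the prompt tokens replaced by placeholder zeros for brevity.

```python
def effective_sequence_length(output_ids, input_length, end_id):
    """Return the sequence length up to and including the first end_id
    token that appears after the prompt, or the full length if the
    generation never emitted end_id."""
    for pos in range(input_length, len(output_ids)):
        if output_ids[pos] == end_id:
            return pos + 1  # include the end_id token itself
    return len(output_ids)


# Shape of the example above: 37 prompt tokens (placeholders here),
# then the 8 generated tokens reported by the backend.
output_ids = [0] * 37 + [108, 468, 109, 108, 1539, 109, 108, 109]

print(effective_sequence_length(output_ids, input_length=37, end_id=109))
# → 40, i.e. 37 + 3, matching the expectation in the thread
```

The generated tokens past position 40 would then be discarded by the client, which works around the reported sequence_length of 45 but not the wasted generation itself.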

jxchenus commented 3 weeks ago

Ah, I see what you meant now. I also observe that if I pass in input token ids directly, it always generates "request_seq_len" many output token ids, but when I pass in something other than token ids (like a text prompt), generation ends without reaching the "request_seq_len" length.