triton-inference-server / client

Triton Python, C++, and Java client libraries, and gRPC-generated client examples for Go, Java, and Scala.
BSD 3-Clause "New" or "Revised" License

Calculate total output tokens from full text #698

Closed: IzzyPutterman closed this 3 months ago

IzzyPutterman commented 3 months ago

Calculating total tokens as the sum of per-chunk counts can produce numbers that are off (by about 10% in the case of Llama 3). This is because our old WAR of tokenizing "!" + text assumes "!" always encodes as its own token, but it sometimes merges with the start of the text into a single token instead of two. For example, "!." is just 1 token instead of 2.

!, [17581]
!? [58490]
!, [17581]
!, [17581]
!? [58490]
!, [17581]
!. [15725]
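
To reproduce the merge behavior above, here is a minimal sketch. It assumes the Hugging Face `transformers` library and a Llama 3 tokenizer; the checkpoint name is an assumption, and any BPE tokenizer that merges punctuation pairs will show the same effect.

```python
# Minimal repro sketch (assumes Hugging Face transformers and access to a
# Llama 3 tokenizer; the checkpoint name is an assumption, not from this thread).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# The old WAR prepends "!" to each chunk and subtracts one, assuming "!"
# always encodes as its own token. For chunks that start with punctuation,
# BPE merges "!" with the next character into a single token, so the
# per-chunk count comes out one too low.
for chunk in [",", "?", "."]:
    ids = tokenizer.encode("!" + chunk, add_special_tokens=False)
    print(f"!{chunk} -> {ids} ({len(ids)} token(s))")
```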

Given that ITL has been changed to be computed at the request level, we should take advantage of that flexibility and make the token count more accurate by computing it from the full text.
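
A sketch of the direction proposed here, not the actual patch: tokenize the concatenated output once instead of summing per-chunk counts, so merges across chunk boundaries (and the "!" WAR) cannot skew the total. The function name and signature are assumptions.

```python
# Hypothetical helper illustrating the proposed fix; name and signature are
# assumptions, not the code merged in this PR.
def total_output_tokens(chunks, tokenizer):
    """Count output tokens by tokenizing the full concatenated text once."""
    full_text = "".join(chunks)
    return len(tokenizer.encode(full_text, add_special_tokens=False))
```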

nv-hwoo commented 3 months ago

CI ref: 15744750

nv-hwoo commented 3 months ago

Good question. It won't affect the statistics, because the per-chunk counts are not part of the statistics; we keep the output token counts only for visualization purposes, to support our token position vs. ITL plot. With our updated ITL metric, there are no longer token-level inter-token latencies.
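
For context, a request-level ITL can be sketched as below. This is a hedged illustration of one common definition, not necessarily the exact formula the tool uses; the point is that only the total token count enters the metric, so per-chunk counts matter only for the plot.

```python
# Hedged sketch of a request-level inter-token latency: spread the time after
# the first token evenly across the remaining output tokens. One common
# definition; the exact formula used by the tool is not shown in this thread.
def request_level_itl(first_token_time_s: float,
                      last_token_time_s: float,
                      num_output_tokens: int) -> float:
    if num_output_tokens <= 1:
        return 0.0
    return (last_token_time_s - first_token_time_s) / (num_output_tokens - 1)
```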

dyastremsky commented 3 months ago

> Good question. It won't affect the statistics, because the per-chunk counts are not part of the statistics; we keep the output token counts only for visualization purposes, to support our token position vs. ITL plot. With our updated ITL metric, there are no longer token-level inter-token latencies.

Got it, thanks for explaining! Would it be possible to add tests where the sum of the per-chunk token counts differs from the token count of the full text output? Or is that no longer necessary?

nv-hwoo commented 3 months ago

@dyastremsky Added a check for whether the sum equals the total token count. The individual token counts are not part of the statistics, but it never hurts to add more tests :) Thanks for the feedback 👍
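
A hypothetical shape for such a test, reusing the `total_output_tokens` sketch from above (all names here are assumptions, not the tests added in the PR):

```python
def token_count(tokenizer, text):
    return len(tokenizer.encode(text, add_special_tokens=False))

def test_total_matches_full_text_count(tokenizer):
    # A chunk starting with punctuation triggers the merge described above,
    # so the per-chunk sum and the full-text count can legitimately differ.
    chunks = ["Hello", ", world", "."]
    chunk_sum = sum(token_count(tokenizer, c) for c in chunks)
    total = total_output_tokens(chunks, tokenizer)  # sketched earlier
    # The reported total should always track the full text.
    assert total == token_count(tokenizer, "".join(chunks))
    # The chunk sum is informational only and may differ from the total.
    print(f"chunk sum = {chunk_sum}, full-text total = {total}")
```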