triton-inference-server / client

Triton Python, C++ and Java client libraries, and GRPC-generated client examples for go, java and scala.
BSD 3-Clause "New" or "Revised" License

vLLM ITL fix #667

Closed · IzzyPutterman closed this 1 month ago

IzzyPutterman commented 1 month ago

In vLLM, outputs can be empty in the middle of the stream.

nv-hwoo commented 1 month ago

@IzzyPutterman was this breaking the ITL calculation (e.g. empty output --> zero token --> divide by zero)? Just trying to understand the context.

IzzyPutterman commented 1 month ago

> @IzzyPutterman was this breaking the ITL calculation (e.g. empty output --> zero token --> divide by zero)? Just trying to understand the context.

(GitHub isn't letting me reply inline.) Empty output -> zero tokens -> replaced by 1 token in llm_metrics -> the total number of tokens used for the ITL calculation is greater than the expected total, but that isn't reported. So the reported ITL comes out smaller than expected.
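For illustration, here is a minimal sketch of the effect being described. This is not the actual llm_metrics code; the function and variable names are hypothetical, and it assumes a per-request ITL of the form (request latency - TTFT) / (output tokens - 1):

```python
# Hypothetical sketch: counting an empty stream chunk as one token inflates the
# denominator of the ITL calculation, so the reported ITL is smaller than expected.

def itl_ms(latency_ms: float, ttft_ms: float, chunk_token_counts: list[int],
           skip_empty: bool) -> float:
    """Average inter-token latency for one streamed request (hypothetical formula)."""
    if skip_empty:
        # Fix: drop empty chunks so they contribute zero tokens.
        counts = [c for c in chunk_token_counts if c > 0]
    else:
        # Old behavior: an empty chunk is replaced by 1 token.
        counts = [max(c, 1) for c in chunk_token_counts]
    total_tokens = sum(counts)
    return (latency_ms - ttft_ms) / max(total_tokens - 1, 1)

# A vLLM stream with empty chunks in the middle: tokens per chunk.
chunks = [5, 0, 0, 4, 6]
print(itl_ms(300.0, 50.0, chunks, skip_empty=False))  # denominator 16 -> ITL deflated
print(itl_ms(300.0, 50.0, chunks, skip_empty=True))   # denominator 14 -> expected ITL
```

Skipping the empty outputs keeps the denominator at the true token count, so the reported ITL is no longer deflated.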

nv-hwoo commented 1 month ago

> @IzzyPutterman was this breaking the ITL calculation (e.g. empty output --> zero token --> divide by zero)? Just trying to understand the context.
>
> (GitHub isn't letting me reply inline.) Empty output -> zero tokens -> replaced by 1 token in llm_metrics -> the total number of tokens used for the ITL calculation is greater than the expected total, but that isn't reported. So the reported ITL comes out smaller than expected.

Ah, I see. Yeah, that makes sense. I think these cases will become irrelevant once we update to a new ITL formula, but the change looks good to me.