This PR does two main things:
1) Add support for Triton's generate endpoint. This reuses the PA implementation of the OpenAI HTTP client, since that client already supports text in / text out and streaming. The format of the input message is similar to completions, but uses "text_input" and "text_output" instead of "prompt".
2) Remove the "service-kind" parameter from the top-level CLI. Service kind can be inferred from endpoint-type, and endpoint-type is clearer because it is tied to the API and not the implementation. The service kinds "openai" and "triton" were also not parallel: "openai" is an API, while "triton" is a server. Because the PA implementation is tied to service-kind, this change is only at the genai-perf level. Internally service-kind is still present; it is just set based on endpoint-type. To facilitate this, a new endpoint-type of kserve was added.
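The two items above can be sketched roughly as follows. This is a minimal illustration, not the genai-perf code: the names `ENDPOINT_TYPE_TO_SERVICE_KIND`, `infer_service_kind`, and `build_payload` are hypothetical, the set of endpoint-type values is assumed, and mapping "generate" to the "openai" service kind is an assumption drawn from the statement that generate reuses the OpenAI HTTP client in PA. The payloads are simplified and omit fields such as the model name.

```python
# Hypothetical mapping from endpoint-type to the internal service-kind.
# The keys and the choice for "generate" are assumptions, not the real table.
ENDPOINT_TYPE_TO_SERVICE_KIND = {
    "chat": "openai",
    "completions": "openai",
    "generate": "openai",  # assumed: reuses PA's OpenAI HTTP client
    "kserve": "triton",
}


def infer_service_kind(endpoint_type: str) -> str:
    """Derive the internal service-kind so the user only passes endpoint-type."""
    try:
        return ENDPOINT_TYPE_TO_SERVICE_KIND[endpoint_type]
    except KeyError:
        raise ValueError(f"unsupported endpoint-type: {endpoint_type!r}") from None


def build_payload(endpoint_type: str, text: str, stream: bool = False) -> dict:
    """Build a simplified request body for a text-in / text-out endpoint."""
    if endpoint_type == "completions":
        # Completions-style body: the input text goes under "prompt".
        return {"prompt": text, "stream": stream}
    if endpoint_type == "generate":
        # Generate-style body: the input text goes under "text_input".
        return {"text_input": text, "stream": stream}
    raise ValueError(f"no text payload sketch for endpoint-type: {endpoint_type!r}")
```

With this shape, the CLI layer would call `infer_service_kind(args.endpoint_type)` once and pass the result down, so nothing below genai-perf needs to change.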
Existing tests have been updated.
No new tests were added. That could be done here or in a separate PR.
Note: most changes are in genai-perf, but a small change was made to PA to allow using the end of the request as a completion event even in streaming cases. Since generate doesn't include an explicit done message, we use the end of the request as the indication of done.
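As an illustration of the completion handling described in the note, here is a small Python sketch. It is not the PA change itself (PA is not written in Python); `collect_stream` is a hypothetical helper, and the chunk shapes are simplified to a flat JSON object per server-sent-event line. The idea it shows: an OpenAI-style stream is finished when a `[DONE]` message arrives, while a generate stream is finished only when the response ends.

```python
import json
from typing import Iterable, List, Tuple

OPENAI_DONE = "[DONE]"  # explicit terminator sent by OpenAI-style streams


def collect_stream(
    lines: Iterable[str], text_key: str = "text_output"
) -> Tuple[List[str], str]:
    """Collect streamed text chunks and report how completion was detected.

    Returns (chunks, reason), where reason is "done_message" if an explicit
    terminator was seen, or "end_of_request" if the stream simply ended.
    """
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data lines
        data = line[len("data:"):].strip()
        if data == OPENAI_DONE:
            return chunks, "done_message"
        chunks.append(json.loads(data)[text_key])
    # No explicit done message: the end of the request is the completion event.
    return chunks, "end_of_request"
```

For a generate-style stream, `collect_stream(['data: {"text_output": "Hel"}', '', 'data: {"text_output": "lo"}'])` yields the two chunks with the reason `"end_of_request"`.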