triton-inference-server / client

Triton Python, C++, and Java client libraries, and GRPC-generated client examples for Go, Java, and Scala.
BSD 3-Clause "New" or "Revised" License

feature: triton generate support #675

Open · nnshah1 opened 1 month ago

nnshah1 commented 1 month ago

This PR does two main things:

1) Adds support for Triton's generate endpoint. This reuses the PA implementation for the OpenAI HTTP client, since that client already supports text in / text out and streaming. The input message format is similar to completions, but uses "text_input" and "text_output" instead of "prompt" (see the request sketch after this list).

2) Removes the "service-kind" parameter from the top-level CLI. The service kind can be inferred from endpoint-type, and endpoint-type is clearer: it is tied to the API rather than to the implementation. The service-kind values "openai" vs "triton" were also not parallel, since OpenAI is an API while Triton is a server. Because the PA implementation is tied to service-kind, this change is only at the genai-perf level; internally, service-kind is still present, it is just set based on endpoint-type. To facilitate this, a new endpoint-type of kserve was added (see the CLI sketch after this list).
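For illustration, here is a minimal sketch of how a generate request differs from a completions-style payload. It assumes a local Triton server and a placeholder model name; generation parameters such as `max_tokens` depend on the serving backend.

```python
import requests

BASE_URL = "http://localhost:8000"  # placeholder server address
MODEL = "my_model"                  # placeholder model name

# An OpenAI-style completions payload keys the text on "prompt" ...
completions_payload = {"model": MODEL, "prompt": "Hello, world"}

# ... while the generate endpoint keys it on "text_input". Other
# parameters (e.g. max_tokens) depend on the serving backend.
generate_payload = {"text_input": "Hello, world", "max_tokens": 16}

resp = requests.post(f"{BASE_URL}/v2/models/{MODEL}/generate", json=generate_payload)
resp.raise_for_status()

# The generated text comes back under "text_output".
print(resp.json()["text_output"])
```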
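As a rough before/after sketch of the CLI change (the flags and values follow the description above; the model name is a placeholder):

```
# before: service kind passed explicitly at the top level
genai-perf -m my_model --service-kind triton

# after: service kind is inferred from the endpoint type;
# "kserve" is the new endpoint-type added by this PR
genai-perf -m my_model --endpoint-type kserve
```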

Existing tests have been updated.

No new tests were added; they could be added here or in a separate PR.

Note: most changes are in genai-perf, but a small change was added to PA to allow using the end of the request as a completion event even in streaming cases. Since generate doesn't include an explicit done message, we use the end of the request as the indication of done.
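To make that streaming behavior concrete, below is a hedged Python sketch (not PA's actual C++ implementation) of consuming generate_stream and treating the end of the HTTP response as the completion signal; the server address and model name are placeholders.

```python
import json
import requests

BASE_URL = "http://localhost:8000"  # placeholder server address
MODEL = "my_model"                  # placeholder model name

with requests.post(
    f"{BASE_URL}/v2/models/{MODEL}/generate_stream",
    json={"text_input": "Hello"},
    stream=True,
) as resp:
    for raw in resp.iter_lines():
        # Responses arrive as server-sent events ("data: {...}");
        # blank keep-alive lines are skipped.
        if not raw or not raw.startswith(b"data:"):
            continue
        event = json.loads(raw.decode("utf-8").removeprefix("data:").strip())
        print(event.get("text_output", ""), end="", flush=True)

# There is no explicit "[DONE]"-style sentinel: reaching this point,
# i.e. the end of the HTTP request, is itself the completion event.
print()
```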