triton-inference-server / client

Triton Python, C++, and Java client libraries, and gRPC-generated client examples for Go, Java, and Scala.

Enable client-side batching for OpenAI #735

Closed by dyastremsky 3 weeks ago

dyastremsky commented 3 weeks ago

Enable client-side batching with the `--batch-size` argument. GenAI-Perf will batch requests for the OpenAI service kind.

Batching is already supported for the rankings and embeddings endpoints. This PR extends that support to the completions and chat endpoints.
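
For context, here is a rough sketch of the request bodies involved (model names and inputs are placeholders, not taken from this PR): the embeddings endpoint already accepts a list under `input`, while a chat request carries a single `messages` conversation, which is why a client tool like GenAI-Perf has to do the batching itself.

```python
import json

# Embeddings already accept a client-side batch: "input" may be a list of
# strings, so one HTTP request carries the whole batch.
embeddings_body = {
    "model": "my-embedding-model",           # hypothetical model name
    "input": ["first text", "second text"],  # client-side batch of 2
}

# Chat completions take a single conversation per request: "messages" is one
# message list, so batching has to be handled by the client tool instead.
chat_body = {
    "model": "my-chat-model",                # hypothetical model name
    "messages": [{"role": "user", "content": "one prompt"}],
}

print(json.dumps(embeddings_body, indent=2))
print(json.dumps(chat_body, indent=2))
```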

TODO:

dyastremsky commented 3 weeks ago

It looks like OpenAI's chat completions API does not properly support client-side batching. See here: https://community.openai.com/t/batching-with-chatcompletion-endpoint/137723/2
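
To illustrate the limitation (model name and prompts are made up): putting several prompts into `messages` is read as the turns of one conversation, not as independent batch items.

```python
# Why "batching" through the chat endpoint does not work: the "messages"
# list is interpreted as the turns of ONE conversation, not as a batch of
# independent prompts. Model name and prompts are illustrative only.
chat_body = {
    "model": "my-chat-model",
    "messages": [
        {"role": "user", "content": "Prompt A"},
        {"role": "user", "content": "Prompt B"},  # treated as a follow-up turn
    ],
}
# The response holds a single completion (one entry in "choices"), so there
# are no per-prompt outputs for the client to de-batch.
```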

The legacy completions API does, but it's not worth updating the code just to accommodate that API at this time. I'll close out this PR.
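
For the record, the legacy endpoint's batching works like this (illustrative payload; the model name is a placeholder):

```python
# The legacy /v1/completions endpoint accepts "prompt" as a list and returns
# one choice per prompt, matched up by the "index" field in the response.
completions_body = {
    "model": "my-completions-model",     # hypothetical model name
    "prompt": ["Prompt A", "Prompt B"],  # client-side batch of 2
    "max_tokens": 16,
}
# Abridged response shape:
# {"choices": [{"index": 0, "text": "..."}, {"index": 1, "text": "..."}]}
```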