triton-inference-server / client

Triton Python, C++, and Java client libraries, and gRPC-generated client examples for Go, Java, and Scala.

Enable client-side batching for OpenAI #735

Closed by dyastremsky 3 weeks ago

dyastremsky commented 3 weeks ago

Enable client-side batching with the `--batch-size` argument. GenAI-Perf will batch requests for the OpenAI service kind.

Batching is already supported for the rankings and embeddings endpoints. This PR extends that support to the completions and chat endpoints.
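
For context, here is a rough sketch of the request bodies involved (model names and inputs are placeholders, not taken from this PR): the embeddings endpoint already accepts a list under `input`, while a chat request carries a single `messages` conversation, which is why a client tool like GenAI-Perf has to do the batching itself.

```python
import json

# Embeddings already accept a client-side batch: "input" may be a list of
# strings, so one HTTP request carries the whole batch.
embeddings_body = {
    "model": "my-embedding-model",           # hypothetical model name
    "input": ["first text", "second text"],  # client-side batch of 2
}

# Chat completions take a single conversation per request: "messages" is one
# message list, so batching has to be handled by the client tool instead.
chat_body = {
    "model": "my-chat-model",                # hypothetical model name
    "messages": [{"role": "user", "content": "one prompt"}],
}

print(json.dumps(embeddings_body, indent=2))
print(json.dumps(chat_body, indent=2))
```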

TODO:

dyastremsky commented 3 weeks ago

It looks like OpenAI's chat completions API does not properly support client-side batching. See here: https://community.openai.com/t/batching-with-chatcompletion-endpoint/137723/2
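
To illustrate the limitation (model name and prompts are made up): putting several prompts into `messages` is read as the turns of one conversation, not as independent batch items.

```python
# Why "batching" through the chat endpoint does not work: the "messages"
# list is interpreted as the turns of ONE conversation, not as a batch of
# independent prompts. Model name and prompts are illustrative only.
chat_body = {
    "model": "my-chat-model",
    "messages": [
        {"role": "user", "content": "Prompt A"},
        {"role": "user", "content": "Prompt B"},  # treated as a follow-up turn
    ],
}
# The response holds a single completion (one entry in "choices"), so there
# are no per-prompt outputs for the client to de-batch.
```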

The legacy completions API does, but it's not worth updating the code just to accommodate that API at this time. I'll close out this PR.
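
For the record, the legacy endpoint's batching works like this (illustrative payload; the model name is a placeholder):

```python
# The legacy /v1/completions endpoint accepts "prompt" as a list and returns
# one choice per prompt, matched up by the "index" field in the response.
completions_body = {
    "model": "my-completions-model",     # hypothetical model name
    "prompt": ["Prompt A", "Prompt B"],  # client-side batch of 2
    "max_tokens": 16,
}
# Abridged response shape:
# {"choices": [{"index": 0, "text": "..."}, {"index": 1, "text": "..."}]}
```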