replicate / cog-triton

A cog implementation of Nvidia's Triton server
Apache License 2.0

Joe/lang 214 add mock cog triton concurrency test to cog triton directory #24

Closed joehoover closed 4 months ago

joehoover commented 4 months ago

I recently hit a regression in which increasing the concurrency of requests to a local cog-triton substantially degraded performance. This behavior had been observed and resolved previously, so the regression was unexpected, inexplicable, and deeply frustrating.

In general, we do not have sufficient visibility into the isolated performance of the various components in our system. Is performance degraded because of Triton, TRT-LLM, cog, predict.py, or some interaction between two or more of these components? Who knows!

This PR addresses that problem by adding /mock-cog-triton/ to the cog-triton repo. /mock-cog-triton/ includes the mocked predict.py that we've used to test cog performance.

Including it in cog-triton will make it easier for us to continuously validate cog performance and, in general, to isolate and observe the performance of the cog portion of our system.
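
For illustration only, here is a rough sketch of the kind of concurrency sweep a mocked predictor makes possible. It does not reproduce the contents of /mock-cog-triton/: `mock_predict`, `run_load`, `sweep`, and the 10 ms latency are all hypothetical names and values, and the mock simply sleeps instead of calling Triton or TRT-LLM. The idea is that, with the backend replaced by a fixed delay, any drop in throughput as concurrency rises has to come from the layer under test.

```python
import asyncio
import time


async def mock_predict(prompt: str, latency_s: float = 0.01) -> str:
    """Hypothetical stand-in for a mocked predict.py: sleeps instead of calling Triton."""
    await asyncio.sleep(latency_s)
    return f"echo: {prompt}"


async def run_load(num_requests: int, concurrency: int) -> dict:
    """Send num_requests to mock_predict with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def one_request(i: int) -> str:
        # The semaphore caps how many requests are outstanding at once.
        async with semaphore:
            return await mock_predict(f"request {i}")

    start = time.perf_counter()
    outputs = await asyncio.gather(*(one_request(i) for i in range(num_requests)))
    elapsed = time.perf_counter() - start
    return {
        "concurrency": concurrency,
        "completed": len(outputs),
        "elapsed_s": elapsed,
        "requests_per_s": num_requests / elapsed,
    }


def sweep(num_requests: int = 16, levels=(1, 4, 16)) -> list:
    """Run the same workload at each concurrency level and collect the stats."""
    return [asyncio.run(run_load(num_requests, c)) for c in levels]


if __name__ == "__main__":
    for row in sweep():
        print(row)
```

With a healthy mock, throughput should rise (or at least not fall) as the concurrency level increases; a real test would swap `mock_predict` for HTTP requests to the running cog server.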

This PR: