This PR:

- Adds `mock_cog_triton/`, which includes a mocked cog-triton `predict.py` that emits tokens at a specified rate. This is useful for performance testing cog. The directory also includes a version of `test_perf.py` that works with the mock server and emits useful information, such as server-side (within `predict.py`) vs. client-side performance metrics.
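The mocked `predict.py` could pace token emission along these lines (a minimal sketch; the function name and rate parameter are hypothetical, not the actual mock_cog_triton implementation):

```python
import time
from typing import Iterable, Iterator


def emit_tokens(tokens: Iterable[str], tokens_per_second: float) -> Iterator[str]:
    """Yield tokens at a fixed rate to simulate a model server.

    Sleeps until each token's scheduled emission time rather than a
    fixed sleep per token, so drift doesn't accumulate.
    """
    interval = 1.0 / tokens_per_second
    start = time.monotonic()
    for i, tok in enumerate(tokens):
        target = start + i * interval
        delay = target - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        yield tok
```

A client-side harness can then time the gaps between yielded tokens and compare them against the configured rate.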
- Updates our Triton inference implementation to use `/ensemble/` instead of `/tensorrt_llm_bls/` as the request entrypoint. We do this because BLS performance is unreliable and seems to randomly introduce substantial latency to requests. To do this, this PR:
  - Moves token accumulation and decoding into `predict.py`.
  - Updates the preprocessor `model.py` in `triton_templates` and `triton_model_repo` so that it returns `output_ids` instead of strings.
  - Updates output type expectations in the template and model repo configs for the ensemble and preprocessor.
  - Adds an env var option, `LOG_PERFORMANCE_METRICS`, such that, if `True`, server-side performance metrics are logged after all tokens are generated.
  - Updates `scripts/test_perf.py` to parse request logs, extract optional server-side metrics, and include them in the generated performance report.
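The env-var-gated metrics logging could look roughly like this (a hedged sketch, not the PR's actual code; the function name, metric fields, and `PERFORMANCE_METRICS:` log prefix are assumptions):

```python
import json
import os
import time


def log_performance_metrics(n_tokens: int, start_time: float, end_time: float):
    """After generation finishes, emit server-side metrics if enabled.

    Gated on the LOG_PERFORMANCE_METRICS env var so the logging is
    zero-cost in normal serving. Emits one JSON line so a log parser
    (e.g. test_perf.py) can extract it with a simple prefix match.
    """
    if os.getenv("LOG_PERFORMANCE_METRICS", "false").lower() != "true":
        return None
    elapsed = end_time - start_time
    metrics = {
        "n_tokens": n_tokens,
        "total_seconds": round(elapsed, 4),
        "tokens_per_second": round(n_tokens / elapsed, 2) if elapsed > 0 else None,
    }
    print("PERFORMANCE_METRICS:", json.dumps(metrics))
    return metrics
```

Keeping the metrics on a single prefixed JSON line is one way to let `test_perf.py` pull the optional server-side numbers out of request logs without a stateful parser.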