This PR:

- Adds `mock_cog_triton/`, which includes a mocked cog-triton `predict.py` that emits tokens at a specified rate. This is useful for performance testing cog. The directory also includes a version of `test_perf.py` that works with the mock server and emits useful information, such as server-side (within `predict.py`) vs. client-side performance metrics.
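The mocked `predict.py` could pace token emission along these lines (a minimal sketch; the function name and rate parameter are hypothetical, not the actual mock_cog_triton implementation):

```python
import time
from typing import Iterable, Iterator


def emit_tokens(tokens: Iterable[str], tokens_per_second: float) -> Iterator[str]:
    """Yield tokens at a fixed rate to simulate a model server.

    Sleeps until each token's scheduled emission time rather than a
    fixed sleep per token, so drift doesn't accumulate.
    """
    interval = 1.0 / tokens_per_second
    start = time.monotonic()
    for i, tok in enumerate(tokens):
        target = start + i * interval
        delay = target - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        yield tok
```

A client-side harness can then time the gaps between yielded tokens and compare them against the configured rate.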
- Updates our Triton inference implementation to use `/ensemble/` instead of `/tensorrt_llm_bls/` as the request entrypoint. We do this because BLS performance is unreliable and seems to randomly introduce substantial latency to requests. To do this, this PR:
  - Moves token accumulation and decoding into `predict.py`.
  - Updates the preprocessor `model.py` in `triton_templates` and `triton_model_repo` so that it returns `output_ids` instead of strings.
  - Updates output type expectations in the template and model repo configs for the ensemble and preprocessor.
  - Adds an env var option, `LOG_PERFORMANCE_METRICS`, such that, if `True`, server-side performance metrics are logged after all tokens are generated.
  - Updates `scripts/test_perf.py` to parse request logs, extract optional server-side metrics, and include them in the generated performance report.
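The env-var-gated metrics logging could look roughly like this (a hedged sketch, not the PR's actual code; the function name, metric fields, and `PERFORMANCE_METRICS:` log prefix are assumptions):

```python
import json
import os
import time


def log_performance_metrics(n_tokens: int, start_time: float, end_time: float):
    """After generation finishes, emit server-side metrics if enabled.

    Gated on the LOG_PERFORMANCE_METRICS env var so the logging is
    zero-cost in normal serving. Emits one JSON line so a log parser
    (e.g. test_perf.py) can extract it with a simple prefix match.
    """
    if os.getenv("LOG_PERFORMANCE_METRICS", "false").lower() != "true":
        return None
    elapsed = end_time - start_time
    metrics = {
        "n_tokens": n_tokens,
        "total_seconds": round(elapsed, 4),
        "tokens_per_second": round(n_tokens / elapsed, 2) if elapsed > 0 else None,
    }
    print("PERFORMANCE_METRICS:", json.dumps(metrics))
    return metrics
```

Keeping the metrics on a single prefixed JSON line is one way to let `test_perf.py` pull the optional server-side numbers out of request logs without a stateful parser.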