replicate / cog-triton

A cog implementation of Nvidia's Triton server
Apache License 2.0

a basic check at startup #45

Closed technillogue closed 2 months ago

technillogue commented 2 months ago

we want to make it easier to roll out new versions en masse. there are different ways we could check that a new version works, either from the outside or as part of CI, but they're hard to get right. if setup fails whenever there is an error or a broken node, that failure can trigger a deployment rollback.
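a minimal sketch of the idea, not the code in this PR: run one tiny inference during setup and raise if it errors, times out, or returns nothing, so that setup itself fails. `startup_check`, `StartupCheckError`, the `generate` callable, the prompt, and the 30 second timeout are all hypothetical names and values chosen for illustration.

```python
import concurrent.futures


class StartupCheckError(RuntimeError):
    """Raised when the startup check fails, so that setup itself fails."""


def startup_check(generate, prompt="hello", timeout_s=30.0):
    # `generate` is a hypothetical callable (prompt -> generated text);
    # in practice it would wrap a single request to the inference server.
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(generate, prompt)
        try:
            output = future.result(timeout=timeout_s)
        except concurrent.futures.TimeoutError as exc:
            raise StartupCheckError(f"no response within {timeout_s}s") from exc
        except Exception as exc:
            raise StartupCheckError(f"inference raised: {exc!r}") from exc
    finally:
        # Do not block on a hung request; the caller is about to fail anyway.
        pool.shutdown(wait=False)
    if not isinstance(output, str) or not output.strip():
        raise StartupCheckError("inference returned empty output")
    return output
```

calling something like `startup_check(...)` at the end of the predictor's setup would let the exception propagate, which is what marks the boot as failed. it only checks that the node can produce some output in time; it says nothing about output quality.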

I considered making this a more thorough test. ideally, we would test performance and lack of errors at max batch size and input length, and check for correctness degradations by running MMLU. but even limiting the check to MMLU, I couldn't find a single MMLU question that llama-2-7b would reliably get right: https://replicate.com/p/jzztsk82w1rgm0cg36htj7bj08.

we could expand on this further by having an external service that answers "is this the first time this version has booted?". if so, setup would run a full MMLU + stress test check, store the result in the external service, and alert us if it fails. in that situation we could use GITHUB_ACTIONS_RUN_ID to identify the version (set by official models CI), and could compare against the previous MMLU result to detect quality regressions in a model-agnostic way.
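a rough sketch of how that first-boot flow could look, under heavy assumptions: `InMemoryStore` stands in for the external service (which does not exist yet), `run_full_eval` and `alert` are hypothetical callables, and `max_drop` is a made-up tolerance. only `GITHUB_ACTIONS_RUN_ID` comes from the description above. the stress test is left out to keep it short.

```python
import os


class InMemoryStore:
    """Stand-in for the external service; a real one would be a network API."""

    def __init__(self):
        self.results = {}  # version id -> MMLU score, in insertion order

    def get(self, version):
        return self.results.get(version)

    def latest_other(self, version):
        # Most recently stored score for any version other than this one.
        others = [s for v, s in self.results.items() if v != version]
        return others[-1] if others else None

    def put(self, version, score):
        self.results[version] = score


def first_boot_check(store, run_full_eval, alert, max_drop=0.02, env=os.environ):
    version = env.get("GITHUB_ACTIONS_RUN_ID")
    if version is None or store.get(version) is not None:
        return None  # unknown version, or this version was already checked
    previous = store.latest_other(version)
    score = run_full_eval()  # hypothetical: returns MMLU accuracy in [0, 1]
    store.put(version, score)
    # Comparing against the previous result avoids a per-model threshold.
    if previous is not None and score < previous - max_drop:
        alert(f"version {version}: MMLU {score:.3f} vs previous {previous:.3f}")
    return score
```

because the comparison is relative to the last stored score, the same check would apply to any model without hand-picking questions or absolute thresholds.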

I believe the timeouts we were seeing were specific to the removed nodes and should no longer be a problem, but a check that runs on every setup still helps us notice if that problem ever comes back.

technillogue commented 2 months ago

discussed with @joehoover