Open soulseen opened 3 weeks ago
Hi @soulseen, this probably depends on the framework/model more than Triton generically.
Loading new weights may introduce some cold start penalty, depending on what the backend/framework does internally at runtime. Different weights can also lead to different compute paths or kernels in some models (for example, ONNX has conditional ops that can branch in the model execution graph), so it's hard to answer generically for all models/frameworks.
The general aspects of the GPU and CUDA-related runtimes being "warmed up", on the other hand, would likely be more agnostic to the weights getting updated.
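Since the answer depends on the backend, one way to find out for a specific model is to measure it: reload the model, then compare the latency of the first request against the following ones. Below is a minimal sketch of that, not a definitive recipe. It assumes Triton's HTTP endpoint is at `http://localhost:8000` and uses a placeholder model name `my_model`. The helper names (`build_load_request`, `time_calls`, `cold_start_ratio`) are made up for illustration, and `infer_fn` stands for whatever callable sends one inference request to your model.

```python
import json
import statistics
import time
import urllib.request


def build_load_request(base_url, model_name):
    """Build the POST request for Triton's model repository load endpoint."""
    url = f"{base_url.rstrip('/')}/v2/repository/models/{model_name}/load"
    return urllib.request.Request(
        url,
        data=json.dumps({}).encode("utf-8"),  # empty JSON body: no load parameters
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def time_calls(infer_fn, n=10):
    """Call infer_fn n times and return the per-call latencies in seconds."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        infer_fn()
        latencies.append(time.perf_counter() - start)
    return latencies


def cold_start_ratio(latencies):
    """First-call latency divided by the median of the remaining calls."""
    return latencies[0] / statistics.median(latencies[1:])


# Usage against a running server (assumed URL and model name, not from the issue):
#   req = build_load_request("http://localhost:8000", "my_model")
#   urllib.request.urlopen(req).close()       # reload after swapping the weights
#   ratio = cold_start_ratio(time_calls(infer_fn))
```

A ratio close to 1 right after the reload suggests no noticeable cold start for that model/backend. A ratio well above 1 suggests that sending a few warmup requests after each reload is still worthwhile.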
Description
This is the Triton startup command
I am using the --model-control-mode=explicit mode. When I update the weight files in the same model directory and version, and then reload the model using the load API, do I need to perform warmup again?
Expected behavior
I hope I don't need to warm up the model again.