triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html

Do I need to warm up the model again after reloading it? #7762

Open soulseen opened 3 weeks ago

soulseen commented 3 weeks ago

Description
This is the Triton startup command:

tritonserver --log-verbose=6 --log-info=true --log-warning=true --log-error=true --strict-model-config=false --http-port=5000 --model-repository=/models --load-model=* --model-control-mode=explicit

I am using --model-control-mode=explicit. When I update the weight files in the same model directory and version and then reload the model via the load API, do I need to perform warmup again?

Expected behavior
I hope I don't need to warm up the model again.
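For reference, this is roughly the reload flow being described, sketched with the tritonclient Python package against the HTTP endpoint from the startup command (--http-port=5000). The model name my_model is a placeholder for the directory name under /models:

```python
import tritonclient.http as httpclient

# Connect to the HTTP endpoint configured with --http-port=5000.
client = httpclient.InferenceServerClient(url="localhost:5000")

MODEL = "my_model"  # placeholder: use the model directory name under /models

# With --model-control-mode=explicit, swapping the weight files on disk has no
# effect until the model is explicitly (re)loaded through the repository API.
client.load_model(MODEL)

# Confirm the reloaded model is serving before sending traffic to it.
assert client.is_model_ready(MODEL)
```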

rmccorm4 commented 1 week ago

Hi @soulseen, this probably depends more on the framework/model than on Triton itself.

Loading new weights may introduce some cold-start penalty depending on what the backend/framework is doing internally at runtime. Differences in weights can also lead to different compute paths or kernels in some models (for example, ONNX has conditional ops that can branch in the model's execution graph), so it's hard to answer this generically for all models/frameworks.

As for the more general GPU and CUDA-related runtime state being "warmed up", that should be largely agnostic to the weights being updated.
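If a cold-start penalty after a reload does turn out to matter for your backend, one option is to fire a throwaway request right after load_model returns, before putting the model back in the serving path. A rough sketch, assuming a single FP32 input named INPUT0 of shape [1, 16] (both hypothetical; match your model's actual input signature):

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:5000")
MODEL = "my_model"  # placeholder

# Dummy tensor matching the (assumed) model input signature.
warmup = httpclient.InferInput("INPUT0", [1, 16], "FP32")
warmup.set_data_from_numpy(np.zeros((1, 16), dtype=np.float32))

# A couple of throwaway inferences to trigger any lazy initialization
# (kernel selection, workspace allocation, JIT, etc.) after the reload.
for _ in range(2):
    client.infer(MODEL, inputs=[warmup])
```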