triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html

Model loading and unloading results in slower model inference on other GPUs #6443

Open sunkenQ opened 11 months ago

sunkenQ commented 11 months ago

Description

I have a very specific use case: between 09:00:00 and 10:00:00, around 300 models are loaded by a scheduled task. Each model performs inference once, between 30 seconds and 1 minute after it is loaded, and is then unloaded. I found that models receiving inference requests were slowed down because other models were being loaded/unloaded at the same time, even when the model receiving inference requests and the model being loaded/unloaded were on different GPUs.

Example: on the pytorch_libtorch platform, a model on GPU 1 takes 600 ms to perform inference in shared CUDA memory mode while Model B is being loaded onto GPU 2. Without any model loading/unloading, the same inference takes only 120 ms.

Triton Information

docker: nvcr.io/nvidia/tritonserver:23.07-py3
driver: NVIDIA-SMI 535.86.10, Driver Version: 535.86.10, CUDA Version: 12.2
GPUs: 6 x NVIDIA A100 80GB PCIe

To Reproduce

Start two processes: the first continuously runs inference on models already loaded into the Triton server, and the other loads and unloads other models every 3 seconds (see the sketch below). This happens on both the Libtorch and TensorRT platforms, although the impact on inference is smaller with TensorRT.
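As a rough illustration of that setup (not the reporter's actual scripts), the two processes could look something like the following, assuming the server was started with --model-control-mode=explicit, and using hypothetical model names model_a (already loaded) and model_b, a made-up input name/shape, and the default HTTP endpoint:

```python
# Sketch of the reproduction described above. Names, shapes, and the endpoint
# are assumptions, not taken from the report. Requires: pip install tritonclient[http] numpy
import time
import numpy as np
import tritonclient.http as httpclient

URL = "localhost:8000"  # assumed default Triton HTTP endpoint

def inference_loop():
    """Process 1: continuously run inference on a model that is already loaded."""
    client = httpclient.InferenceServerClient(url=URL)
    x = np.random.rand(1, 128).astype(np.float32)          # assumed input shape
    inp = httpclient.InferInput("INPUT__0", list(x.shape), "FP32")
    inp.set_data_from_numpy(x)
    while True:
        start = time.perf_counter()
        client.infer("model_a", inputs=[inp])               # model under test
        print(f"latency: {(time.perf_counter() - start) * 1e3:.1f} ms")

def load_unload_loop():
    """Process 2: load and unload another model every 3 seconds."""
    client = httpclient.InferenceServerClient(url=URL)
    while True:
        client.load_model("model_b")                        # placed on a different GPU
        time.sleep(3)
        client.unload_model("model_b")
        time.sleep(3)
```

Running inference_loop and load_unload_loop in two separate processes (or terminals) and watching the printed latency should make the slowdown during load/unload visible.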

I only used a very simple 5-layer MLP model, and I think this problem can be replicated on any model.
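For reference, a 5-layer MLP of the kind mentioned could be exported for the pytorch_libtorch backend roughly as below; the layer sizes and file layout are illustrative assumptions, not the reporter's actual model:

```python
# Minimal 5-layer MLP exported as TorchScript for the libtorch backend.
# Dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 10),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Script the model and save it where the Triton model repository expects it,
# e.g. <model_repository>/model_a/1/model.pt
scripted = torch.jit.script(MLP().eval())
scripted.save("model.pt")
```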

Expected behavior

I would like model loading and unloading not to affect inference on other models, or at least not to affect inference when the loading/unloading happens on different GPUs.

jbkyang-nvi commented 10 months ago

Thanks for the suggestion @sunkenQ. @GuanLuo, thoughts?