Description
I have a very specific use case where, between 09:00:00 and 10:00:00, around 300 models are loaded according to a scheduled task. Each model performs inference once, 30 seconds to 1 minute after it is loaded, and is then unloaded.
I found that inference requests were slowed down while other models were loading/unloading at the same time, even when the model serving the requests and the model being loaded/unloaded were on different GPUs.
Example: on the pytorch_libtorch platform, Model A (GPU 1) takes 600 ms per inference in CUDA shared memory mode while Model B is being loaded onto GPU 2. Without concurrent model loading/unloading, the same inference takes only 120 ms.
Triton Information
docker: nvcr.io/nvidia/tritonserver:23.07-py3
driver: NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2
GPUs: 6 x NVIDIA A100 80GB PCIe
To Reproduce
Start two processes: the first continuously sends inference requests to models already loaded into Triton Server, and the second loads and unloads other models every 3 seconds. This happens on both the LibTorch and TensorRT platforms, though the impact on inference is smaller with TensorRT.
I used only a very simple 5-layer MLP model, and I believe this problem can be reproduced with any model.
Expected behavior
I would expect model loading and unloading not to affect inference on other models, or at least not when the loading/unloading happens on different GPUs.