triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Tritonserver may load a model multiple times #7058

Open vonchenplus opened 3 months ago

vonchenplus commented 3 months ago

Description: When tritonserver is run with lazy model loading (--model-control-mode=explicit) and serves the llava-mixtral-8x7b model, a client call to load_model sometimes triggers the server to load the same model multiple times.
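For context, the concurrent-load scenario from the client side looks roughly like the sketch below. This is a minimal illustration, not the reporter's actual script: the server URL, the process count, and the readiness check are assumptions; `tritonclient.grpc` is the official Triton Python client package.

```python
# Minimal repro sketch: several processes race to load the same model.
import multiprocessing

import tritonclient.grpc as grpcclient

MODEL_NAME = "llava-mixtral-8x7b"  # model name taken from this issue


def load_from_one_process(proc_id):
    # URL is an assumption; 8001 is Triton's default gRPC port.
    client = grpcclient.InferenceServerClient(url="localhost:8001")
    # With --model-control-mode=explicit, each process asks the server
    # to load the model; the expectation is that concurrent requests
    # for the same model are deduplicated into a single load.
    client.load_model(MODEL_NAME)
    print(f"process {proc_id}: model ready = {client.is_model_ready(MODEL_NAME)}")


if __name__ == "__main__":
    procs = [
        multiprocessing.Process(target=load_from_one_process, args=(i,))
        for i in range(4)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```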

Triton Information nvcr.io/nvidia/tritonserver:23.08-py3

To Reproduce

  1. Start tritonserver with --model-control-mode=explicit.
  2. Create gRPC clients and call load_model concurrently from multiple processes (see the client sketch above).

The model uses the Python backend and loads llava-mixtral-8x7b in its initialize method; a sketch of that shape follows below.
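For reference, a Python-backend `model.py` that does its heavy load in `initialize` has roughly this shape. This is a hedged sketch, not the reporter's code: the weight-loading call is hypothetical, and `triton_python_backend_utils` is only importable inside the Triton server process.

```python
import json

# Provided by the Triton Python backend at runtime; not pip-installable.
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # Called once per model instance when the server loads the model.
        # If the server erroneously issues the load twice, this heavy
        # initialization runs twice and the weights are loaded twice.
        self.model_config = json.loads(args["model_config"])
        # self.model = load_llava_mixtral_weights()  # hypothetical heavy load

    def execute(self, requests):
        # Inference logic omitted; this issue concerns initialize() only.
        return [pb_utils.InferenceResponse(output_tensors=[]) for _ in requests]
```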

Expected behavior: the model should be loaded only once.

The error log is attached: triton_server.log

nnshah1 commented 3 months ago

@vonchenplus would it be possible to confirm this with 24.03 release?

vonchenplus commented 3 months ago

> @vonchenplus would it be possible to confirm this with 24.03 release?

Hello @nnshah1, I still have the same problem with 24.02.

Again, the error log is attached: triton_server.log

nnshah1 commented 3 months ago

Thanks for the confirmation - will try to reproduce