triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Questions about model instances and dynamic batch when setting model concurrency #5579

Open YJHMITWEB opened 1 year ago

YJHMITWEB commented 1 year ago

Hi, I'd like to know, for example, when enabling model concurrency = 2, does Tritonserver run 2 streams for processing requests? And conversely, if using dynamic batching, is there just 1 stream, with all requests packed into one batch? And how are model instances related to them?

rmccorm4 commented 1 year ago

Hi @YJHMITWEB,

  1. Can you elaborate on what setting you are referring to when you say model concurrency = 2?
  2. For dynamic batching, Triton will pack batchable requests that arrive within a certain window into a larger batched request before sending it to a model instance.
  3. Increasing the model instance count allows you to serve more requests concurrently; for example, while instance1 is busy computing a response, instance2 is available to take the next request.

I suggest reading through the documentation on these topics, as it contains detailed explanations of each of them. This recent tutorial also goes into much detail.
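For reference, here is a minimal config.pbtxt sketch (the model and values are placeholders, but the field names are standard Triton model-configuration options) showing where instance count and dynamic batching are set on the server side:

    # config.pbtxt for a hypothetical model
    max_batch_size: 8

    # Two copies of the model are loaded; each can serve a request concurrently.
    instance_group [
      {
        count: 2
        kind: KIND_GPU
      }
    ]

    # Requests arriving within the queue-delay window are merged into one batch
    # before being handed to a model instance.
    dynamic_batching {
      max_queue_delay_microseconds: 100
    }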

YJHMITWEB commented 1 year ago

Hi @rmccorm4 ,

Thanks for the reply. I am referring to the model concurrency setting we can use when running perf_analyzer with the Triton client. I am wondering, for example, when setting it to 2, what exactly happens in the backend? Does the server launch 2 different streams to handle each request separately? Or does the server instantiate 2 model instances on the GPU?

lllloda commented 1 year ago

@YJHMITWEB I think with perf_analyzer, setting concurrency to 2 relates only to the client, not the server.

rmccorm4 commented 1 year ago

Hi @YJHMITWEB, @lllloda is correct: perf_analyzer --concurrency 2 specifies that perf_analyzer will send requests from 2 threads in parallel. By default (concurrency=1), perf_analyzer sends requests from a single thread and won't send the next request until the previous response is returned.
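For illustration, a sketch of a perf_analyzer invocation at concurrency 2 (the model name is hypothetical; here the level is passed with the --concurrency-range flag):

    # Two client-side threads each keep one outstanding request in flight.
    perf_analyzer -m my_model --concurrency-range 2:2

This only changes how many requests the client keeps outstanding; whether the server batches or queues them is governed by the model's configuration (instance_group, dynamic_batching).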

YJHMITWEB commented 1 year ago

Hi @rmccorm4 @lllloda, thanks a lot for the information. I am still a little bit confused about the following: 1) In your previous answer, what exactly is a model instance? Is it a tritonserver stream? Say I set the model instance count to 2, are there then 2 streams running at the same time? And do both streams share the same model weights?

2) In the scenario where client concurrency is 2 and perf_analyzer sends 2 requests in parallel, what is the behavior on the tritonserver side? Will it use dynamic batching to combine the 2 requests? Or will it queue them up? I also noticed that perf_analyzer has an option --async, and the description is:

--async (-a): Enables asynchronous mode in perf_analyzer. By default,
         perf_analyzer will use synchronous API to request inference. However, if
         the model is sequential then default mode is asynchronous. Specify
         --sync to operate sequential models in synchronous mode. In synchronous
         mode, perf_analyzer will start threads equal to the concurrency
         level. Use asynchronous mode to limit the number of threads, yet
         maintain the concurrency.

Does sequential here mean that this is an ensemble model with preprocessing and postprocessing? And will requests be fed into tritonserver without waiting for previous ones to finish?