Open: YJHMITWEB opened this issue 1 year ago
Hi @YJHMITWEB,

model concurrency = 2

I suggest reading through the documentation on these topics, as it contains detailed explanations of them. This recent tutorial also goes into much detail.
Hi @rmccorm4,
Thanks for the reply. I am referring to using perf_analyzer with the Triton client, where we can set the model concurrency. I am wondering, for example, when setting it to 2, what exactly happens in the backend? Does the server launch 2 different streams to handle each request separately? Or does the server instantiate 2 model instances on the GPU?
@YJHMITWEB I think with perf_analyzer, setting concurrency to 2 relates only to the client, not the server.
Hi @YJHMITWEB, @lllloda is correct. perf_analyzer --concurrency 2
specifies that perf_analyzer will send requests from 2 threads in parallel. By default (concurrency=1), perf_analyzer sends requests from a single thread, and won't send the next request until the previous response has been returned.
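To make that concrete, here is a minimal sketch (not perf_analyzer itself) of what client-side concurrency means: each concurrency slot is an independent loop that blocks on its own response before sending the next request, and the slots run in parallel threads. `send_request` is a hypothetical stand-in for a blocking Triton inference call.

```python
# Sketch of client-side "concurrency": N independent request loops,
# each strictly sequential within its own thread.
import threading
import time

def send_request(model_name):
    """Hypothetical stand-in for a blocking Triton inference call."""
    time.sleep(0.01)  # pretend the server takes ~10 ms per request
    return f"response from {model_name}"

def request_loop(model_name, num_requests, results):
    # One "concurrency slot": waits for each response before
    # sending the next request.
    for _ in range(num_requests):
        results.append(send_request(model_name))

def run(concurrency=2, num_requests=5):
    results = []
    threads = [
        threading.Thread(target=request_loop,
                         args=("my_model", num_requests, results))
        for _ in range(concurrency)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

if __name__ == "__main__":
    print(len(run()))  # 2 threads x 5 requests = 10 responses
```

Note that nothing here changes on the server; the server simply sees up to 2 outstanding requests at a time.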
Hi @rmccorm4 @lllloda, thanks a lot for the information. I am still a little confused about the following:
1) In your previous answer, what exactly is a model instance? Is it a tritonserver stream? If I set the model instance count to 2, are there 2 streams running at the same time, and do both streams share the same model weights?
2) In the scenario where client concurrency is 2 and perf_analyzer sends 2 requests in parallel, what is the behavior on the tritonserver side? Will it use dynamic batching to combine the 2 requests, or will it queue them up? I also noticed that perf_analyzer has an option --async
, whose description is:
--async (-a): Enables asynchronous mode in perf_analyzer. By default,
perf_analyzer will use synchronous API to request inference. However, if
the model is sequential then default mode is asynchronous. Specify
--sync to operate sequential models in synchronous mode. In synchronous
mode, perf_analyzer will start threads equal to the concurrency
level. Use asynchronous mode to limit the number of threads, yet
maintain the concurrency.
Does sequential
here mean that this is an ensemble model with preprocessing and postprocessing? And will requests be fed into tritonserver without waiting for previous ones to finish?
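As an illustration of the distinction the --async help text draws, here is a hedged sketch (again, not perf_analyzer's actual code) of keeping 2 requests in flight from a single thread with asyncio, rather than dedicating one blocking thread per concurrency slot. `send_request_async` is a hypothetical stand-in for an async inference call.

```python
# Sketch of asynchronous mode: maintain a fixed number of in-flight
# requests from one thread, instead of one blocking thread per slot.
import asyncio

async def send_request_async(i):
    """Hypothetical stand-in for an async Triton inference call."""
    await asyncio.sleep(0.01)
    return f"response {i}"

async def run_async(concurrency=2, num_requests=6):
    results = []
    pending = set()
    i = 0
    # Keep exactly `concurrency` requests outstanding at all times.
    while i < num_requests or pending:
        while i < num_requests and len(pending) < concurrency:
            pending.add(asyncio.create_task(send_request_async(i)))
            i += 1
        done, pending = await asyncio.wait(
            pending, return_when=asyncio.FIRST_COMPLETED)
        results.extend(t.result() for t in done)
    return results

if __name__ == "__main__":
    print(len(asyncio.run(run_async())))  # all 6 responses collected
```

Both modes present the same request load to the server; the difference is only in how many client threads are used to sustain it.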
Hi, I'd like to know, for example, when enabling
model concurrency = 2
, does Tritonserver run 2 streams for processing requests? And versus that, if using dynamic batching, is there just 1 stream, with all requests packed into one batch? And how are model instances related to them?
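For reference, model instances and dynamic batching are two separate, combinable settings in the model's config.pbtxt: instance_group controls how many copies of the model can execute in parallel, while dynamic_batching controls whether queued requests are merged into one batch. A hypothetical fragment (model name and values are placeholders):

```
name: "my_model"
max_batch_size: 8
instance_group [
  {
    count: 2          # two instances of this model on one GPU
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
dynamic_batching {
  # wait up to 100 us for more requests to form a larger batch
  max_queue_delay_microseconds: 100
}
```

With this configuration the server can batch queued requests together and still run up to two batches concurrently, one per instance.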