triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

non stable inference response time with tensorflow saved model #7006

Open asaff1 opened 6 months ago

asaff1 commented 6 months ago

Is your feature request related to a problem? Please describe. I'm facing an issue where Triton server takes a long time to reach a stable response time. I'm using the Triton C++ API for this purpose, but the same issue exists with Triton server + perf analyzer. When running with a concurrency of 800 requests, it takes roughly 10 minutes of sending requests to the server until I get a stable response time of 50ms. Until then the response time fluctuates and can be above 1 second.
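For reference, a perf analyzer run along these lines shows the same behavior (the model name is a placeholder, and I'm assuming the default gRPC endpoint):

perf_analyzer -m inception_v3 -i grpc -u localhost:8001 -b 1 --concurrency-range 800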

I've noticed that when I do get high response times, the GPU utilization is high (see the nvtop chart).

After running requests for 10 minutes, the GPU utilization gets stable at around 25% and the response time is stable at 50ms as well (see the nvtop chart).

I assume that something must be happening on the server, since the client is sending requests with a constant throughput. How can I analyze this and understand what is going on?
I know that TensorFlow models have a warmup and usually the first inference takes some time. But in this case it is happening for a long duration, so I'm guessing it is something more complex than that. Maybe TensorFlow / Triton / CUDA / etc. does some optimization at runtime? I'd like to analyze what is happening. My goal is to get a fast response time without such a long "warmup".

Model config:

platform: "tensorflow_savedmodel"
max_batch_size: 128
dynamic_batching {
  max_queue_delay_microseconds: 40000
}

The model used here is an inception v3 model. GPU is A100.

Describe the solution you'd like A way to get a stable response time. How can I analyze what is happening under the hood?

Additional context My C++ code used for measuring looks like: https://github.com/triton-inference-server/server/issues/6949#issuecomment-1980710475 I've added time measurement around the call, from AsyncInfer until fut.get() returns. But as I said, perf analyzer running against tritonserver behaves the same.
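For illustration, a minimal sketch of that kind of measurement with the gRPC client library (not the exact code from the linked comment; the function name is made up and client/input/output setup is omitted):

#include <chrono>
#include <future>
#include <memory>
#include <vector>

#include "grpc_client.h"  // Triton client library header

namespace tc = triton::client;

// Returns the end-to-end latency of one request in milliseconds, measured
// from the AsyncInfer call until the future resolves with the result.
// Error checking is omitted for brevity.
double TimedInfer(
    tc::InferenceServerGrpcClient* client, const tc::InferOptions& options,
    const std::vector<tc::InferInput*>& inputs,
    const std::vector<const tc::InferRequestedOutput*>& outputs)
{
  std::promise<tc::InferResult*> prom;
  std::future<tc::InferResult*> fut = prom.get_future();

  const auto start = std::chrono::steady_clock::now();
  client->AsyncInfer(
      [&prom](tc::InferResult* result) { prom.set_value(result); },
      options, inputs, outputs);

  // fut.get() blocks until the completion callback delivers the result.
  std::unique_ptr<tc::InferResult> result(fut.get());
  const auto end = std::chrono::steady_clock::now();

  return std::chrono::duration<double, std::milli>(end - start).count();
}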

Thanks for the help.

indrajit96 commented 6 months ago

CC @kthui @rmccorm4

rmccorm4 commented 6 months ago

Hi @asaff1, thanks for the detailed description!

Warmup criteria and behavior can vary a bit with each framework. One suggestion I'd be interested in seeing the results of: can you try doing server-side warmup? That way, when PA or a client starts hitting the server, it is ideally already warmed up, or at least closer to it.

There are docs here: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_configuration.html#model-warmup

And you can see an example warmup configuration here: https://github.com/triton-inference-server/server/blob/a168d519514af3d778fb6a28b26eac8e578f765a/qa/L0_warmup/failing_infer/config.pbtxt#L44-L56

You can choose random data, or use an input data file that is more representative of data you'd expect to see at runtime in your use case: https://github.com/triton-inference-server/common/blob/00b3a71519e32e3bc954e9f0d067e155ef8f1a6c/protobuf/model_config.proto#L1721
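For a TensorFlow savedmodel like yours, a minimal warmup sketch could look roughly like the snippet below (the input name, shape, and data type are assumptions on my side and have to match your model's actual inputs; an optional count field can repeat the sample):

model_warmup [
  {
    name: "random_sample"
    batch_size: 1
    inputs {
      key: "input_1"            # assumed input name, must match your model
      value: {
        data_type: TYPE_FP32
        dims: [ 299, 299, 3 ]   # assumed inception v3 shape, without the batch dim
        random_data: true
      }
    }
  }
]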

asaff1 commented 6 months ago

Hi @rmccorm4, I've done some experiments with warmup. I tried setting 100 warmup requests with batch size = 1, yet it still takes a few minutes until the response time gets stable.

Another interesting thing to note: after the server is "warmed up" (i.e. after sending requests at a throughput of 800 images/sec for 6 minutes), even if I stop the client and the server (and the GPU) sits idle for a few hours, the next time the client starts the server is still "warmed up" and answers fast. Only if I stop and restart the triton server process do I need to do the warmup all over again.

So I can assume that the cause is somewhere in the software (TensorFlow, CUDA, Triton, etc.): something might be doing optimization at runtime, or has some lazily initialized parts. I'm looking for info about that.

rmccorm4 commented 6 months ago

Hi @asaff1, does batch_size=1 capture the types of requests you're expecting to see at runtime too? Or are you sending requests with greater batch sizes at runtime after the model has loaded? Warmup data shapes should try to capture runtime expectations as much as possible, as different shapes can follow different inference paths, CUDA kernels, etc., which may individually have their own warmups depending on per-framework details.

Another way to ask the question: after sending all of your 100 warmup requests for batch size 1, do you at least see stable response times for batch size 1? If not, is there a threshold of warmup requests (500, 1000, etc.) where you do see quicker stable response times? Does using random_data vs zero_data have a noticeable effect?

As you point out, these are generally framework/library-specific concepts, with the majority of the "cold start penalty" happening at the TensorFlow/CUDA level. CC @tanmayv25 if you have any more details/thoughts.

asaff1 commented 6 months ago

@rmccorm4 thanks for the detailed answer. Yes, I do see improvements depending on the warmup batch size. It would be great to have a more in-depth explanation of this. @tanmayv25

tanmayv25 commented 6 months ago

@asaff1 From the model configuration settings you have provided, it seems that you are using dynamic batching with max_batch_size set to 128. This means that, depending on the pending request count, Triton core can send request batches of any size in [1, 128] to the TensorFlow session for execution. Each TensorFlow model consumes some memory resources for holding the model weights, and dynamically allocates extra memory into the memory pool for the tensors depending on their shape (including batch size).

I am assuming that it is taking you longer to get to the stable value because of the random batch sizes of the requests being forwarded to the TF model.

My recommendation would be to set the warmup batch_size to 128 and send realistic data (some models have data-dependent output shapes) as the warmup sample. This would ensure that the resource pool is completely populated to handle requests with such a large batch size. You can also try sending 5 warmup requests.
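As a rough sketch (the input name, shape, and file name below are placeholders; from what I recall, the sample file is expected as raw binary under the model's warmup/ sub-directory):

model_warmup [
  {
    name: "max_batch_sample"
    batch_size: 128
    count: 5
    inputs {
      key: "input_1"                       # placeholder, must match the model's input name
      value: {
        data_type: TYPE_FP32
        dims: [ 299, 299, 3 ]              # placeholder shape, without the batch dim
        input_data_file: "warmup_sample"   # raw binary sample in the model's warmup/ directory
      }
    }
  }
]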