triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

TensorRT model low throughput #6978

Open rs-ixz opened 6 months ago

rs-ixz commented 6 months ago

Description
When CUDA shared memory is used with the HTTP/GRPC protocol, the client is expected to allocate CUDA memory on one of the devices and copy the data into it. On systems with multiple GPUs (in my case, a machine with ~20 GPUs), how is the client recommended to copy the data to the right device for best performance, considering that the Triton server handles scheduling of jobs between the GPUs? If Triton decides to execute the inference request on a different GPU, would there be a significant penalty in copying the data over? How can this be avoided?

From perf_analyzer experiments we notice that the perf_analyzer client seems to copy the data to GPU 0, and it appears that the Triton server internally copies the data onto the right device for execution (we can see that all 20 GPUs are being used when checking utilization with nvidia-smi). This aligns with the increased latency of the "Compute Input" step in the perf analysis when executed with 20 GPUs. How can we maximize throughput with perf_analyzer while running TensorRT models + CUDA shared memory on multiple devices?
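For reference, the client-side flow we use roughly follows the Triton Python client CUDA shared-memory example; a minimal sketch (the region name, data values, and output handling here are illustrative, only the model/tensor names match my setup):

import numpy as np
import tritonclient.http as httpclient
import tritonclient.utils.cuda_shared_memory as cudashm

client = httpclient.InferenceServerClient(url="localhost:8000")

# Input tensor for one request (2048x2048x1 FP32 = 16 MB).
input_data = np.random.rand(2048, 2048, 1).astype(np.float32)
byte_size = input_data.nbytes

# Allocate a CUDA buffer on device 0 and copy the input into it.
shm_handle = cudashm.create_shared_memory_region("input_region", byte_size, 0)
cudashm.set_shared_memory_region(shm_handle, [input_data])

# Register the region with Triton so it can read the input directly from GPU 0.
client.register_cuda_shared_memory(
    "input_region", cudashm.get_raw_handle(shm_handle), 0, byte_size)

infer_input = httpclient.InferInput("inputimage", list(input_data.shape), "FP32")
infer_input.set_shared_memory("input_region", byte_size)
result = client.infer("my_trt_model", inputs=[infer_input])

# Cleanup.
client.unregister_cuda_shared_memory("input_region")
cudashm.destroy_shared_memory_region(shm_handle)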

TensorRT + Cuda Shared Mem

perf_analyzer -m my_trt_model --shape inputimage:2048,2048,1 --measurement-interval 60000 -i HTTP --shared-memory cuda --output-shared-memory-size=16777216 --concurrency-range 20
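(For reference, --output-shared-memory-size=16777216 corresponds to a 2048x2048 FP32 tensor: 2048 x 2048 x 4 bytes = 16,777,216, assuming the output has the same spatial size as the input.)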

Request concurrency: 20
  Client:
    Request count: 65449
    Throughput: 302.954 infer/sec
    Avg latency: 65973 usec (standard deviation 15656 usec)
    p50 latency: 36162 usec
    p90 latency: 65274 usec
    p95 latency: 506917 usec
    p99 latency: 538550 usec
    Avg HTTP time: 65955 usec (send/recv 142 usec + response wait 65813 usec)
  Server:
    Inference count: 65449
    Execution count: 65449
    Successful request count: 65449
    Avg request latency: 65392 usec (overhead 57 usec + queue 108 usec + compute input 25039 usec + compute infer 32872 usec + compute output 7314 usec)

If system shared memory is used instead for the TensorRT model, we see a huge hit in both Compute Input and Compute Output timings, severely affecting throughput.

TensorRT + System Shared Mem

perf_analyzer -m my_trt_model --shape inputimage:2048,2048,1 --measurement-interval 60000 -i HTTP --shared-memory system --output-shared-memory-size=16777216 --concurrency-range 20

Request concurrency: 20
  Client:
    Request count: 18896
    Throughput: 87.4745 infer/sec
    Avg latency: 227984 usec (standard deviation 15816 usec)
    p50 latency: 53416 usec
    p90 latency: 1139611 usec
    p95 latency: 1394049 usec
    p99 latency: 1757626 usec
    Avg HTTP time: 227966 usec (send/recv 142 usec + response wait 227824 usec)
  Server:
    Inference count: 18896
    Execution count: 18896
    Successful request count: 18896
    Avg request latency: 227376 usec (overhead 55 usec + queue 1158 usec + compute input 112292 usec + compute infer 35535 usec + compute output 78335 usec)

The same issue is not seen with the TF backend and system shared memory for the same model, but TF does not give us the inference speed-up that TensorRT does.

TensorFlow + System Shared Mem

perf_analyzer -m my_tf_model --shape inputImage:2048,2048,1 --measurement-interval 60000 -i HTTP --shared-memory system --output-shared-memory-size=16777216 --concurrency-range 20

Request concurrency: 20
  Client:
    Request count: 81465
    Throughput: 377.05 infer/sec
    Avg latency: 53042 usec (standard deviation 3187 usec)
    p50 latency: 52593 usec
    p90 latency: 57166 usec
    p95 latency: 58773 usec
    p99 latency: 62349 usec
    Avg HTTP time: 53026 usec (send/recv 104 usec + response wait 52922 usec)
  Server:
    Inference count: 81465
    Execution count: 81465
    Successful request count: 81465
    Avg request latency: 52612 usec (overhead 54 usec + queue 2286 usec + compute input 3324 usec + compute infer 42926 usec + compute output 4022 usec)

Triton Information
Triton version: 24.01 (same behavior seen in older versions such as 23.07)
Driver: 545.23.08
CUDA: 12.3
GPU: T4

To Reproduce
Steps to reproduce the behavior:

Model used: a U-Net variant image segmentation model
Backend: TensorRT (compared against the TensorFlow backend)
Precision: FP16
Model config:

name: "my_trt_model", platform: "tensorrt_plan", backend: "tensorrt", input: [ { name: "inputimage", data_type: TYPE_FP32, format: FORMAT_NONE, dims: [ -1, -1, 1 ], is_shape_tensor: false, allow_ragged_batch: false } ], output: [ { name: "output", data_type: TYPE_FP32, dims: [ -1, -1 ], label_filename: "", is_shape_tensor: false } ] instance_group [ { count: 1, kind: KIND_GPU } ] version_policy: { all { }}

model_warmup: [{ name: "Warmup", batch_size: 1, inputs: { key: "inputimage", value: { data_type: TYPE_FP32, dims: [2048,2048,1], random_data: true } } } ]

lkomali commented 6 months ago

@jbkyang-nvi @tanmayv25 Any thoughts?

rs-ixz commented 6 months ago

@lkomali, @jbkyang-nvi, @tanmayv25, any chance someone has been able to look at this? Any recommendations would help us retire the risk of using TensorRT with Triton.

lkomali commented 6 months ago

@rs-ixz The team is looking into it.

tanmayv25 commented 6 months ago

@rs-ixz Triton does not provide an option for users to force a request to be executed on a specific model instance. Instead, Triton schedules each request to the next available instance. Incoming requests are held in queues within Triton core until a model instance becomes available. This is a simple way to maximize throughput: the execution time of each inference can vary, so this reactive approach ensures that the model instances running on each GPU are not starving and keep executing requests as long as there are any in the queue.

D2D copies across different GPUs do incur some latency cost, but it won't be as expensive as the H2D memory copies needed with system shared memory. Additionally, there is no efficient way to keep all model instances busy if each request is pinned to a specific instance. We might be able to avoid a cross-device memory copy if we let the client handle data and request placement; however, this could lead to instance starvation.
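For context, instance placement is configured per model via instance_group: with kind: KIND_GPU and no gpus field (as in the config above), Triton creates the given count of instances on every visible GPU. A sketch that instead pins instances to specific devices (GPU indices illustrative) would look like:

instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  }
]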

rs-ixz commented 6 months ago

Thanks @tanmayv25, your response seems to suggest that the throughput we see with CUDA shared memory is as expected.

My overarching problem is this: TensorRT models are not able to achieve high throughput even though their GPU inference is faster than TensorFlow's:

  1. System shared memory --> TensorFlow backend behaves well for all input sizes.

  2. System shared memory --> TensorRT backend suffers from very high compute input / compute output timings for large inputs (2048x2048x1), making it ~3x slower than TensorFlow in throughput. For smaller inputs (400x400x1), TensorRT outperforms the TensorFlow backend (~3x faster throughput).

  3. CUDA shared memory --> TensorRT backend suffers from high compute input timing, and throughput only matches TensorFlow at best. We lose the throughput gain we saw for small inputs with TensorRT in (2).

We would like to get the benefits of TensorRT (faster inference), but we are presently limited by either system shared memory being slow for large inputs, or CUDA shared memory being slow overall and only matching TensorFlow numbers across input sizes.

Please let me know if any more details about the above will help address the issue.

tanmayv25 commented 6 months ago

Adding @Tabrizian, who was investigating implicitly selecting the model instance based on input tensor data locality.

CUDA shared memory --> TensorRT backend suffers from high compute input timing, and throughput only matches TensorFlow at best. We lose the throughput gain we saw for small inputs with TensorRT in (2).

The scale of 20 GPUs might be saturating the inter-device bandwidth. Could this be a use case for resuming work on that feature?

@rs-ixz Can you share the perf_analyzer numbers for Tensorflow + Cuda Shared Memory as well?

rs-ixz commented 6 months ago

@tanmayv25, here are the perf_analyzer numbers for TF + CUDA shm:

Request concurrency: 20
  Client:
    Request count: 61919
    Throughput: 286.613 infer/sec
    Avg latency: 69775 usec (standard deviation 7604 usec)
    p50 latency: 64728 usec
    p90 latency: 91601 usec
    p95 latency: 108960 usec
    p99 latency: 134397 usec
    Avg HTTP time: 69760 usec (send/recv 99 usec + response wait 69661 usec)
  Server:
    Inference count: 61919
    Execution count: 61919
    Successful request count: 61919
    Avg request latency: 69369 usec (overhead 42 usec + queue 7150 usec + compute input 13990 usec + compute infer 42697 usec + compute output 5490 usec)

Following is a summary plot of the four combinations (TF/TRT + CUDA/system shm). We can clearly see that TRT + sys shm is best for smaller inputs, and for larger ones TF + sys shm is the best.

[attached: summary plots of the four combinations]
tanmayv25 commented 6 months ago

Hi @rs-ixz, thanks for sharing your observations and sorry for the delayed response.

TensorRT models are not able to get high throughput although inference on GPU is faster than TensorFlow

We definitely want to bridge this gap. From the numbers you have shared, there seems to be a bottleneck in preparing the input data and collecting the output data from the results.

@rs-ixz I am going to create a ticket for investigation within the team. Meanwhile, can you share the models and the exact steps for us to reproduce the issue? Additionally, can you reproduce the issue on a system with a smaller number of GPUs, such as 8? Is there a threshold on the number of GPUs at which this issue appears?

rs-ixz commented 5 months ago

@tanmayv25, thanks for getting back on this.

From the numbers that you have shared, there seems to be a bottleneck in preparing the input data and collecting the output data from the results.

Is this a comment about the TensorRT backend with system shared memory, with CUDA shared memory, or both? Just to reiterate: system shared memory is our current baseline and performs well even with TensorRT, except for large inputs. CUDA shared memory is not great at any input size, probably due to inter-device communication and the fact that we use 20 GPUs.

Meanwhile, can you share the models and the exact steps for us to reproduce the issue?

I was able to reproduce the problem with an off-the-shelf U-Net model. Attached is an archive containing the models and perf_analyzer results.

Here is the plot comparison for this model with system shared memory:

[attached: throughput comparison plot]

Additionally, can you reproduce the issue on a system with a smaller number of GPUs, such as 8? Is there a threshold on the number of GPUs at which this issue appears?

I am working on getting this measured. Will it suffice if we set CUDA_VISIBLE_DEVICES to 8 devices on this same 20-GPU machine?

TRT_TritonSlowness.zip

tanmayv25 commented 5 months ago

I am working on getting this measured. Will it suffice if we set CUDA_VISIBLE_DEVICES to 8 devices on this same 20-GPU machine?

If you can observe the slowness with CUDA_VISIBLE_DEVICES restricted to 8 devices, then we can take it up from there. Are the above curves with CUDA_VISIBLE_DEVICES restricted to 8 devices?

rs-ixz commented 5 months ago

@tanmayv25, the above curves are still with all 20 GPUs and system shared memory.

jbkyang-nvi commented 5 months ago

Hello @rs-ixz, can you share the exact models you're using for TRT and TensorFlow? This is so there's no confusion in reproducing your results. If CUDA_VISIBLE_DEVICES lists 8 devices, only those 8 GPUs are available for Triton to use even though there are 20 GPUs in the machine.
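For example, restricting Triton to the first eight devices would look something like this (device indices and model repository path illustrative):

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 tritonserver --model-repository=/models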

rs-ixz commented 5 months ago

@jbkyang-nvi the models are available in the archive I attached a few responses ago (TRT_TritonSlowness.zip).

@jbkyang-nvi / @tanmayv25, for system shared memory, we see the crossover happening at around 4 or 5 GPUs.

Note: this plot is only for input size 2048x2048x3, with system shared memory. For each 'GPU count', I set CUDA_VISIBLE_DEVICES to the corresponding number of devices. Attached are the perf_analyzer results.

[attached: throughput vs. GPU count plot]

GPUSpread.zip

jbkyang-nvi commented 5 months ago

@jbkyang-nvi the models are available in the archive I attached a few responses ago (TRT_TritonSlowness.zip).

Thanks. Sorry I missed the zip file. How are you converting from the TensorFlow SavedModel to the TRT plan model, though? Are you going through ONNX?

rs-ixz commented 5 months ago

@jbkyang-nvi, no worries! Yes, we go through the ONNX route to get the TRT plan:

  1. Start with TF saved_model
  2. Run tf2onnx
  3. Run trtexec

Example commands for tf2onnx and trtexec are below:

[attached: screenshots of the tf2onnx and trtexec commands]
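For reference, a generic version of that pipeline (paths, opset, and min/opt shape values illustrative; the exact commands are in the screenshots above) looks like:

python -m tf2onnx.convert --saved-model ./my_tf_saved_model --output model.onnx --opset 13
trtexec --onnx=model.onnx --saveEngine=model.plan --fp16 \
        --minShapes=inputimage:1x256x256x3 \
        --optShapes=inputimage:1x2048x2048x3 \
        --maxShapes=inputimage:1x2512x2176x3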
jbkyang-nvi commented 5 months ago

Thanks for your quick response! While I'm working on a reproducer, can you try creating the model with the

--optShapes flag to control the range of input shapes, including batch size

according to https://docs.nvidia.com/tao/tao-toolkit/text/trtexec_integration/index.html and see if that helps?

rs-ixz commented 5 months ago

@jbkyang-nvi, I see from the trtexec logs that optShapes defaults to maxShapes (1x2512x2176x3). This should suffice, right? From my recollection, I have tried generating a TRT engine plan for a single input size and the issue still occurred. Let me confirm this again.

jbkyang-nvi commented 5 months ago

@rs-ixz can you also list the GPUs you are using for measuring perf?

rs-ixz commented 5 months ago

@jbkyang-nvi, they are all T4 GPUs on a single server.

rs-ixz commented 5 months ago

@jbkyang-nvi , any updates on reproducing the problem on your end?

tanmayv25 commented 5 months ago

@rs-ixz sorry for the delay. @indrajit96 is taking over this ticket and will let you know if he has any questions or findings. Thanks for your patience and prompt responses!

indrajit96 commented 5 months ago

Hi @rs-ixz, we are able to reproduce this. We will update you as soon as we have an RCA/fix/WAR. CC @tanmayv25

rs-ixz commented 5 months ago

Thank you, @indrajit96 and @tanmayv25, this is highly encouraging. Just for my clarity, the investigation will focus on system shared memory, correct?
And is there any chance you can give a rough timeline for the investigation, so that we can plan the TensorRT integration in our workflow accordingly?

indrajit96 commented 5 months ago

Hi @rs-ixz, yes, we are focusing on system shared memory. We will provide an estimate after the RCA; we are actively working on the RCA now.

Thanks, Indrajit

rs-ixz commented 5 months ago

Sounds fair, thank you, @indrajit96

indrajit96 commented 4 months ago

Hello @rs-ixz, we reproduced your issue on an 8-GPU setup with concurrency set to 20 in perf_analyzer.

Fix: use --pinned-memory-pool-byte-size at Triton startup and set it to a suitably high value. The default is ~250 MB; for 8 GPUs I set it to ~4 GB.
Usage example: tritonserver --model-repository=/mnt --pinned-memory-pool-byte-size=4684354560

RCA: we ran the repro with the NVTX flag enabled, which helped us profile all the GPU-related activity in Nsight. The NVTX traces showed multiple calls to cudaHostAlloc. At every execution, ProcessTensor calls FlushPendingPinned, which in turn calls BackendMemory::Create if there is not enough pinned CPU memory left in the pool. This slows down inference, because cudaHostAlloc is called for every inference and is slow. If --pinned-memory-pool-byte-size is set suitably high, the calls to cudaHostAlloc are reduced.
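For anyone who wants to repeat the profiling, an illustrative Nsight Systems invocation for capturing the CUDA and NVTX activity (options and output name may differ from what we actually used) is:

nsys profile --trace=cuda,nvtx -o triton_trace tritonserver --model-repository=/mnt --pinned-memory-pool-byte-size=4684354560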

CC @tanmayv25 @GuanLuo

rs-ixz commented 4 months ago

Thank you @indrajit96 for the quick debug and the fix details! I will try this out on our 20-GPU system to see if it fixes the throughput problem.

A follow-up question: are there any caveats to be aware of when increasing the CPU pinned memory pool? On a system with 256 GB of CPU memory, can we increase it to, say, 16 GB without any negative side effects, assuming 16 GB of memory is always available?
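(For 16 GiB that would be --pinned-memory-pool-byte-size=17179869184, i.e. 16 x 1024^3 bytes.)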

indrajit96 commented 3 months ago

Hello @rs-ixz, did the suggested flag resolve your issue? If so, we would like to close the issue. Also, regarding CPU pinned memory, we have not seen any known downsides of using it with Triton.

rs-ixz commented 3 months ago

Hi @indrajit96, apologies, I was traveling and couldn't get these tests done earlier. I swept through a range of CPU pinned memory sizes (default, 2 GB, 4 GB, 8 GB, 16 GB and 32 GB). Raising the setting above the default does improve throughput for the larger-input cases where we saw issues earlier, but throughput drops again as the CPU pinned memory size is increased further; 2 GB seems to be the best setting for larger inputs.

However, I would also note that the throughput difference we see between TF FP32 and TRT FP16 is minimal, even though the inference latency of TRT FP16 is ~4x lower than that of TF FP32. The perf results and models are attached. Any thoughts on this? Are we still limited by I/O operations as before?

[attached: throughput comparison plot]

CPUPinnedMemoryTest.zip

indrajit96 commented 3 months ago

Hi @rs-ixz, I suspect the latency difference could be due to a max_batch_size mismatch between the models. Can you confirm both models have the same max_batch_size? You can check with curl localhost:8000/v2/models/<model name>/config
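For example, with the models from this thread that would be:

curl localhost:8000/v2/models/my_trt_model/config
curl localhost:8000/v2/models/my_tf_model/config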

rs-ixz commented 3 months ago

@indrajit96, the max_batch_size setting was being autocompleted in the config: TRT was autocompleted to 1, while TF max_batch_size was set to 4. But note that the input/output latency is higher for TRT than for TF.

I will match max_batch_size by setting both to 1 and retry the test.

rs-ixz commented 3 months ago

@indrajit96, I confirmed that max_batch_size is set to 1 for both models. With CPU pinned memory set to 16 GB, TRT throughput has improved over the earlier baseline (TRT_FP16_Default), but the overall throughput for larger images still does not reflect the inference-time speed-up given by TensorRT. I am collecting numbers for the other CPU pinned memory sizes (2 GB, 4 GB, 8 GB, 32 GB) as well.

Attached are the perf_analyzer results and models.

[attached: throughput comparison plot with max_batch_size = 1]

BatchSize1_PerfAnalyzerResults.zip

rs-ixz commented 2 months ago

Hi @indrajit96, any chance you have been able to look further at the graphs and results above, where max_batch_size is set to 1 for both models? Please let me know if any additional data would help. I am still attempting to build Triton with NVTX enabled to profile and see whether I still see the cudaHostAlloc calls. Any tips would be much appreciated.

rs-ixz commented 1 month ago

@indrajit96, @tanmayv25, @Tabrizian, I was able to build Triton with NVTX and profile it with nsys for the following combinations:

  1. Default CPU pinned memory, TensorRT backend (FP16 model)
  2. 16 GB CPU pinned memory, TensorRT backend (FP16 model)
  3. Default CPU pinned memory, TensorFlow backend (FP32 model)

(All tests were run with a 2048x2048x3 input and concurrency set to 20. Max batch size = 1 in all cases.)

My observations:

TensorRT + default CPU pinned memory shows a lot of cudaHostAlloc calls, as noted by @indrajit96.

[attached: Nsight Systems trace, TensorRT + default CPU pinned memory]

TensorRT + 16 GB CPU pinned memory does not show the cudaHostAlloc calls, but shows large gaps (~40 ms) between successive inferences on a single device. Why is this the case? Is this the reason for the low throughput mentioned in my previous comment? A further observation: average CPU usage is >99% with TensorRT, while the TF backend uses ~35% CPU on average.

[attached: Nsight Systems trace, TensorRT + 16 GB CPU pinned memory]

TensorFlow + default CPU pinned memory shows much smaller gaps between successive inferences on a single device, around 17 ms.

[attached: Nsight Systems trace, TensorFlow + default CPU pinned memory]

I need your help to understand why throughput is still low even after increasing CPU pinned memory, i.e., why we are seeing gaps in the profile twice as large as with TF. Would any additional data help in understanding the problem? I appreciate your help so far and would love to get this fully resolved.

Note: I can share the profiling results, but the files are really large. Please let me know if you have recommendations on attaching these large profile results.