Open rs-ixz opened 6 months ago
@jbkyang-nvi @tanmayv25 Any thoughts?
@lkomali , @jbkyang-nvi , @tanmayv25 , any chance someone was able to look at this? Any recommendations would help us retire this risk of using TensorRT with Triton.
@rs-ixz The team is looking into it.
@rs-ixz Triton does not provide an option for users to force a request to be executed on a specific model instance. Instead, Triton schedules each request to the next available instance. Incoming requests are held in queues within Triton core until a model instance becomes available. This is a simple solution for maximizing throughput. The execution time of each inference can vary, so this reactive approach ensures that no model instance on any GPU starves while there are requests in the queue.
D2D copies across different GPUs incur some latency cost, but they won't be as expensive as the H2D memory copies needed with system shared memory. Additionally, there is no efficient way to make sure all model instances stay busy when requests are pinned to specific instances. We might be able to avoid a cross-device memory copy if we let the client handle data and request placement; however, this might lead to instance starvation.
Thanks, @tanmayv25 , your response seems to suggest that the Throughput we see for CUDA shared memory is as expected.
My overarching problem is this: TensorRT models are not able to get high throughput although inference on GPU is faster than TensorFlow
System shared memory --> Tensorflow backend behaves well for all input sizes
System shared memory --> TensorRT backend suffers from really high compute input / compute output timing for large inputs (2048x2048x1) --> 3x slower than TensorFlow throughput. For smaller inputs (400x400x1), TensorRT outperforms the TensorFlow backend (3x faster than TensorFlow throughput).
Cuda shared memory --> TensorRT backend suffers from high compute input timing but throughput matches tensorflow at best. We lose the Throughput gain we saw for small inputs in TensorRT w/ shared memory
We would like to get the benefits of TensorRT (faster inference) but we are presently limited by either System Shared memory being slow for Large inputs or CUDA Shared memory being overall slow and only meeting TensorFlow numbers for all input sizes.
Please let me know if any more details about the above will help address the issue.
Adding @Tabrizian who was investigating implicitly selecting model instance based on the input tensor data locality.
Cuda shared memory --> TensorRT backend suffers from high compute input timing but throughput matches tensorflow at best. We lose the Throughput gain we saw for small inputs in TensorRT w/ shared memory
The scale of 20 GPUs might be saturating the inter-device bandwidth. Could this be a use case for resuming work on that feature?
@rs-ixz Can you share the perf_analyzer numbers for Tensorflow + Cuda Shared Memory as well?
@tanmayv25 , Here are the perf_analyzer numbers for TF + Cuda shm:
Request concurrency: 20
  Client:
    Request count: 61919
    Throughput: 286.613 infer/sec
    Avg latency: 69775 usec (standard deviation 7604 usec)
    p50 latency: 64728 usec
    p90 latency: 91601 usec
    p95 latency: 108960 usec
    p99 latency: 134397 usec
    Avg HTTP time: 69760 usec (send/recv 99 usec + response wait 69661 usec)
  Server:
    Inference count: 61919
    Execution count: 61919
    Successful request count: 61919
    Avg request latency: 69369 usec (overhead 42 usec + queue 7150 usec + compute input 13990 usec + compute infer 42697 usec + compute output 5490 usec)
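As a rough sanity check on the report above, the server-side breakdown can be turned into per-stage shares, and the client throughput can be compared against concurrency divided by average latency. The sketch below uses only the figures from the report; the variable names are illustrative.

```python
# Server-side latency breakdown from the perf_analyzer report above (usec).
stages_usec = {
    "overhead": 42,
    "queue": 7150,
    "compute input": 13990,
    "compute infer": 42697,
    "compute output": 5490,
}
avg_request_latency_usec = 69369

# The stages should add up to the reported average request latency.
total_usec = sum(stages_usec.values())
for name, usec in stages_usec.items():
    share = 100.0 * usec / total_usec
    print(f"{name:>14}: {usec:6d} usec ({share:4.1f}%)")

# With a fixed number of requests in flight, throughput is roughly
# concurrency / average client latency.
concurrency = 20
avg_client_latency_sec = 69775e-6
estimated_throughput = concurrency / avg_client_latency_sec
print(f"estimated throughput: {estimated_throughput:.1f} infer/sec")
```

In this run, compute input and compute output together account for roughly 28% of the server-side latency, and the estimated throughput lands close to the reported 286.613 infer/sec.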
Following is a summary plot of the four combinations (TF/TRT + CUDA/System shm). We can clearly see TRT + Sys Shm is best for smaller input and for larger ones TF + Sys Shm is the best.
Hi @rs-ixz, thanks for sharing your observations and sorry for the delayed response.
TensorRT models are not able to get high throughput although inference on GPU is faster than TensorFlow
We would definitely want to bridge this gap. From the numbers that you have shared, there seems to be a bottleneck in preparing the input data and collecting the output data from the results.
@rs-ixz I am going to create a ticket for investigation within the team. Meanwhile, can you share the models and exact steps for us to reproduce the issue? Additionally, can you reproduce the issue on a system with fewer GPUs, such as 8? Is there a threshold on the number of GPUs for observing this issue?
@tanmayv25 , thanks for getting back on this.
From the numbers that you have shared, there seems to be a bottleneck in preparing the input data and collecting the output data from the results.
Is this a comment regarding TensorRT backend with System shared memory or Cuda shared memory or both? Just to reiterate, System shared memory is our current baseline and performs well even with TensorRT except for large inputs. Cuda shared memory is not great for all sizes probably due to inter-device communication and the fact that we use 20 GPUs.
Meanwhile, can you share the models and exact steps for us to reproduce the issue?
I was able to reproduce the problem with an off-the-shelf U-Net model. Attached is an archive containing models and perf_analyzer results.
Here is the plot comparison for this model with System Shared memory:
Additionally, can you reproduce the issue on a system with fewer GPUs, such as 8? Is there a threshold on the number of GPUs for observing this issue?
I am working on getting this measured. Will it suffice if we set CUDA_VISIBLE_DEVICES with 8 devices on this same 20 GPU machine?
I am working on getting this measured. Will it suffice if we set CUDA_VISIBLE_DEVICES with 8 devices on this same 20 GPU machine?
If you could observe the slowness with CUDA_VISIBLE_DEVICES=8, then we can take up from there. Are the above curves with CUDA_VISIBLE_DEVICES=8?
@tanmayv25 , the above curves are still with 20 GPUs with System shared memory.
Hello @rs-ixz can you share the exact model you're using for TRT and Tensorflow? This is so there's no confusion in reproducing your results. If CUDA_VISIBLE_DEVICES lists 8 devices, only those 8 GPUs should be available for Triton to use even though the machine has 20 GPUs.
@jbkyang-nvi the models are available in the archive I attached few responses ago (TRT_Slowness.zip)
@jbkyang-nvi / @tanmayv25 , for system shared memory, we see the crossover happening at around 4/5 GPUs.
Note - this plot is only for input size 2048x2048x3, with System shared memory. For each 'GPU Count', I set CUDA_VISIBLE_DEVICES to the corresponding number of devices. Attached are the perf_analyzer results.
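For anyone repeating the GPU-count sweep: CUDA_VISIBLE_DEVICES takes a comma-separated list of device indices rather than a count, so exposing 8 GPUs means listing indices 0 through 7 (a value of just 8 would expose only device index 8). A minimal sketch, with hypothetical helper names, of building the environment for each step of the sweep:

```python
import os


def visible_devices(gpu_count: int) -> str:
    """Return a CUDA_VISIBLE_DEVICES value exposing the first gpu_count GPUs."""
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    return ",".join(str(index) for index in range(gpu_count))


def server_env(gpu_count: int) -> dict:
    """Copy the current environment and restrict the visible GPUs."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = visible_devices(gpu_count)
    return env


# Print the value that would be used for a few GPU counts in the sweep.
for count in (1, 4, 8):
    print(count, "->", visible_devices(count))
```

The returned dictionary could be passed as the env argument of subprocess.Popen when launching the server for each GPU count.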
@jbkyang-nvi the models are available in the archive I attached few responses ago (TRT_Slowness.zip)
Thanks. Sorry I missed the zip file. How are you converting from tensorflow savedmodel to TRT plan model though? Are you going through ONNX?
@jbkyang-nvi , no worries ! and yes, we are going through onnx route to get the TRT plan.
example commands for tf2onnx and trtexec below:
Thanks for your quick response! While I'm working on a reproducer, can you try creating the model with the --optShapes flags to control the range of input shapes, including batch size, according to https://docs.nvidia.com/tao/tao-toolkit/text/trtexec_integration/index.html and see if that helps?
@jbkyang-nvi , I see from trtexec logs that optShape is by default set to the maxShapes (1x2512x2176x3). This should suffice right? From my recollection, I have tried generating trt engine plan for a single input size and the issue still occurred with it. Let me confirm this again.
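For reference, a sketch of how explicit shape ranges could be passed to trtexec instead of relying on the defaults. The maximum shape is the one mentioned above; the tensor name (input_1), the file names, and the minimum and optimum shapes are assumptions for illustration and are not taken from the attached models.

```python
import shlex


def shape_arg(tensor_name, dims):
    """Format a trtexec shape spec such as input_1:1x2048x2048x3."""
    return f"{tensor_name}:{'x'.join(str(d) for d in dims)}"


def trtexec_command(onnx_path, plan_path, tensor_name,
                    min_dims, opt_dims, max_dims, fp16=True):
    """Build a trtexec command line with an explicit optimization profile."""
    cmd = [
        "trtexec",
        f"--onnx={onnx_path}",
        f"--saveEngine={plan_path}",
        f"--minShapes={shape_arg(tensor_name, min_dims)}",
        f"--optShapes={shape_arg(tensor_name, opt_dims)}",
        f"--maxShapes={shape_arg(tensor_name, max_dims)}",
    ]
    if fp16:
        cmd.append("--fp16")
    return cmd


# Hypothetical names and shapes; only the max shape comes from the thread.
cmd = trtexec_command(
    "model.onnx", "model.plan", "input_1",
    min_dims=(1, 400, 400, 3),
    opt_dims=(1, 2048, 2048, 3),
    max_dims=(1, 2512, 2176, 3),
)
print(shlex.join(cmd))
```

Setting the optimum shape to the size that dominates production traffic, rather than the maximum, is one thing worth trying here, though it may not change the compute input / compute output timings discussed in this issue.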
@rs-ixz can you also list the GPUs you are using for measuring perf?
@jbkyang-nvi , all are T4 GPUs on a single server.
@jbkyang-nvi , any updates on reproducing the problem on your end?
@rs-ixz sorry for the delay. @indrajit96 is taking over this ticket and would let you know if he has some questions or findings. Thanks for your patience and prompt responses in this case!
Hi @rs-ixz , we are able to reproduce this. We will update you as soon as we have an RCA/Fix/WAR. CC @tanmayv25
Thank you, @indrajit96 and @tanmayv25 , this is highly encouraging. Just for my clarity, the investigation focus is going to be on System shared memory, correct?
And is there any chance you may be able to give a rough idea for the timeline of the investigation? Just so that we can plan TensorRT integration in our workflow accordingly.
Hi @rs-ixz , Yes we are focusing on System Shared Memory, we will provide an estimate after an RCA. Currently we are actively in the process of RCA.
Thanks, Indrajit
Sounds fair, thank you, @indrajit96
Hello @rs-ixz , We reproduced your issue on an 8-GPU setup with concurrency set to 20 in Perf Analyzer.
Fix: Pass --pinned-memory-pool-byte-size at Triton startup and set the size to a suitably high value. The default is ~250MB. For 8 GPUs I set it to ~4GB.
Usage Example: tritonserver --model-repository=/mnt --pinned-memory-pool-byte-size=4684354560
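Since the flag takes a raw byte count, it can be easier to derive the value from a size in GiB than to type it by hand. A minimal sketch that builds the same kind of command line; the helper name is hypothetical, and /mnt is the repository path from the usage example above.

```python
import shlex

GIB = 1024 ** 3  # bytes per GiB


def tritonserver_command(model_repository: str, pinned_pool_gib: float) -> list:
    """Build a tritonserver command with an enlarged pinned memory pool."""
    pinned_pool_bytes = int(pinned_pool_gib * GIB)
    return [
        "tritonserver",
        f"--model-repository={model_repository}",
        f"--pinned-memory-pool-byte-size={pinned_pool_bytes}",
    ]


cmd = tritonserver_command("/mnt", 4)
print(shlex.join(cmd))
```

The same helper can be looped over several pool sizes to compare throughput at each setting.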
RCA: We ran the repro with the NVTX flag enabled, which let us profile all the GPU-related activity in Nsight. The NVTX traces showed multiple calls to cudaHostAlloc. At every execution, ProcessTensor calls FlushPendingPinned, which in turn calls BackendMemory::Create if there is not enough pinned CPU memory left in the pool. This slows down inference because cudaHostAlloc is slow and ends up being called for every inference. If --pinned-memory-pool-byte-size is set suitably high, the calls to cudaHostAlloc are reduced.
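A back-of-the-envelope estimate suggests why the default pool runs out for large inputs. The sketch below assumes FP32 input tensors of shape 1x2048x2048x3, that all 20 concurrent requests stage their input in pinned memory at the same time, and a default pool of 256 MiB (approximating the ~250MB mentioned above). Output buffers are not counted, so the real demand may be higher.

```python
import math

MIB = 1024 ** 2  # bytes per MiB


def tensor_bytes(shape, bytes_per_element):
    """Size in bytes of a dense tensor with the given shape."""
    return math.prod(shape) * bytes_per_element


# Assumptions: FP32 input (4 bytes per element), one input tensor per request.
input_bytes = tensor_bytes((1, 2048, 2048, 3), 4)
concurrency = 20
in_flight_bytes = input_bytes * concurrency

# Assumed default pool size, approximating the "~250MB" from the thread.
default_pool_bytes = 256 * MIB

print(f"one input:      {input_bytes / MIB:.0f} MiB")
print(f"20 in flight:   {in_flight_bytes / MIB:.0f} MiB")
print(f"default pool:   {default_pool_bytes / MIB:.0f} MiB")
print("pool exhausted:", in_flight_bytes > default_pool_bytes)
```

Under these assumptions the in-flight inputs alone need several times the default pool, which is consistent with the fallback allocations seen in the traces.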
CC @tanmayv25 @GuanLuo
Thank you @indrajit96 for the quick debug and fix details! I will try this out on our 20 GPU system to see if it fixes the throughput problem.
A follow-up question - are there any caveats to be aware of while increasing the cpu pinned memory pool? On a system with CPU memory of 256G, can we increase it to say 16G without any negative side-effects assuming 16G of memory is available for use always?
Hello @rs-ixz , Did the suggested flag resolve your issue? If yes, we would like to close the issue. Also, regarding CPU pinned memory, we have not seen any known downsides of using it with Triton.
Hi @indrajit96 , apologies, I was traveling and couldn't get these tests done earlier. I tried sweeping through a range of CPU pinned memory sizes (default, 2GB, 4GB, 8GB, 16GB and 32GB). Raising the setting above the default does appear to improve throughput for the larger-input cases where we saw issues earlier. However, throughput drops as the CPU pinned memory size is increased further; 2GB seems to be the best setting for larger inputs.
However, I would also like to note that the throughput difference we see between TF FP32 and TRT FP16 is very minimal although the inference latency for TRT FP16 is ~4x faster than that of TF FP32. I have the perf results & models attached here. Any thoughts on this? Are we getting limited by I/O operations like earlier?
Hi @rs-ixz , I suspect the latency could be due to max_batch_size mismatch in models. Can you confirm both models have the same batch_size? You can check using curl localhost:8000/v2/models/"model name"/config
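The same check can be scripted so that both models are compared in one go. This is a sketch using only the standard library; it assumes the server listens on localhost:8000 as in the curl command above, and the model names in the usage comment are placeholders.

```python
import json
from urllib.request import urlopen


def model_config_url(model_name: str, host: str = "localhost:8000") -> str:
    """URL of the model configuration endpoint used by the curl command."""
    return f"http://{host}/v2/models/{model_name}/config"


def max_batch_size_from_config(config_json: str) -> int:
    """Extract max_batch_size from the JSON config (0 if the field is absent)."""
    config = json.loads(config_json)
    return int(config.get("max_batch_size", 0))


def fetch_max_batch_size(model_name: str, host: str = "localhost:8000") -> int:
    """Query a running server for the effective max_batch_size of a model."""
    with urlopen(model_config_url(model_name, host)) as response:
        return max_batch_size_from_config(response.read().decode("utf-8"))


# Usage against a running server (model names are placeholders):
#   for name in ("unet_trt", "unet_tf"):
#       print(name, fetch_max_batch_size(name))
```

Because the endpoint returns the configuration after autocomplete, it shows the value the server is actually using rather than what is written in config.pbtxt.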
@indrajit96 , max_batch_size setting was being autocompleted in the config. TRT was being autocompleted to 1 and TF max_batch_size was being set to 4. But note that input/output latency is higher for TRT than for TF.
I will attempt to match max_batch_size by setting to 1 and retrying the test.
@indrajit96 , I checked by confirming max_batch_size is set to 1 for both models. With CPU pinned memory set to 16 GB, TRT throughput has improved from earlier baseline (TRT_FP16_Default). But overall throughput for larger images does not reflect the inference timing speed up given by TensorRT. I am further collecting numbers with other CPU pinned memory sizes (2GB, 4GB, 8GB, 32GB)
Attached are the perf_analyzer results and models.
Hi @indrajit96 , any chance you got to look further on the graphs and results above where max_batch_size is set to 1 for both models? Please let me know if any additional data would help. I am still attempting to build triton with nvtx enabled to profile and see if I still see the cuda Host Alloc calls. Any tips from you would be much appreciated.
@indrajit96 , @tanmayv25 , @Tabrizian , I was able to build triton with nvtx and profiled it with nsys for the following combinations:
--> Default CPU Pinned memory – TensorRT backend (FP16 model)
--> 16 GB CPU pinned memory – TensorRT backend (FP16 model)
--> Default CPU Pinned memory – TensorFlow Backend (FP32 model)
(All tests were run for 2048x2048x3 input size with concurrency set to 20. Max batch size = 1 for all cases)
My observations: TensorRT + Default CPU pinned memory shows a lot of cudaHostAlloc calls as noted by @indrajit96.
TensorRT + 16 GB CPU Pinned memory does not show cudaHostAlloc calls, but shows a lot of gaps (~ 40 ms) between successive inferences on a single device – Why is this the case? Is this the reason for low throughput as mentioned in previous comment? Further observation - average CPU usage is >99% in TensorRT, but TF backend uses ~35% CPU on average
TensorFlow + Default CPU pinned memory shows that the gaps between successive inferences on a single device are much smaller - ~17ms.
Need your help to understand why even after increasing CPU Pinned memory, throughput is low, i.e., why are we seeing gaps in the profile twice as big as TF? Would any additional data help to understand the problem? Appreciate your help so far, would love to get this fully resolved.
Note - I can share the profiling results, but the files are really large. Please let me know if you have recommendations on attaching these large profile results
Description
When CUDA shared memory is used with the HTTP/GRPC protocol, the client is expected to allocate CUDA memory on one of the devices and copy the data into it. On systems with multiple GPUs (in my case, a machine with ~20 GPUs), how is the client recommended to copy the data to the right device for best performance, considering that the Triton server handles scheduling of jobs between the GPUs? If Triton decides to execute the inference request on a different GPU, would there be a significant penalty in copying the data over? How can this be avoided?
From perf_analyzer experiments we notice that the perf_analyzer client seems to copy the data to GPU 0 and it appears that the TritonServer internally copies the data on to the right device for execution? (We can see that all 20 GPUs are being used while checking usage with nvidia-smi). This aligns with the increased latency of "Compute Input" step in the perf analysis when executed with 20 GPUs. How can we maximize throughput with perf_analyzer while running TensorRT models + CUDA Shared memory on multiple devices?
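To make the question concrete, here is a sketch of how a Python client places an input in CUDA shared memory on a device of its own choosing. It assumes the tritonclient package, with cudashm being the tritonclient.utils.cuda_shared_memory module and client an HTTP or GRPC inference client; the function name is hypothetical. The device_id is chosen by the client at registration time, so nothing here guarantees that the request runs on that device.

```python
def register_input_on_device(client, cudashm, region_name, array, device_id):
    """Copy a NumPy array into CUDA shared memory on device_id and register it.

    client  -- a tritonclient HTTP or GRPC InferenceServerClient (assumed)
    cudashm -- the tritonclient.utils.cuda_shared_memory module (assumed)
    """
    byte_size = array.nbytes

    # Allocate the region on the GPU the client picked.
    handle = cudashm.create_shared_memory_region(region_name, byte_size, device_id)

    # Copy the host data into that device allocation.
    cudashm.set_shared_memory_region(handle, [array])

    # Tell the server where the region lives; the server decides separately
    # which model instance (and therefore which GPU) executes the request.
    client.register_cuda_shared_memory(
        region_name, cudashm.get_raw_handle(handle), device_id, byte_size
    )
    return handle
```

If the scheduled instance sits on a different GPU than device_id, the server has to copy the input across devices, which matches the increased "Compute Input" time described above.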
TensorRT + Cuda Shared Mem
If alternatively, system shared memory is used for TensorRT model, we see a huge hit in both Compute Input and Compute Output timings, severely affecting throughput.
TensorRT + System Shared Mem
Same issue is not seen while working with TF backend and system shared memory for the same model. But TF does not give us the inference speed-up that TensorRT does.
TensorFlow + System Shared Mem
Triton Information
Triton Version - 24.01 (same behavior seen in older versions like 23.07 as well)
Driver: 545.23.08
Cuda: 12.3
GPU: T4
To Reproduce
Steps to reproduce the behavior.
Model used - A U-Net variant image segmentation model.
Backend - TensorRT (Comparison with TensorFlow backend)
Precision used - FP16
Model Config :