triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Memory leak in 20.09? #2109

Closed: ghost closed this issue 3 years ago

ghost commented 3 years ago

Description

We load up a couple of 2080s with ~75 TensorFlow SavedModel models.

GPU memory usage is about 95% after loading; once inference starts, usage creeps up to 99%. We then tested with a single ~100MB model and used your perf_client to do some load testing at different batch sizes, and the memory eventually gets completely consumed.
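For reference, the load test was roughly this kind of perf_client run (the model name, endpoint, and flag values here are illustrative, not the exact command):

perf_client -m efficientnet_b5 -u localhost:8001 -i grpc -b 8 --concurrency-range 1:4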

Triton logs have lots of this type of thing:

...
2020-10-12 23:45:51.958146: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8a38c9ea00 next 30 of size 1536                                                                                                                                                                                                 
2020-10-12 23:45:51.958149: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8a38c9f000 next 31 of size 7424                                                                                                                                                                                                 
2020-10-12 23:45:51.958151: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8a38ca0d00 next 34 of size 1536                                                                                                                                                                                                 
2020-10-12 23:45:51.958153: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8a38ca1300 next 35 of size 1280                                                                                                                                                                                                 
2020-10-12 23:45:51.958155: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8a38ca1800 next 36 of size 3072                                                                                                                                                                                                 
2020-10-12 23:45:51.958158: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8a38ca2400 next 37 of size 12288                                                                                                                                                                                                
2020-10-12 23:45:51.958160: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8a38ca5400 next 38 of size 7424                                                                                                                                                                                                 
2020-10-12 23:45:51.958163: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8a38ca7100 next 39 of size 512                                                                                                                                                                                                  
2020-10-12 23:45:51.958165: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8a38ca7300 next 40 of size 4352                                                                                                                                                                                                 
2020-10-12 23:45:51.958167: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8a38ca8400 next 41 of size 4352                                                                                                                                                                                                 
2020-10-12 23:45:51.958169: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8a38ca9500 next 42 of size 7424                                                                                                                                                                                                 
2020-10-12 23:45:51.958172: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8a38cab200 next 43 of size 98304                                                                                                                                                                                                
2020-10-12 23:45:51.958174: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8a38cc3200 next 44 of size 4352                                                                                                                                                                                                 
2020-10-12 23:45:51.958177: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f8a38cc4300 next 18446744073709551615 of size 244992                                                                                                                                                                             
2020-10-12 23:45:51.958179: I tensorflow/core/common_runtime/bfc_allocator.cc:914]      Summary of in-use Chunks by size:                                                                                                                                                                                                       
2020-10-12 23:45:51.958185: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 7600 Chunks of size 256 totalling 1.86MiB                                                                                                                                                                                                    
2020-10-12 23:45:51.958188: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 2968 Chunks of size 512 totalling 1.45MiB                                                                                                                                                                                                    
2020-10-12 23:45:51.958191: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 3116 Chunks of size 768 totalling 2.28MiB                                                                                                                                                                                                    
2020-10-12 23:45:51.958193: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 3572 Chunks of size 1024 totalling 3.49MiB                                                                                                                                                                                                   
2020-10-12 23:45:51.958196: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 2812 Chunks of size 1280 totalling 3.43MiB                                                                                                                                                                                                   
2020-10-12 23:45:51.958199: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 3420 Chunks of size 1536 totalling 5.01MiB
...

ending with something like this:

...
2020-10-12 23:45:51.958921: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 29942784 totalling 28.56MiB
2020-10-12 23:45:51.958924: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 2 Chunks of size 46720000 totalling 89.11MiB
2020-10-12 23:45:51.958927: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 48848128 totalling 46.58MiB
2020-10-12 23:45:51.958929: I tensorflow/core/common_runtime/bfc_allocator.cc:921] Sum Total of in-use chunks: 8.29GiB
2020-10-12 23:45:51.958932: I tensorflow/core/common_runtime/bfc_allocator.cc:923] total_region_allocated_bytes_: 10090647296 memory_limit_: 10090647389 available bytes: 93 curr_region_allocation_bytes_: 17179869184
2020-10-12 23:45:51.958938: I tensorflow/core/common_runtime/bfc_allocator.cc:929] Stats: 
Limit:                 10090647389
InUse:                  8896387072
MaxInUse:              10090647296
NumAllocs:                   58208
MaxAllocSize:           1358935552

2020-10-12 23:45:51.960590: W tensorflow/core/common_runtime/bfc_allocator.cc:424] ****__________**************************************************************************************

This behaviour keeps repeating. Inference takes minutes if we keep the batch size at 1; with anything greater than that, it stops working completely.

The models are EfficientNet-B5. We can send you one of our models if you want to test this.

Triton Information

Container 20.09

To Reproduce

Load 75 of these models on a 2080 and try to do inference.

Expected behavior

Memory usage stays constant and inference is fast.

deadeyegoodwin commented 3 years ago

The TensorFlow framework allocates some memory when a model is loaded, but it will dynamically allocate more (potentially a lot more) as it processes inference requests. Do you have evidence that memory usage should be constant with a TensorFlow SavedModel? Have you tried running the model directly using the TF APIs (for example, by writing a Python script to load and run it)? Triton has a model analyzer project that can help you understand the memory requirements of your models: https://github.com/triton-inference-server/model_analyzer
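A minimal sketch of that kind of standalone check, assuming TF 2.x and a single serving signature; the model path, input name, and input shape below are placeholders, not taken from the reporter's setup:

# Load the SavedModel with plain TensorFlow and run repeated inference
# while watching GPU memory in nvidia-smi. If memory grows here too,
# the growth comes from TensorFlow's allocator rather than from Triton.
import numpy as np
import tensorflow as tf

model = tf.saved_model.load("/models/efficientnet_b5/1/model.savedmodel")
infer = model.signatures["serving_default"]

# Check the real input name and shape with:
#   print(infer.structured_input_signature)
batch = tf.constant(np.random.rand(1, 456, 456, 3).astype(np.float32))

for step in range(1000):
    infer(input_1=batch)  # "input_1" is a placeholder input name
    if step % 100 == 0:
        print("completed step", step)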

ghost commented 3 years ago

Would converting the model to TensorRT first stop this?

With Triton 1.x we didn't see this behaviour with the same models.

deadeyegoodwin commented 3 years ago

TensorFlow's behavior with respect to memory allocation hasn't changed in a long time, so I don't know why you are seeing something different now. TensorRT models allocate their entire GPU memory requirement when they are loaded, so you would not have this problem of memory growing during inferencing... at least not from the model. Memory usage could still grow if you were overloading the server so that many requests were queuing up, but that would be CPU memory, and there are model configuration settings to limit the queue size (https://github.com/triton-inference-server/server/blob/master/docs/model_configuration.md#scheduling-and-batching).

With TensorRT you pick the maximum allowed batch size when you build the model, so the memory required to support it is allocated when the model is loaded.
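For concreteness, the queue limit mentioned above would look roughly like this in a model's config.pbtxt. This is only a sketch: the model name and values are placeholders, and which fields are available depends on the Triton release (see the model_configuration docs linked above).

# Sketch only: placeholder name and values.
name: "efficientnet_b5"
platform: "tensorflow_savedmodel"
max_batch_size: 8
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
  default_queue_policy {
    max_queue_size: 64   # additional requests are rejected instead of queued
  }
}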

ghost commented 3 years ago

So you don't know how your software behaves, and you don't know how long it has behaved this way, i.e. how long is a "long time" in software? Don't you mean: since version x.x?

We don't have a CPU problem; the server has 28 cores and they never max out.

When we compile EfficientNet from the SavedModel to TensorRT we get a lot of "Algorithm not found" errors.

konstantinos-imagr commented 3 years ago

Hmm, have you tried loading everything on a DGX?

ghost commented 3 years ago

The TensorRT problem was specific to a particular combination of TensorRT version and the CUDA driver version it's being run on.