triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html

Does ensemble model release CUDA cache? #5237

Open zeruniverse opened 1 year ago

zeruniverse commented 1 year ago

Description

We set cache-cleaning arguments on all models running on GPU. After calling them individually and repeatedly, the GPU RAM utilization stays stable while idle. Then we call the ensemble model, and GPU RAM usage increases a lot. We are wondering whether the additional GPU RAM is consumed by the input/output cache or by tensor GPU <-> CPU moves (we have Python models that process the outputs of previous models and generate inputs for the following models; the Python models run on CPU, so the ensemble model needs to move tensors between GPU and CPU).
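
For context, such a CPU-side Python model looks roughly like the minimal sketch of the Triton Python backend API below; the tensor names and the processing are placeholders, not the actual pipeline logic:

import numpy as np
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # Python backend inputs arrive as CPU tensors, so in an ensemble
            # Triton copies the upstream GPU outputs to host memory first.
            a = pb_utils.get_input_tensor_by_name(request, "INPUT0").as_numpy()
            b = pb_utils.get_input_tensor_by_name(request, "INPUT1").as_numpy()
            out = np.concatenate([a, b], axis=-1)  # placeholder processing
            responses.append(pb_utils.InferenceResponse(
                output_tensors=[pb_utils.Tensor("OUTPUT", out)]))
        return responses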

Triton Information

r22.07, but I checked the release notes and there have been no changes regarding ensemble-model CUDA memory handling since then.

To Reproduce

We have 2 ONNX models (A, C), 3 PyTorch models (B, D, E), and 2 Python models (P1, P2) for data processing. All ONNX models have memory.enable_memory_arena_shrinkage = "cpu:0;gpu:0".

All PyTorch models have ENABLE_CACHE_CLEANING = true.
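
These backend parameters are set in each model's config.pbtxt. A minimal sketch, showing only the parameters block (the rest of each config is omitted):

# ONNX models A and C (onnxruntime backend)
parameters {
  key: "memory.enable_memory_arena_shrinkage"
  value: { string_value: "cpu:0;gpu:0" }
}

# PyTorch models B, D, E (pytorch backend)
parameters {
  key: "ENABLE_CACHE_CLEANING"
  value: { string_value: "true" }
}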

The ensemble routing is like:

IN1  IN2
|     |
A     B
|     |
 \   /
  P1
   |
  / \
 C   D
 |   |
  \ /
   P2
   |
   E
   |
 OUTPUT
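
In ensemble config.pbtxt terms, the first few steps of that routing look roughly like the sketch below; tensor names such as INPUT, OUTPUT, a_out are placeholders, and the ensemble's own input/output declarations are omitted:

platform: "ensemble"
ensemble_scheduling {
  step [
    {
      model_name: "A"
      model_version: -1
      input_map { key: "INPUT" value: "IN1" }
      output_map { key: "OUTPUT" value: "a_out" }
    },
    {
      model_name: "B"
      model_version: -1
      input_map { key: "INPUT" value: "IN2" }
      output_map { key: "OUTPUT" value: "b_out" }
    },
    {
      model_name: "P1"
      model_version: -1
      input_map { key: "INPUT0" value: "a_out" }
      input_map { key: "INPUT1" value: "b_out" }
      output_map { key: "OUTPUT" value: "p1_out" }
    }
    # ... C, D, P2 and E continue the same pattern down to the final OUTPUT
  ]
}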

When calling A, B, C, D, and E individually and repeatedly, the GPU memory stops increasing after several rounds and stays at 6 GB when idle (no inference requests). When then calling the above ensemble model for several rounds, the GPU RAM usage increases to 11 GB. The Python scripts run on CPU, and I confirmed that nvidia-smi does not show any python_stub process.

All of the GPU RAM utilization mentioned above was checked using nvidia-smi. The output tensors of B and P1 are about 1.5 GB.

Expected behavior

Calling ensemble models repeatedly should not increase idle GPU RAM utilization compared to calling all GPU models individually. Or, at the very least, there should be a configuration field in config.pbtxt for ensemble models, similar to the PyTorch ENABLE_CACHE_CLEANING = true; if set, the behavior would meet my expectation.

zeruniverse commented 1 year ago

Not sure if this is relevant: when running those individual models, I don't see any warnings in Triton, but when running the ensemble, I see the following:

W0113 04:35:23.728355 1 memory.cc:183] Failed to allocate CUDA memory with byte size 1743446016 on GPU 0: CNMEM_STATUS_OUT_OF_MEMORY, falling back to pinned system memory
W0113 04:35:23.728386 1 pinned_memory_manager.cc:133] failed to allocate pinned system memory, falling back to non-pinned system memory
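
For reference, the CUDA and pinned memory pools these warnings refer to are sized when tritonserver starts, so if the fallback itself becomes a problem the pool sizes can be raised with the flags below; the byte sizes here are only illustrative, and note that a larger CUDA pool is itself GPU memory reserved up front:

tritonserver --model-repository=/models \
    --cuda-memory-pool-byte-size=0:2147483648 \
    --pinned-memory-pool-byte-size=1073741824
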
zeruniverse commented 1 year ago

With some experiments, I found the following for ONNX Runtime / PyTorch models whose inputs/outputs have dynamic shapes: if I first run inference with data that produces a large output, the CUDA RAM taken by tritonserver after model execution and arena shrinkage is, say, X. If I then run inference with data that produces a small output, the CUDA RAM after execution and shrinkage is Y, and Y < X. It seems the RAM related to the output is not completely released after each inference.

sboudouk commented 3 months ago

@zeruniverse how did you manage to get this sorted out? I think I'm facing the same issue.

zeruniverse commented 3 months ago

I implemented my own program to do the ensembling. It seems this problem won't be fixed.

sboudouk commented 3 months ago

You kept Triton Server, but instead of calling an ensemble once, you're calling your models one by one from your client?

zeruniverse commented 3 months ago

Yes, you can think of it as implementing the ensemble logic on the client side.
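
Roughly like the sketch below: a minimal client-side chaining of the models with tritonclient[http], using placeholder model names, tensor names, and shapes rather than the real ones:

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

def infer(model_name, inputs, output_name="OUTPUT"):
    # inputs: dict mapping input tensor name -> float32 numpy array
    infer_inputs = []
    for name, arr in inputs.items():
        inp = httpclient.InferInput(name, list(arr.shape), "FP32")
        inp.set_data_from_numpy(arr)
        infer_inputs.append(inp)
    return client.infer(model_name, infer_inputs).as_numpy(output_name)

# Chain the models manually instead of letting a Triton ensemble do it.
in1 = np.random.rand(1, 3, 224, 224).astype(np.float32)
in2 = np.random.rand(1, 3, 224, 224).astype(np.float32)

a_out = infer("A", {"INPUT": in1})
b_out = infer("B", {"INPUT": in2})
p1_out = infer("P1", {"INPUT0": a_out, "INPUT1": b_out})
# ... and so on through C, D, P2 and E.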
