triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html

[RFC] Provide an option to start any backend out-of-proc to help with memory management on UNLOAD #5236

Open nikhil-sk opened 1 year ago

nikhil-sk commented 1 year ago

Is your feature request related to a problem? Please describe. (This is a high-level thought and a feature request; I will update this thread if I can gather more specific data)

  1. Currently, certain framework backends, e.g. TensorFlow, do not release GPU memory when their respective models are unloaded. Other backends such as PyTorch or TensorRT may release only part of the memory, depending on the caching configuration.
  2. This behavior leads to a misrepresentation of available GPU memory when queried using tools like nvidia-smi. It also prevents a different backend from allocating GPU memory, even though no model from the memory-holding backend is currently executing (see the sketch after this list).
  3. For TensorFlow, this note is helpful: https://github.com/triton-inference-server/tensorflow_backend#how-does-the-tensorflow-backend-manage-gpu-memory. However, it suggests (among other options) that customers configure their workload such that TF models run within a single process. This experience can be improved.
  4. In the Python backend, each model runs out-of-proc from the tritonserver process. When the model is unloaded, the associated stub process is killed and its memory is released back to the system.
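
As a concrete illustration of point 2, here is a rough way to observe the behavior from the shell. It assumes Triton is running in explicit model-control mode and uses `densenet_tf` as a placeholder model name; both are assumptions for illustration only:

```bash
# Placeholder model name; substitute a model from your repository.
MODEL=densenet_tf

# GPU memory in use while the model is loaded.
nvidia-smi --query-gpu=memory.used --format=csv

# Unload the model via Triton's model-control HTTP API
# (requires tritonserver to be started with --model-control-mode=explicit).
curl -X POST localhost:8000/v2/repository/models/${MODEL}/unload

# With an in-process TF backend, the reported usage typically does not drop
# back down, even though the model is no longer loaded.
nvidia-smi --query-gpu=memory.used --format=csv
```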

Describe the solution you'd like

  1. Similar to the Python backend, Triton can provide either a config.pbtxt option or a tritonserver command-line option to allow a particular backend/model to run out-of-proc from the tritonserver process (a hypothetical sketch of such an option follows this list).
  2. This may result in some performance degradation, since with multiple processes GPU scheduling effectively happens via time-division multiplexing. However, for customers with deterministic workloads and modest performance requirements this can still be useful, as it guarantees that GPU/CPU memory is released on model UNLOAD.
  3. There may be a couple of ways to do this:
     3.1 Make changes to Triton Core and allow any backend to be run out-of-proc.
     3.2 Do the undifferentiated heavy lifting on behalf of the customer and provide an option to run any backend via a Python process (i.e., the customer doesn't spend time migrating to the Python backend, writing Python code, etc.). For example, if the customer provides a TF SavedModel, config.pbtxt can still say 'tensorflow' backend, but an additional option marks it as out-of-proc, and Triton runs it via a Python process behind the scenes.
  4. Option 3.1 is preferable because it may provide better performance than running a Python interpreter; however, option 3.2 is also a good workaround if 3.1 is more challenging.
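To make option 1 / 3.1 concrete, here is a purely hypothetical sketch of what such an opt-in could look like in config.pbtxt. Neither the parameter key nor any equivalent command-line flag exists in Triton today; it only illustrates the shape of the requested option:

```protobuf
# Hypothetical config.pbtxt for the proposed opt-in -- NOT an existing Triton option.
name: "my_tf_model"        # placeholder model name
backend: "tensorflow"
max_batch_size: 8

parameters: {
  key: "RUN_OUT_OF_PROCESS"          # hypothetical key proposed by this RFC
  value: { string_value: "true" }
}
```

A server-wide variant could equally be a (hypothetical) tritonserver flag that opts an entire backend into out-of-proc execution.
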
tanmayv25 commented 1 year ago

@nskool If the user is not concerned about performance, then I would assume that they can run their models via the framework's Python API in the Python backend.
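
For reference, this is roughly what that workaround looks like today: a minimal sketch assuming a SavedModel stored under the model version directory and placeholder tensor names INPUT0/OUTPUT0 (the exact repository layout and signature handling will vary per model):

```python
# models/my_tf_model/1/model.py -- minimal Python-backend wrapper around a TF SavedModel.
import os

import numpy as np
import tensorflow as tf
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # Load the SavedModel from the model version directory
        # (path layout is an assumption; adjust to your repository).
        saved_model_dir = os.path.join(
            args["model_repository"], args["model_version"], "model.savedmodel"
        )
        self.model = tf.saved_model.load(saved_model_dir)
        self.infer = self.model.signatures["serving_default"]

    def execute(self, requests):
        responses = []
        for request in requests:
            # "INPUT0"/"OUTPUT0" are placeholder tensor names.
            in0 = pb_utils.get_input_tensor_by_name(request, "INPUT0").as_numpy()
            result = self.infer(tf.constant(in0))
            out0 = list(result.values())[0].numpy().astype(np.float32)
            responses.append(
                pb_utils.InferenceResponse(
                    output_tensors=[pb_utils.Tensor("OUTPUT0", out0)]
                )
            )
        return responses

    def finalize(self):
        # On unload the stub process exits, so TF's memory is returned to the OS.
        self.model = None
```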

nikhil-sk commented 1 year ago

@tanmayv25 While it's true that certain users may not be concerned about performance and could make do with the Python backend, the advantage of doing this is that customers can easily switch to an out-of-proc mode without having to write any Python code. This reduces friction for the user. Additionally, without perf tests we cannot be sure whether out-of-proc framework backends perform the same as the Python backend, worse, or better, IMO.

tanmayv25 commented 1 year ago

We have added an experimental platform handlers feature that is similar to solution 3.2: https://github.com/triton-inference-server/python_backend/tree/main/src/resources/platform_handlers/tensorflow_savedmodel
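
Based on the linked README, usage is roughly as follows; the field values below mirror that README, but see it for the authoritative layout:

```protobuf
# config.pbtxt -- rough sketch of the experimental platform-handler usage.
# Repository layout (assumed): models/my_tf_model/1/model.savedmodel/...
name: "my_tf_model"                  # placeholder model name
backend: "python"
platform: "tensorflow_savedmodel"    # selects the experimental TF SavedModel handler
max_batch_size: 8
```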

Additionally, with more research and experimentation, we found that using jemalloc instead of the default malloc resolves most of the memory issues seen with the in-process TF backend. Documentation on how to use jemalloc: https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_management.md#model-control-mode-explicit
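
For completeness, the usual way to pick up jemalloc is to preload it when launching tritonserver; the package name and library path below are typical for the Ubuntu-based Triton containers and are assumptions for other environments:

```bash
# Install jemalloc if it is not already present (Ubuntu/Debian).
apt-get update && apt-get install -y libjemalloc2

# Preload jemalloc so freed backend memory is returned to the OS more aggressively.
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 \
  tritonserver --model-repository=/models
```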