triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html

Configurable Parallel Model Loading (Python backend) #7094

Closed: tnesztler closed this issue 4 months ago

tnesztler commented 6 months ago

Description: On the 23.07 release, we are able to load multiple models, and multiple instances of each model. We use:

For example, on top of other models, we can load 7 instances of Whisper large-v3 on a 42GB-RAM instance. This works as intended since all models are loaded sequentially. When moving to 23.08 and up, concurrent loading for the Python backend was activated:

`RETURN_IF_ERROR(TRITONBACKEND_BackendAttributeSetParallelModelInstanceLoading(backend_attributes, true));`

As a result, we get an OutOfMemory (OOM) error, though it does not appear in the logs as such; it is reported as an unhealthy model instance.

How can we batch the loading of model instances to prevent an OOM error (say, 2 by 2 for a specific model while keeping the rest in parallel)?
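A possible partial workaround for whole models, sketched under the assumption that the server runs with `--model-control-mode=explicit`: loads can be issued in small batches through the model repository API. Note that this only staggers loads across models; it does not throttle how many instances of a single model the server loads in parallel. The model names below are placeholders.

```bash
# Sketch: load models in batches of two via Triton's model repository API.
# Assumes the server was started with --model-control-mode=explicit and
# serves HTTP on localhost:8000. Model names are placeholders.
models=(model_a model_b whisper_large_v3 model_c)

for ((i = 0; i < ${#models[@]}; i += 2)); do
  for m in "${models[@]:i:2}"; do
    # POST /v2/repository/models/<name>/load triggers an explicit load.
    curl -s -X POST "localhost:8000/v2/repository/models/${m}/load" &
  done
  wait  # let this pair finish loading before starting the next pair
done
```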

Triton Information: Triton 23.08 and up.

To Reproduce: 7 instances of Whisper Large v3 on a 32GB-VRAM GPU, on a server with 42GB of RAM.
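For context, the instance_group section of such a model's config.pbtxt might look like the following (the values are illustrative, not the reporter's actual configuration):

```
# Illustrative config.pbtxt for a Python-backend Whisper large-v3 model.
# Values are placeholders, not taken from this issue.
name: "whisper_large_v3"
backend: "python"
instance_group [
  {
    count: 7
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
```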

Expected behavior: I expect to have an option to disable parallel loading in 23.08 and up, for backwards compatibility with 23.07 for the Python backend, or, even better, the ability to set how many instances of a specific model are loaded in parallel (not necessarily the same number as in the instance_group config).

rmccorm4 commented 6 months ago

Hi @tnesztler,

Thanks for raising this issue with such detail!

> I expect to have an option to disable parallel loading in 23.08 and up, for backwards compatibility with 23.07 for the Python backend

You should be able to disable parallel model instance loading via an environment variable for now, such as:

export TRITON_PARALLEL_INSTANCE_LOADING=0

I will add a note to the documentation to include this.
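For anyone launching Triton via the NGC container, the variable would need to be passed into the container environment; a minimal sketch, where the image tag and model repository path are placeholders:

```bash
# Sketch: pass the variable into the Triton container at launch.
# Image tag and /models paths are placeholders, not taken from this issue.
docker run --rm --gpus=all \
  -e TRITON_PARALLEL_INSTANCE_LOADING=0 \
  -v /path/to/model_repository:/models \
  nvcr.io/nvidia/tritonserver:24.01-py3 \
  tritonserver --model-repository=/models
```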

> or, even better, the ability to set how many instances of a specific model are loaded in parallel (not necessarily the same number as in the instance_group config)

This would be a feature request for us to prioritize and follow up on (ref: DLIS-6504).

Is the use case here that significantly more memory is required during the process of loading the model than is actually needed after the load completes? I assume since the 7 instances fit within 42GB when loaded sequentially, there is some significant amount of temp space used during load that causes OOM when loaded in parallel. Adding any additional details here would help clarify as well.