QMassoz opened this issue 3 years ago
By default, Triton creates one instance (copy) of each model on each available GPU. The model instance used for any given inference request (including requests internal to an ensemble) is not configurable or controllable from the client. We understand your request for an "ensemble affinity" option and will make a note of it, but at this point it is unlikely to be implemented any time soon.
Here are a couple of other ways to achieve what you want: have the client do the load balancing itself (for example by sending requests to a separate Triton instance per GPU), or restrict every model used by the ensemble to a specific GPU in its model configuration so that the whole pipeline stays on one device.
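A minimal sketch of what the second, GPU-pinning option could look like in a model configuration; the model name and platform are placeholders borrowed from the reproduction below, and every model referenced by the ensemble would need the same restriction:

```
# config.pbtxt (placeholder model name and platform)
name: "model_A"
platform: "pytorch_libtorch"
instance_group [
  {
    # Only create instances on GPU 0, so every ensemble step that uses this
    # model necessarily runs on that device.
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
```

The obvious cost, raised in the reply below, is that the pinned models can no longer be spread across the other GPUs.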
@deadeyegoodwin I think the solutions you provided can take away from Triton's abilities. For example, your first solution expects the client to be the load balancer; if I run Triton on k8s, I expect k8s to be the load balancer. Your second solution can take away both Triton's ability to use the GPUs with all the optimizations there (running multiple models on the same GPU outside of the ensemble, running a model on multiple GPUs) and k8s's ability to avoid being tied to specific hardware.
If the difference between running an ensemble on a single GPU and running it across many GPUs is really big (I didn't test it personally, but it sounds like it should be), this seems like something that should be fixed.
So I wonder: why is this unlikely to be fixed soon? Is it hard to implement, just not important enough, or is the backlog simply too full at the moment?
Thanks 😄
@deadeyegoodwin Server-side load balancing over multiple devices with a single entrypoint is the key feature of tritonserver. Let me explain my use case in the hope that this feature gets pushed up in the priority list ;)
I have a tensorrt model that takes a chw fp16 image and outputs a chw fp16 image. However, I want to send requests and receive responses in hwc/chw uint8/uint16 formats, so pre- and post-processing is needed for format conversion.
At the moment, I have 4 tensorrt models in the repository (one per format) where the format conversion is implemented in custom tensorrt plugins. The problem is that I need to generate 4 tensorrt engines, store 4 tensorrt engines in the model repository, and tritonserver will create 4 tensorrt instances per GPU. It runs very well with minimal memory transfers, but it wastes disk space and GPU memory.
An ensemble of models would have cleanly solved this problem by creating, per GPU, 1 tensorrt instance plus 4 pre- and 4 post-processing instances (e.g. in a custom backend). It just lacks the ability to keep the matched instances on the same device.
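To make that concrete, here is a rough sketch of what one of the four ensembles could look like; all model names (preprocess_hwc_u8, trt_model, postprocess_hwc_u8) and tensor names are made up for illustration, and the ensemble's own input/output declarations are omitted:

```
name: "ensemble_hwc_u8"
platform: "ensemble"
ensemble_scheduling {
  step [
    {
      # hwc uint8 -> chw fp16 (custom pre-processing backend)
      model_name: "preprocess_hwc_u8"
      model_version: -1
      input_map  { key: "RAW_IMAGE"  value: "INPUT_IMAGE" }
      output_map { key: "FP16_IMAGE" value: "preprocessed" }
    },
    {
      # single shared tensorrt engine, instantiated once per GPU
      model_name: "trt_model"
      model_version: -1
      input_map  { key: "INPUT"  value: "preprocessed" }
      output_map { key: "OUTPUT" value: "inferred" }
    },
    {
      # chw fp16 -> hwc uint8 (custom post-processing backend)
      model_name: "postprocess_hwc_u8"
      model_version: -1
      input_map  { key: "FP16_RESULT" value: "inferred" }
      output_map { key: "RAW_RESULT"  value: "OUTPUT_IMAGE" }
    }
  ]
}
```

With four such ensembles sharing the same trt_model, the engine is stored and instantiated only once; the missing piece is exactly the device affinity discussed here, i.e. a guarantee that the steps of a given request are scheduled on instances living on the same GPU.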
While waiting for such a feature, I see only one solution: write a custom backend that does everything end-to-end (pre- and post-processing plus the tensorrt execution). Given that we cannot specify multiple datatypes in the model configuration, I would need to pass the format of the request as an input (otherwise I would again need multiple models per GPU). The big downside is that I would not benefit from all the work that was put (and will be put) into the tensorrt backend.
I marked this issue as an enhancement request, which means that we accept your suggestion that this would be a good enhancement. We don't have any timeline as to if/when we will implement it. It would not be a minor change.
Description The ensemble scheduler unpredictably executes the models of the same ensemble on different GPUs. For example, given a sequential ensemble model like ensemble_AB = model_A -> model_B, it often happens that model_A executes on GPU 0 and model_B executes on GPU 1, introducing unwelcome memory transfers and poor performance.
Triton Information I observed this bug with these official Triton containers (I expect this bug to exist in all versions):
To Reproduce Requirements:
Steps to reproduce the behavior:
pipenv install
pipenv run python make_pytorch_models.py
./run-server
pipenv run python client.py
Description: The provided model repository is composed of 3 models: model_A, model_B, and the ensemble ensemble_AB described above.
The provided client.py sends requests to the triton server and prints red text when the bug occurs.
Note:
Expected behavior An ensemble of models should execute on the same device, or we should be able to enforce it (via a parameter in ensemble_scheduling).
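For illustration only, the request roughly amounts to something like the following; the same_device field is a hypothetical name that does not exist in Triton's ensemble_scheduling today, and the tensor names are placeholders:

```
name: "ensemble_AB"
platform: "ensemble"
ensemble_scheduling {
  # Hypothetical option (not part of Triton): schedule every step of a given
  # request on model instances that live on the same GPU.
  # same_device: true
  step [
    {
      model_name: "model_A"
      model_version: -1
      input_map  { key: "INPUT"  value: "ENSEMBLE_INPUT" }
      output_map { key: "OUTPUT" value: "A_to_B" }
    },
    {
      model_name: "model_B"
      model_version: -1
      input_map  { key: "INPUT"  value: "A_to_B" }
      output_map { key: "OUTPUT" value: "ENSEMBLE_OUTPUT" }
    }
  ]
}
```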