QMassoz opened this issue 3 years ago
By default, Triton creates one instance (copy) of each model on each available GPU. The model instance used for any given inference request (including requests internal to an ensemble) is not configurable or controllable from the client. We understand your request for an "ensemble affinity" option and will make a note of it, but at this point it is unlikely to be implemented any time soon.
Here are a couple of other ways to achieve what you want: have the client do the load balancing itself (for example by sending requests to a separate Triton instance per GPU), or restrict every model used by the ensemble to a specific GPU in its model configuration so that the whole pipeline stays on one device.
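A minimal sketch of what the second, GPU-pinning option could look like in a model configuration; the model name and platform are placeholders borrowed from the reproduction below, and every model referenced by the ensemble would need the same restriction:

```
# config.pbtxt (placeholder model name and platform)
name: "model_A"
platform: "pytorch_libtorch"
instance_group [
  {
    # Only create instances on GPU 0, so every ensemble step that uses this
    # model necessarily runs on that device.
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
```

The obvious cost, raised in the reply below, is that the pinned models can no longer be spread across the other GPUs.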
@deadeyegoodwin I think the solutions you provided can take away from Triton's abilities. For example, your first solution expects the client to be the load balancer; if I run Triton on k8s, I expect k8s to be the load balancer. Your second solution can take away both Triton's ability to use the GPUs with all the optimizations there (running multiple models on the same GPU outside of the ensemble, running a model on multiple GPUs) and k8s's ability to avoid being tied to specific hardware.
If the difference between running an ensemble on a single GPU and running it across many GPUs is really big (I didn't test it personally, but it sounds like it should be), this seems like something that should be fixed.
So I wonder: why is this unlikely to be fixed soon? Is it hard to implement, just not important enough, or is the backlog simply too full at the moment?
Thanks 😄
@deadeyegoodwin Server-side load balancing over multiple devices with a single entrypoint is the key feature of tritonserver. Let me explain my use case in the hope that this feature gets pushed up in the priority list ;)
I have a tensorrt model that takes a chw fp16 image and outputs a chw fp16 image. However, I want to send requests and receive responses in hwc/chw uint8/uint16 formats, so pre- and post-processing is needed for format conversion.
At the moment, I have 4 tensorrt models in the repository (one per format) where the format conversion is implemented in custom tensorrt plugins. The problem is that I need to generate 4 tensorrt engines, store 4 tensorrt engines in the model repository, and tritonserver will create 4 tensorrt instances per GPU. It runs very well with minimal memory transfers, but it wastes disk space and GPU memory.
An ensemble of models would have cleanly solved this problem by creating, per GPU, 1 tensorrt instance plus 4 pre- and 4 post-processing instances (e.g. in a custom backend). It just lacks the ability to keep the matched instances on the same device.
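To make that concrete, here is a rough sketch of what one of the four ensembles could look like; all model names (preprocess_hwc_u8, trt_model, postprocess_hwc_u8) and tensor names are made up for illustration, and the ensemble's own input/output declarations are omitted:

```
name: "ensemble_hwc_u8"
platform: "ensemble"
ensemble_scheduling {
  step [
    {
      # hwc uint8 -> chw fp16 (custom pre-processing backend)
      model_name: "preprocess_hwc_u8"
      model_version: -1
      input_map  { key: "RAW_IMAGE"  value: "INPUT_IMAGE" }
      output_map { key: "FP16_IMAGE" value: "preprocessed" }
    },
    {
      # single shared tensorrt engine, instantiated once per GPU
      model_name: "trt_model"
      model_version: -1
      input_map  { key: "INPUT"  value: "preprocessed" }
      output_map { key: "OUTPUT" value: "inferred" }
    },
    {
      # chw fp16 -> hwc uint8 (custom post-processing backend)
      model_name: "postprocess_hwc_u8"
      model_version: -1
      input_map  { key: "FP16_RESULT" value: "inferred" }
      output_map { key: "RAW_RESULT"  value: "OUTPUT_IMAGE" }
    }
  ]
}
```

With four such ensembles sharing the same trt_model, the engine is stored and instantiated only once; the missing piece is exactly the device affinity discussed here, i.e. a guarantee that the steps of a given request are scheduled on instances living on the same GPU.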
While waiting for such a feature, I see only one solution: write a custom backend that does everything end-to-end (pre- and post-processing plus the tensorrt execution). Given that we cannot specify multiple datatypes in the model configuration, I would need to pass the format of the request as an input (otherwise I would again need multiple models per GPU). The big downside is that I would not benefit from all the work that was put (and will be put) into the tensorrt backend.
I marked this issue as an enhancement request, which means that we accept your suggestion that this would be a good enhancement. We don't have any timeline as to if/when we will implement it. It would not be a minor change.
Description The ensemble scheduler unpredictably executes the models of the same ensemble on different GPUs. For example, given a sequential ensemble model like ensemble_AB = model_A -> model_B, it often happens that model_A executes on GPU 0 and model_B executes on GPU 1, introducing unwelcome memory transfers and poor performance.
Triton Information I observed this bug with these official Triton containers (I expect this bug to exist in all versions):
To Reproduce Requirements:
Steps to reproduce the behavior:
pipenv install
pipenv run python make_pytorch_models.py
./run-server
pipenv run python client.py
Description: The provided model repository is composed of 3 models: model_A, model_B, and the ensemble ensemble_AB described above.
The provided client.py sends requests to the triton server and prints red text when the bug occurs.
Note:
Expected behavior An ensemble of models should execute on the same device, or we should be able to enforce it (via a parameter in ensemble_scheduling).
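For illustration only, the request roughly amounts to something like the following; the same_device field is a hypothetical name that does not exist in Triton's ensemble_scheduling today, and the tensor names are placeholders:

```
name: "ensemble_AB"
platform: "ensemble"
ensemble_scheduling {
  # Hypothetical option (not part of Triton): schedule every step of a given
  # request on model instances that live on the same GPU.
  # same_device: true
  step [
    {
      model_name: "model_A"
      model_version: -1
      input_map  { key: "INPUT"  value: "ENSEMBLE_INPUT" }
      output_map { key: "OUTPUT" value: "A_to_B" }
    },
    {
      model_name: "model_B"
      model_version: -1
      input_map  { key: "INPUT"  value: "A_to_B" }
      output_map { key: "OUTPUT" value: "ENSEMBLE_OUTPUT" }
    }
  ]
}
```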