triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Tensorflow models: Add support to specify multiple signatures in the configuration file and specify which one to use in the inference request #6647

Open NiklasA11 opened 9 months ago

NiklasA11 commented 9 months ago

First of all, thank you very much for all the effort put into this project. From what I have seen over the past couple of weeks of investigating it, I am really impressed by its state and performance!

When exporting TensorFlow models, it is possible to add multiple "signatures"/"entry points" to the same model. This is very convenient when you want, for example, additional functionality from the same model.

One use case is when you have a trained network and want to be able to get either a probability, or the most likely class, maybe a confidence interval, maybe some embeddings from an intermediate layer, or whatever. Another use case is when you have additional signatures that expose different kinds of "metadata" about the model, or you could have one "real-time inference" signature and another "debug" signature that provides more information and data from intermediate steps in the model to help with debugging the request.

Regardless of the use case (how valid it is or not is a discussion for elsewhere), it would be very beneficial, and would bring Triton's functionality on par with what is available in TensorFlow Serving, if it were possible to set up inputs and outputs for all available signatures in your config.pbtxt file and specify which signature to use in an inference request.

A config would probably need to include something like:

{
  ...
  "signatures": {
    "signature_1": {
      "input": [...],
      "output": [...]
    },
    "signature_2": {
      "input": [...],
      "output": [...]
    }
  },
  ...
}

The client would then need to specify the signature name in the request, either in the POST data or in the URL (POST /.../model_name/signature_name/...), whichever is easier to implement.
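
For reference, this is roughly how TensorFlow Serving lets clients do this today: the REST predict API accepts a signature_name field in the request body. A minimal sketch (host, port, model name and input values are placeholders):

import requests

payload = {
    "signature_name": "embeddings",      # which exported signature to run
    "instances": [[0.1, 0.2, 0.3]],      # inputs for that signature
}
resp = requests.post("http://localhost:8501/v1/models/my_model:predict", json=payload)
print(resp.json())

A Triton equivalent would presumably carry a similar field in the inference request body or in the URL, as proposed above.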

I know the workaround described in another issue is to have multiple config files, one for each signature, but that adds overhead by loading the same model into memory multiple times. If I have a 7 GB model and 8 signatures, that is 56 GB of memory just to keep 8 copies of the model in memory to be able to serve all the signatures. (Yes, this is a real use case; questions/tips like "does it really make sense to have that many signatures" are not helpful. :-) )

I know the comment in an older issue on a similar topic (https://github.com/triton-inference-server/server/issues/3795) was along the lines of "unlikely to be fixed because it would require changes to the model config and repository architecture", but that was almost 2 years ago, so maybe the architecture has changed in the meantime in a way that makes this more likely to be implemented now.

I know many people with many use cases who would benefit enormously from this functionality. From our investigation of Triton as an alternative, this is more or less the only real drawback holding us back from migrating fully to Triton, and it is a bit of a shame that such a complete project is missing such a feature when signatures are a "real thing" in the TensorFlow world.

I would be happy to contribute by thinking through how a config would ideally look, etc., but unfortunately my C++ skills haven't been used in too long.

Happy to hear your thoughts, feel free to ask if anything is unclear.

References: https://blog.tensorflow.org/2021/03/a-tour-of-savedmodel-signatures.html

kthui commented 9 months ago

Hi @NiklasA11, thanks for the suggestion! From a quick reading of our current implementation, I think a signature of the model is picked when the model is loaded on Triton, and that signature is the same for all instances of the model. Your suggestion is to allow each inference request to select a signature of the model, so the model can behave differently on each inference request depending on the signature picked by the request. The main benefit is memory savings.

@tanmayv25 @nnshah1 what do you think about this feature request?

NiklasA11 commented 9 months ago

Hi @kthui !

Thank you very much for your response!

Indeed, I would like to be able to specify in each request which signature should be used.

I don't know if there are things that will "break" in other parts of the code by doing this, maybe something with the dynamic batching or some detail I am not aware of, but as a user, it would make life a lot easier (and cheaper when it comes to AWS costs) if I just need to load one copy* of the model and specify the signature at inference time.

*Yes, I know it is possible to configure a model so that it is loaded multiple times for higher throughput, but that is not the kind of copy we are talking about here.

nnshah1 commented 9 months ago

@NiklasA11 - Is it possible to provide a superset of inputs/outputs and make some or all of them optional? As a workaround?

Support for custom endpoints/routes is on our radar, but I think it will take some time to design (I haven't looked at TF signatures specifically, but am rather thinking of the general capability to provide additional routes for a model).

Another way to avoid the cost of multiple loads would be to have multiple BLS or ensemble models all target the same underlying TF model. That way you can create separate 'wrapper' models and only load the main model once.

We have recently added support for "Python-based backends", and if you created a BLS you could also reuse it generically to support signatures. That is, the BLS model could auto-complete specific inputs/outputs based on the config file and take its configuration from there. Using model loading, you could also potentially load the TF model itself if it is not already loaded.

That would also be a good way to prototype the feature.
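
For illustration, a rough sketch of what such a Python-based backend could look like: load the SavedModel once in initialize() and dispatch to the signature named by an extra string input. The tensor names ("INPUT", "SIGNATURE_NAME"), the model path, and the single-input-per-signature assumption are all illustrative, not existing Triton conventions:

import tensorflow as tf
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # Load the SavedModel a single time for this model instance.
        self.saved_model = tf.saved_model.load("/models/example_model/1/model.savedmodel")
        self.signatures = self.saved_model.signatures

    def execute(self, requests):
        responses = []
        for request in requests:
            # Signature to run, sent by the client as a BYTES tensor of shape [1].
            sig_name = pb_utils.get_input_tensor_by_name(
                request, "SIGNATURE_NAME").as_numpy()[0].decode()
            data = pb_utils.get_input_tensor_by_name(request, "INPUT").as_numpy()

            # Look up the selected concrete function and feed the single input
            # under whatever name that signature expects.
            sig_fn = self.signatures[sig_name]
            _, input_specs = sig_fn.structured_input_signature
            input_name = next(iter(input_specs))
            outputs = sig_fn(**{input_name: tf.constant(data)})

            out_tensors = [pb_utils.Tensor(name, value.numpy())
                           for name, value in outputs.items()]
            responses.append(pb_utils.InferenceResponse(output_tensors=out_tensors))
        return responses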

NiklasA11 commented 9 months ago

> @NiklasA11 - Is it possible to provide a superset of inputs/outputs and make some or all of them optional? As a workaround?

I am not sure I follow, could you please elaborate?

> Support for custom endpoints/routes is on our radar, but I think it will take some time to design (I haven't looked at TF signatures specifically, but am rather thinking of the general capability to provide additional routes for a model).

> Another way to avoid the cost of multiple loads would be to have multiple BLS or ensemble models all target the same underlying TF model. That way you can create separate 'wrapper' models and only load the main model once.

Yeah, theoretically yes, but this is very inconvenient because:

1: I already have a lot of logic encapsulated in the exported model and its different signatures. By creating wrapper models in Triton, I would need to duplicate the logic I already have built into the TensorFlow model in some Triton ensemble model, wouldn't I?

2: The different signatures don't necessarily all need to query the same underlying model. For example, we could have the following signatures to expose different data on different endpoints in our API. (Note that all signatures are in the same model export to make sure everything from preprocessing to thresholding and metadata is packaged together, so that when a new model is loaded we get the new embeddings, the new threshold, etc., in one go.)

We have trained one embedding model that can handle text, image, or "text and image". We have the following signatures on the exported TensorFlow model:

So we have all in all 13 different signatures on the same model. The beauty of TensorFlow Serving is that it just loads the model and lets us specify at request time which signature to use. The workaround of "creating different wrappers" would mean I need to duplicate a lot of the logic I already have in the exported model and keep the two versions in sync with each other (i.e. the logic in the model and the logic deployed to Triton). That becomes a nightmare compared to just having one model artifact that is loaded and lets me specify the signature to use at request time. We want to keep as much logic as possible encapsulated in the TensorFlow export and not move any logic into Triton, to make versioning etc. easier.

nnshah1 commented 9 months ago

Thanks for the details! I will take a closer look at signatures. This looks like something that would have to be implemented in the TensorFlow backend directly. It's still unlikely to be addressed in the near future, but we will keep this open until we have an updated understanding of the scope.

NiklasA11 commented 9 months ago

Thanks! Feel free to reach out if you have more questions or comments or anything whatsoever related to this.

NiklasA11 commented 7 months ago

Hi,

Has this become more or less likely to be picked up at some point in time? We would really really appreciate it as it would allow us to fully migrate to Triton.

Thanks in advance, Niklas

nnshah1 commented 7 months ago

As a clarification: if you could load the model once, but still have multiple configs (one for each signature) - would that be sufficient?

I'm thinking that this could potentially be a feature added specifically to the TensorFlow backend, so that Triton model instances can share a single TensorFlow model internally.

That way, from a Triton perspective, there wouldn't need to be a change in how models are selected or in the API - just multiple signatures exposed as separate models?

NiklasA11 commented 7 months ago

Hi!

Well, I would need to specify the inputs/outputs somehow for each signature anyway, so I don't really care if I need to do that in the same file, in multiple files in the same folder, or in multiple files in multiple folders.

My main concern is that I do not want to keep 4 x 2 GB copies of the model to support signatures that are used once or twice a day to fetch some metadata for a dashboard, a threshold value we need for the vector search, or some internal numbers for debugging. That feels like a waste of space; that RAM could be used to process requests rather than keeping 8 GB of "I'm here if you need me" once-a-day data.

I know TensorFlow Serving has quite some memory overhead (a 750 MB model grows to 3 GB of RAM, so if the memory overhead in Triton is smaller, this might still be "ok"), but I just feel there may be a more optimized way to do this than loading so many unnecessary copies into memory.

Thanks in advance, Niklas

kmkolasinski commented 5 months ago

Hi @nnshah1, I can relate; not being able to pick the function signature at runtime is a major blocker for us as well. Here is example code which creates a simple SavedModel with two signatures and demonstrates the usefulness of this feature.

In our case, we may have like 4 to 6 different signatures for the same model. We heavily use this feature with TFServing.

import tensorflow as tf
from tensorflow.python.saved_model.model_utils import get_timestamped_export_dir


class Module(tf.Module):
    """Toy module exposing two entry points: raw embeddings and class predictions."""

    def __init__(self):
        super().__init__()
        self.x0 = tf.random.uniform([3, 3])
        self.x1 = tf.random.uniform([3, 3])

    @tf.function(input_signature=(tf.TensorSpec([None, 3], tf.float32),))
    def embeddings(self, y):
        # First projection only, useful e.g. for vector search.
        return {"embeddings": y @ self.x0}

    @tf.function(input_signature=(tf.TensorSpec([None, 3], tf.float32),))
    def predict(self, y):
        # Full pipeline: embeddings -> logits -> softmax -> top class and score.
        embeddings = self.embeddings(y)["embeddings"]
        logits = embeddings @ self.x1
        probs = tf.nn.softmax(logits)
        classes = tf.argmax(probs, axis=-1)
        scores = tf.reduce_max(probs, axis=-1)
        return {"classes": classes, "scores": scores}


module = Module()
export_dir = "/tmp/example-module"
# Export both functions as named signatures of the same SavedModel.
signatures = {
    "embeddings": module.embeddings.get_concrete_function(),
    "predict": module.predict.get_concrete_function(),
}
export_dir = get_timestamped_export_dir(str(export_dir)).decode()
tf.saved_model.save(module, export_dir, signatures=signatures)

Running the predictions

predictor = tf.saved_model.load(export_dir)
predictor.embeddings(tf.random.uniform([2, 3]))
predictor.predict(tf.random.uniform([2, 3]))
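
For completeness, the exported signatures can also be invoked through the .signatures mapping of the loaded object, which is essentially what a serving layer would dispatch on (the y= keyword comes from the argument name in the module above):

loaded = tf.saved_model.load(export_dir)
print(list(loaded.signatures.keys()))  # e.g. ['embeddings', 'predict']
# Signature functions take keyword arguments and return a dict of tensors.
loaded.signatures["predict"](y=tf.random.uniform([2, 3]))
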
nnshah1 commented 5 months ago

Thanks @kmkolasinski and @NiklasA11 for the details. Unfortunately we haven't had a chance to scope the effort on this yet, and we do not currently have a timeline. I definitely see the value in the feature - any chance you'd be able to scope out the changes required in the TensorFlow backend?

kmkolasinski commented 5 months ago

Thanks for the reply. Do you have some docs on how to contribute? Do you see any potential issues with runtime signatures, e.g. with dynamic batching, given that each signature may expect different inputs and outputs, which may complicate things?

I know the problem can be solved by using a SavedModel inside the Triton Python backend (with a superset of optional tensors in the inputs and outputs). I would need to benchmark it first, but it seems to be the easiest solution to implement without changing the Triton server source code.

nnshah1 commented 5 months ago

My first inclination would be to look at:

https://github.com/triton-inference-server/tensorflow_backend

And not the python backend - the reason is that the python backend spawns independent processes for each model / model instance.

Generically, there are two ways to approach this, I think:

Option a (doable without changes, and could be encapsulated in a new 'Python-based backend'): with Python introspection this could probably be fairly automated, adding the signatures at runtime via the Python backend's auto-complete config features.

1) Create a model with a superset of inputs/outputs (or a serialized generic byte tensor blob) that internally uses the different signatures.
2) Create proxy models with the appropriate signatures which then forward to the superset model with the appropriate flag.

Option b: Update the TensorFlow backend.

1) Add support for multiple TF models to share a single internal model.
2) Create proxy models with appropriate signatures which, via configuration alone, use the same loaded model.

The second option would be more performant, as everything stays in the same process (Python models each run in a separate process per model).

By using proxy models, all the batching etc. in the core would remain the same - to the core they are just multiple models - and the backend is where the sharing logic would be implemented.
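
To make option (a) slightly more concrete, here is a rough sketch of one of the per-signature "proxy" models as a Python-backend model that forwards to a shared superset model via BLS. The model name ("shared_tf_model"), the tensor names and the hard-coded signature are placeholder assumptions, not an actual design:

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # The signature this proxy represents; in practice this could be read
        # from the model config (e.g. a parameters entry).
        self.signature_name = "embeddings"

    def execute(self, requests):
        responses = []
        for request in requests:
            data = pb_utils.get_input_tensor_by_name(request, "INPUT").as_numpy()
            sig = pb_utils.Tensor(
                "SIGNATURE_NAME",
                np.array([self.signature_name.encode()], dtype=np.object_),
            )
            # BLS call into the single shared model that actually owns the weights.
            bls_request = pb_utils.InferenceRequest(
                model_name="shared_tf_model",
                requested_output_names=["embeddings"],
                inputs=[pb_utils.Tensor("INPUT", data), sig],
            )
            bls_response = bls_request.exec()
            if bls_response.has_error():
                raise pb_utils.TritonModelException(bls_response.error().message())
            out = pb_utils.get_output_tensor_by_name(bls_response, "embeddings")
            responses.append(pb_utils.InferenceResponse(
                output_tensors=[pb_utils.Tensor("embeddings", out.as_numpy())]))
        return responses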

kmkolasinski commented 5 months ago

Thanks, I was thinking about option a) - maybe I was just not clear enough :) I think I will try this one, as it seems the easiest way to start for me. The Python backend is not that bad if we manage to minimize the Python-related overhead; a TensorFlow SavedModel can utilize multiple threads within a single Python worker, and we can have multiple models loaded into a single Python worker. It's not in my backlog now, but I will definitely try this idea some day.

nnshah1 commented 5 months ago

@NiklasA11 - would a python based solution work for your needs as well?

NiklasA11 commented 5 months ago

Hi there,

Really happy to see some traction on this one!

I haven't had the time to run any benchmarks on what the performance would be when running a TensorFlow model through a Python backend vs. the TensorFlow backend provided with Triton. I believe, however, that as long as we keep the Python overhead to a minimum, performance should still be acceptable.

In another project we are using Python backends to generate embeddings with the Transformers module, and the overhead we see there is barely noticeable, so it should probably be good enough in this case as well. But as usual, there's only one way to find out :-)

I am not sure how dynamic batching would work if requests come to the same TensorFlow model with different "flags"/signatures... I guess we'll have to figure that out as we go.

kmkolasinski commented 5 months ago

Hi, I managed to find some time to create a simple demo of how one can use the Python backend to simulate a TFServing-like multi-model and multi-signature API. The model.py code is quite simple and makes some assumptions, so it is definitely not a general-purpose solution.

Here is my code: https://github.com/kmkolasinski/triton-saved-model

As you can see, even for very simple models like ResNets and EfficientNetB0 the overhead is not that bad, especially for XLA/AMP-compiled models.

I was not able to test TFServing with XLA/AMP-compiled models; for some reason I get weird CUDA errors. Either I am doing something wrong or TFServing does not support XLA-compiled models. Do you have any experience with this?

NiklasA11 commented 5 months ago

Hi all,

@kmkolasinski Thank you for taking some time and putting some effort into this!

I had a quick look at the code in your repo and there are a few things that I may be misunderstanding, but I thought it's at least worth asking:

In the model.py (https://github.com/kmkolasinski/triton-saved-model/blob/main/models/saved_model/1/model.py) you have a function called batch_predict:

def batch_predict(self, inputs_list):
    predictions = []
    for predictor, inputs in inputs_list:
        outputs = predictor(**inputs)
        predictions.append({k: v.numpy() for k, v in outputs.items()})
    return predictions

This may be me not getting all the details, but it looks like there is no real "batching" taking place here. Wouldn't it be better performance-wise to build up the input tensor/array and then pass all of it at once to the predictor? This may be complicated by the fact that each input may point to a different model.

If my input list contains 512 different elements, this would call the predictor 512 times, wouldn't it? As opposed to creating one input of size 512 and calling the predictor once with that single input, which would be much more performant.

kmkolasinski commented 5 months ago

Hi @NiklasA11, yes, I wanted to keep it simple here; however, in order to enable batching you can group all 512 requests by (model_name, signature) and batch the requests within each group separately.

Note that in my example I was able to get results similar to TFServing with this simple approach. Also, you mentioned you are expecting 512 requests in a batch, but in my case it was much less, around a few requests. The benefits of batching will depend on many factors, like model size, your GPU, the Triton configuration and the traffic.
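
For illustration, a minimal sketch of that grouping idea, assuming each queued request contributes a single row and that inputs within a group can be concatenated along the batch dimension (the names are illustrative, not taken from the repo):

from collections import defaultdict

import tensorflow as tf


def batch_predict_grouped(inputs_list):
    # inputs_list: list of (key, predictor, inputs) where key = (model_name, signature)
    # and inputs is a dict of tensors, each with a leading batch dimension of 1.
    groups = defaultdict(list)
    for index, (key, predictor, inputs) in enumerate(inputs_list):
        groups[key].append((index, predictor, inputs))

    predictions = [None] * len(inputs_list)
    for key, items in groups.items():
        predictor = items[0][1]
        # Concatenate the per-request tensors into one batch per input name.
        batched = {
            name: tf.concat([inputs[name] for _, _, inputs in items], axis=0)
            for name in items[0][2]
        }
        outputs = predictor(**batched)
        # Split each output back into per-request slices, preserving order.
        for offset, (index, _, _) in enumerate(items):
            predictions[index] = {
                k: v[offset:offset + 1].numpy() for k, v in outputs.items()
            }
    return predictions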