pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

Support for ensemble of models served in torchserve #682

Closed: prashantsail closed this issue 3 years ago

prashantsail commented 4 years ago

Is your feature request related to a problem? Please describe.

Describe the solution

prashantsail commented 4 years ago

Support for Ensemble Models in TorchServe

What are ensemble models?

- An ensemble model represents a pipeline of models and the flow of data between them.
- This can reduce the overhead of transferring intermediate tensors and minimize the number of requests that must be sent to TorchServe.

Approach

  1. An ensemble will be a logical model representing a pipeline of the required models.
  2. From an end user's perspective:
    • A single ensemble model is registered.
    • Synchronous inference requests are executed and the corresponding responses returned.
    • The pipeline of models, along with the intermediate tensors, is a black box.
  3. Internally, though, the frontend layer orchestrates requests/responses to the models in the pipeline.
  4. The ensemble model will not interfere with the configuration/updating of the models in the pipeline.

Impact on existing frontend APIs

Registration API

  1. All models in the ensemble have to be independently registered first.
  2. All models in the ensemble have to be scaled up to at least 1 worker.
  3. Since the ensemble is a logical model, the registration API parameters behave slightly differently (a registration sketch follows this list)
    1. ensemble: (New) A JSON array representing the pipeline of models
    2. model_name: Name of the model
    3. batch_size: Inference batch size
    4. max_batch_delay: Maximum delay for batch aggregation
    5. response_timeout: Time within which the inference response is expected
    6. handler: Ignored, as all orchestration will be in the frontend layer
    7. url: Ignored, as no .mar file is needed
    8. initial_workers: Ignored, as this model will not have any workers
    9. synchronous: Ignored, as this model will not have any workers
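
For illustration, a hedged sketch of what registering an ensemble could look like if the proposed `ensemble` parameter were added to the existing registration API. The parameter name, pipeline format, and port are assumptions taken from this proposal, not an implemented API.

```python
# Hypothetical registration of an ensemble model via the management API (port 8081),
# assuming the proposed `ensemble` parameter is accepted. M1, M2, M3 must already be
# registered and scaled to at least one worker.
import json
import requests

resp = requests.post(
    "http://localhost:8081/models",
    params={
        "model_name": "my_ensemble",
        "ensemble": json.dumps(["M1", "M2", "M3"]),  # proposed pipeline definition
        "batch_size": 4,
        "max_batch_delay": 100,   # milliseconds
        "response_timeout": 120,  # seconds
    },
)
print(resp.status_code, resp.text)
```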

Inference API

  1. The frontend checks that all the models specified in the pipeline configuration are registered and have at least one active worker.
  2. The frontend then orchestrates requests/responses across the models in the pipeline.

    Example: Inferencing with an ensemble model

    M1, M2, M3 are pre-registered models.

    [Diagram: pipeline M1 → M2 → M3]
    • A simple JSON config representing this pipeline: {"ensemble": ["M1", "M2", "M3"]}
    • The frontend is responsible for the following (see the client-side sketch after this example):
      1. Collecting the input from the user's request (to the ensemble model) and initiating a request to M1.
      2. Collecting the output from M1 and initiating a request to M2.
      3. Collecting the output from M2 and initiating a request to M3.
      4. Collecting the output from M3 and sending it back as the response to the end user.
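
For illustration, the orchestration above is conceptually equivalent to the following client-side chaining over the inference API. In the proposal this chaining happens inside the frontend; the model names, port, and payload here are placeholders.

```python
# Client-side illustration of the chaining the frontend would perform for
# {"ensemble": ["M1", "M2", "M3"]}; model names, port, and payload are placeholders.
import requests

def ensemble_predict(pipeline, payload):
    data = payload
    for model_name in pipeline:
        resp = requests.post(f"http://localhost:8080/predictions/{model_name}", data=data)
        resp.raise_for_status()
        data = resp.content  # output of one model becomes input to the next
    return data

with open("input.jpg", "rb") as f:
    result = ensemble_predict(["M1", "M2", "M3"], f.read())
```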

Scale Up API

  1. Not applicable / ignored, as the ensemble model will not have any workers.

Scale Down API

  1. Not applicable / ignored, as the ensemble model will not have any workers.

Describe Model API

  1. Returns information about the ensemble model.

Unregister API

  1. Removes the ensemble model

Points to Discuss

| Topic | Suggestion |
| --- | --- |
| Should the ensemble model control the working/updating of the models in the pipeline? For example: auto registration of models if not already present, auto scale up of models, blocking scale down of models, blocking unregistration of models. | No, the ensemble model should not interfere with the working/updating of the models. Models in the pipeline will be controlled by their individual management APIs. If models in the pipeline are unregistered or scaled down to 0, an inference request on the ensemble model will error out with an appropriate message in the response. |
| Should we support responding with outputs from intermediate models? | Yes, though this should be an optional feature. |
| Should we support nesting of ensemble models? | Yes, this could be a secondary feature based on the effort required. |
| Should we support more complex pipeline structures, such as parallel processing of M2 and M3, or multiple outputs? | We could take this up in Phase II, as this will significantly increase the complexity and effort required for this feature. |

prashantsail commented 4 years ago

@chauhang @dhanainme @maaquib - This is the approach we are thinking of. Let us know your thoughts.

chauhang commented 4 years ago

Thanks @prashantsail for putting this together.

  1. Can you also describe how the support for different pipelining options will be handled in future releases? Are you thinking of having something like AWS Step Functions workflows?
  2. How will the batching get handled for the Ensemble models?
  3. What provision will be there for debugging the ensemble pipeline?
  4. Are we going to add new metrics for ensemble case?
harshbafna commented 4 years ago

@chauhang Please see the comments below

  1. Can you also describe how the support for different pipelining options will be handled in future releases? Are you thinking of having something like AWS Step Functions workflows?

The user will define a pipeline/workflow while registering the ensemble model. We are considering the following two approaches for how an end user interfaces with TorchServe to register an ensemble model. In either case, the ensemble model will be registered as a logical model.

Approach A :

Use the existing model registration API with a new parameter named ensemble, which takes JSON data defining the pipeline. The registration API will ignore all other existing parameters except the model name when the ensemble parameter is supplied.

e.g. POST /models?model_name=xyz&ensemble=["M1:v1", "M2:v2", "M3:v3"]

Pros :

Approach B :

Add a new set of APIs to define a workflow, taking model names and a flow definition as input (a client-side sketch of these calls appears after the pros/cons below).

E.g.
POST /models/workflow?workflow_name=<name>&flow=[m1:v1 -> m2:v2 -> m3 ...]
DELETE /models/workflow/<workflow_name>
GET /models/workflow

Pros :

Cons :
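
For illustration, a client-side sketch of how the Approach B endpoints might be called. The paths, parameter names, and flow syntax are taken from the proposal above and do not exist in TorchServe; the management port is assumed.

```python
# Hypothetical calls against the proposed workflow management endpoints
# (management port 8081 assumed); none of these endpoints exist yet.
import requests

BASE = "http://localhost:8081"

# Register a workflow as a chain of already-registered model versions.
requests.post(
    f"{BASE}/models/workflow",
    params={"workflow_name": "wf1", "flow": "[m1:v1 -> m2:v2 -> m3:v1]"},
)

# List registered workflows.
print(requests.get(f"{BASE}/models/workflow").text)

# Remove a workflow.
requests.delete(f"{BASE}/models/workflow/wf1")
```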

  1. How will the batching get handled for the Ensemble models?

Again, there can be two approaches here:

Approach A: Supply batch_size while registering the ensemble model and update the batch size of every model in the pipeline before running inference, to ensure that every model uses the same batch size.

Approach B: The ensemble model does not support batching itself and instead depends on the batch size configured per model in the pipeline.

  1. What provision will be there for debugging the ensemble pipeline?

In case of failure, the inference response will include the name of the model in the pipeline where inference failed and the related error. We will also see if we can return the intermediate output of the last model that completed inference successfully. Additionally, we will enhance the logs to indicate whether an API call was for a regular model or an ensemble model.

  1. Are we going to add new metrics for the ensemble case?

The pipeline will be treated as a logical model and the existing metrics mechanism will be reused: it will return metric data for the ensemble model as a whole, as well as for the individual models in the pipeline. Note: the flow will be broken into sequential inference jobs by TorchServe.
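
For example, these per-model series would then be visible through the existing metrics endpoint alongside the regular models (port 8082 is TorchServe's default metrics port; the ensemble and model names below are illustrative):

```python
# Scrape TorchServe's Prometheus-format metrics endpoint and keep only the series
# that mention the ensemble or one of its member models (names are illustrative).
import requests

metrics = requests.get("http://localhost:8082/metrics").text
for line in metrics.splitlines():
    if "my_ensemble" in line or "M1" in line:
        print(line)
```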

harshbafna commented 4 years ago

@chauhang Please ignore the previous comment. We had an internal discussion today, there may be a different way to approach this. We will update this ticket in a day or two.

dhaniram-kshirsagar commented 4 years ago

@chauhang @lokeshgupta1975 @dhanainme @maaquib Based on the internal discussion, here are the updated approaches for ensemble support in TorchServe

Scope

Ensembles of models will support parallel inferencing with multiple models, followed by an ensemble function (expressed in post-processing code) to return the inferred output.

[Diagram: in-scope ensemble flow]

Out of scope

Design considerations

Proposed Approach(s)

We are proposing the following design approaches to support ensemble in TorchServe.

Approach-1 (Recommended approach)

In this approach, ensemble model orchestration will be handled within a system-provided (default) ensemble handler. This is our recommended approach.

[Diagrams: design, registration flow, and inference flow for this approach]
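
A minimal sketch of one way such a default ensemble handler could work, assuming a purely sequential pipeline whose member models are packaged as TorchScript files inside the ensemble archive. The class name, file names, and JSON input format are illustrative assumptions, not the design described in the diagrams above.

```python
# Illustrative only: a default-handler-style class that runs a fixed pipeline of
# TorchScript models inside one worker. File names and input format are assumed.
import json
import os

import torch


class EnsembleHandler:
    def __init__(self):
        self.models = []
        self.initialized = False

    def initialize(self, context):
        # Load every model of the pipeline from the (assumed) archive directory.
        model_dir = context.system_properties.get("model_dir")
        for file_name in ("m1.pt", "m2.pt", "m3.pt"):  # assumed packaging
            self.models.append(torch.jit.load(os.path.join(model_dir, file_name)))
        self.initialized = True

    def handle(self, data, context):
        # Assume each request body is a JSON-encoded list of floats.
        batch = torch.tensor([json.loads(row.get("body")) for row in data])
        out = batch
        for model in self.models:
            out = model(out)  # chain M1 -> M2 -> M3
        return out.tolist()
```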

Approach-2

In this approach, the TorchServe frontend layer will act as the orchestrator for ensemble model lifecycle management and inference. It will use the existing handler framework for loading the models present in the ensemble.

[Diagrams: design, registration flow, and inference flow for this approach]

lxning commented 4 years ago

  1. Should the ensemble model require that each internal model's batch size be 1 if the client side batches multiple requests together and sends them to the ensemble model?
  2. Can ensemble model requests be mixed with regular requests for the same model in each model's request batch?
harshbafna commented 4 years ago

@lxning

  1. Should the ensemble model require that each internal model's batch size be 1 if the client side batches multiple requests together and sends them to the ensemble model?

Batching for any model is done at TorchServe's frontend layer. Every internal model of the ensemble will use the same batch size that is specified for the ensemble model at registration time.

2. Can ensemble model requests be mixed with regular requests for the same model in each model's request batch?

The workers assigned to the ensemble model will be independent of the regular workers and will only serve ensemble model requests.

Any model used in the ensemble can have its own independent workers for serving regular inference requests.

dhaniram-kshirsagar commented 3 years ago

As discussed with @chauhang @lokeshgupta1975 @dhanainme @maaquib @harshbafna, here is the final plan for adding workflow support to TorchServe.

Goal - Support ensembles of models.

Assumptions/notes

Based on the above design constraints/notes, we will introduce a new component for creating workflow archives, in addition to internal modules such as WFManager and WFExecutor.

Workflow Archiver [WAR] - This will be an independent CLI utility, similar to the TorchServe model archiver [MAR] utility. The main purpose of this component is to create a WAR file from a supplied workflow specification [YAML] file, .py files [or Python modules], and a requirements.txt file.

- _workflow specification [yaml]_ - A sample workflow specification defining the models in the ensemble and the DAG connecting them:

```yaml
models:
    # global model params
    min-workers: 1
    max-workers: 4
    batch-size: 8

    m1:
       url: model1.mar   # local or S3 path
       min-workers: 1    # override the global params
       max-workers: 2
       batch-size: 4

    m2:
       url: model2.mar

    m3:
       url: model3.mar
       batch-size: 2

    m4:
       url: model4.mar

dag:  # can have only one start node and one end node
  pre_processing: [m1, m3]
  m1: [m2, m4]
  m2: [post_processing]
  m4: [post_processing]
  m3: [post_processing]
```

- _workflow handler [.py]_ - A sample handler module with the pre/post-processing and ensemble-reducer entry points (a DAG execution sketch follows this list):

```python
def pre_processing(data, context): pass    # prepare the raw request input for the first models

def post_processing(data, context): pass   # shape the final output returned to the client

def ensemble_reducer(data, context): pass  # combine the outputs of parallel branches into one result
```


- _requirements.txt_ - This is to support custom Python packages required by the workflow handler [if any]. This is an optional file.
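
For illustration only, a minimal sketch of how the `dag` section above could be resolved into execution stages, where nodes in the same stage can run in parallel. The dict literal simply mirrors the sample YAML; this is not the WFExecutor design, just a way to visualize the ordering constraints.

```python
# Minimal topological-staging sketch for the sample DAG above; purely illustrative.
from collections import defaultdict

dag = {
    "pre_processing": ["m1", "m3"],
    "m1": ["m2", "m4"],
    "m2": ["post_processing"],
    "m4": ["post_processing"],
    "m3": ["post_processing"],
}

# Count incoming edges for every node.
indegree = defaultdict(int)
nodes = set(dag) | {n for targets in dag.values() for n in targets}
for targets in dag.values():
    for n in targets:
        indegree[n] += 1

# Peel off nodes with no remaining dependencies, stage by stage.
ready = [n for n in nodes if indegree[n] == 0]  # single start node: pre_processing
stage = 0
while ready:
    print(f"stage {stage}: {sorted(ready)}")
    next_ready = []
    for node in ready:
        for child in dag.get(node, []):
            indegree[child] -= 1
            if indegree[child] == 0:
                next_ready.append(child)
    ready = next_ready
    stage += 1

# Expected output:
# stage 0: ['pre_processing']
# stage 1: ['m1', 'm3']
# stage 2: ['m2', 'm4']
# stage 3: ['post_processing']
```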

**High-level flow for workflow APIs**

![high level design - workflow-apis](https://user-images.githubusercontent.com/26479924/97465555-2dc35000-1968-11eb-8c36-76b7b6db8ece.png)

**High-level components view**

![high level design - workflow-High level design](https://user-images.githubusercontent.com/26479924/97467934-b93de080-196a-11eb-895b-d28240791387.png)
harshbafna commented 3 years ago

Different ensemble scenarios to be covered through workflows described above: https://docs.google.com/spreadsheets/d/1x_Rj5xczANznVRJBaMrU0Wkhy-z-uWasrS4bzkQ7s3c/edit?ts=5f882aa5#gid=0