pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

Support for ensemble of models served in torchserve #682

Closed: prashantsail closed this issue 3 years ago

prashantsail commented 4 years ago

Is your feature request related to a problem? Please describe.

Describe the solution

prashantsail commented 4 years ago

Support for Ensemble Models in TorchServe

What are ensemble models?

- An ensemble model represents a pipeline of models and the flow of data between them.
- This can reduce the overhead of transferring intermediate tensors and minimize the number of requests that must be sent to TorchServe.

Approach

  1. An ensemble will be a logical model representing a pipeline of the required models.
  2. From an end user's perspective:
    • A single ensemble model is registered.
    • Synchronous inference requests are executed and the corresponding responses returned.
    • The pipeline of models, along with the intermediate tensors, is a black box.
  3. Internally, though, the frontend layer orchestrates requests/responses to the models in the pipeline.
  4. The ensemble model will not interfere with the configuration/updating of the models in the pipeline.

Impact on existing frontend APIs

Registration API

  1. All models in the ensemble have to be independently registered first.
  2. All models in the ensemble have to be scaled up to at least 1 worker.
  3. Since the ensemble is a logical model, the registration API parameters behave slightly differently (a registration sketch follows this list)
    1. ensemble: (New) A JSON array representing the pipeline of models
    2. model_name: Name of the model
    3. batch_size: Inference batch size
    4. max_batch_delay: Maximum delay for batch aggregation
    5. response_timeout: Time within which the inference response is expected
    6. handler: Ignored, as all orchestration will be in the frontend layer
    7. url: Ignored, as no .mar file is needed
    8. initial_workers: Ignored, as this model will not have any workers
    9. synchronous: Ignored, as this model will not have any workers
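
For illustration, a hedged sketch of what registering an ensemble could look like if the proposed `ensemble` parameter were added to the existing registration API. The parameter name, pipeline format, and port are assumptions taken from this proposal, not an implemented API.

```python
# Hypothetical registration of an ensemble model via the management API (port 8081),
# assuming the proposed `ensemble` parameter is accepted. M1, M2, M3 must already be
# registered and scaled to at least one worker.
import json
import requests

resp = requests.post(
    "http://localhost:8081/models",
    params={
        "model_name": "my_ensemble",
        "ensemble": json.dumps(["M1", "M2", "M3"]),  # proposed pipeline definition
        "batch_size": 4,
        "max_batch_delay": 100,   # milliseconds
        "response_timeout": 120,  # seconds
    },
)
print(resp.status_code, resp.text)
```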

Inference API

  1. The frontend checks that all the models specified in the pipeline configuration are registered and have at least one active worker.
  2. The frontend then orchestrates requests/responses across the models in the pipeline.

    Example: Inferencing with an ensemble model

    M1, M2, M3 are pre-registered models.

    [Diagram: pipeline M1 → M2 → M3]
    • A simple JSON config representing this pipeline: {"ensemble": ["M1", "M2", "M3"]}
    • The frontend is responsible for the following (see the client-side sketch after this example):
      1. Collecting the input from the user's request (to the ensemble model) and initiating a request to M1.
      2. Collecting the output from M1 and initiating a request to M2.
      3. Collecting the output from M2 and initiating a request to M3.
      4. Collecting the output from M3 and sending it back as the response to the end user.
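
For illustration, the orchestration above is conceptually equivalent to the following client-side chaining over the inference API. In the proposal this chaining happens inside the frontend; the model names, port, and payload here are placeholders.

```python
# Client-side illustration of the chaining the frontend would perform for
# {"ensemble": ["M1", "M2", "M3"]}; model names, port, and payload are placeholders.
import requests

def ensemble_predict(pipeline, payload):
    data = payload
    for model_name in pipeline:
        resp = requests.post(f"http://localhost:8080/predictions/{model_name}", data=data)
        resp.raise_for_status()
        data = resp.content  # output of one model becomes input to the next
    return data

with open("input.jpg", "rb") as f:
    result = ensemble_predict(["M1", "M2", "M3"], f.read())
```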

Scale Up API

  1. Not applicable / ignored, as the ensemble model will not have any workers.

Scale Down API

  1. Not applicable / ignored, as the ensemble model will not have any workers.

Describe Model API

  1. Returns information about the ensemble model.

Unregister API

  1. Removes the ensemble model

Points to Discuss

| Topic | Suggestion |
| --- | --- |
| Should the ensemble model control the working/updating of the models in the pipeline? For example: auto registration of models if not already present, auto scale up of models, blocking scale down of models, blocking unregistration of models. | No, the ensemble model should not interfere with the working/updating of the models. Models in the pipeline will be controlled by their individual management APIs. If models in the pipeline are unregistered or scaled down to 0, an inference request on the ensemble model will error out with an appropriate message in the response. |
| Should we support responding with outputs from intermediate models? | Yes, though this should be an optional feature. |
| Should we support nesting of ensemble models? | Yes, this could be a secondary feature based on the effort required. |
| Should we support more complex pipeline structures, such as parallel processing of M2 and M3, or multiple outputs? | We could take this up in Phase II, as this will significantly increase the complexity and effort required for this feature. |

prashantsail commented 4 years ago

@chauhang @dhanainme @maaquib - This is the approach we are thinking of. Let us know your thoughts.

chauhang commented 4 years ago

Thanks @prashantsail for putting this together.

  1. Can you also describe how the support for different pipelining options will be handled in future releases? Are you thinking of having something like AWS Step Functions workflows?
  2. How will the batching get handled for the Ensemble models?
  3. What provision will be there for debugging the ensemble pipeline?
  4. Are we going to add new metrics for ensemble case?
harshbafna commented 4 years ago

@chauhang Please see the comments below

  1. Can you also describe how the support for different pipelining options will be handled in future releases? Are you thinking of having something like AWS Step Functions workflows?

The user will define a pipeline/workflow while registering the ensemble model. We are considering the following two approaches for how an end user interfaces with TorchServe to register an ensemble model. In either case, the ensemble model will be registered as a logical model.

Approach A :

Use the existing model registration API with a new parameter named ensemble, which takes JSON data defining the pipeline. The registration API will ignore all other existing parameters except the model name when the ensemble parameter is supplied.

e.g. POST /models?model_name=xyz&ensemble=["M1:v1", "M2:v2", "M3:v3"]

Pros :

Approach B :

Add a new set of APIs to define a workflow, taking model names and a flow definition as input (a client-side sketch of these calls appears after the pros/cons below).

E.g.
POST /models/workflow?workflow_name=<name>&flow=[m1:v1 -> m2:v2 -> m3 ...]
DELETE /models/workflow/<workflow_name>
GET /models/workflow

Pros :

Cons :
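
For illustration, a client-side sketch of how the Approach B endpoints might be called. The paths, parameter names, and flow syntax are taken from the proposal above and do not exist in TorchServe; the management port is assumed.

```python
# Hypothetical calls against the proposed workflow management endpoints
# (management port 8081 assumed); none of these endpoints exist yet.
import requests

BASE = "http://localhost:8081"

# Register a workflow as a chain of already-registered model versions.
requests.post(
    f"{BASE}/models/workflow",
    params={"workflow_name": "wf1", "flow": "[m1:v1 -> m2:v2 -> m3:v1]"},
)

# List registered workflows.
print(requests.get(f"{BASE}/models/workflow").text)

# Remove a workflow.
requests.delete(f"{BASE}/models/workflow/wf1")
```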

  1. How will the batching get handled for the Ensemble models?

Again, there can be two approaches here:

Approach A: Supply batch_size while registering the ensemble model and update the batch size of every model in the pipeline before running inference, to ensure that every model uses the same batch size.

Approach B: The ensemble model does not support batching itself and instead depends on the batch size configured per model in the pipeline.

  1. What provision will be there for debugging the ensemble pipeline?

In case of failure, the inference response will include the name of the model in the pipeline where inference failed and the related error. We will also see if we can return the intermediate output of the last model that completed inference successfully. Additionally, we will enhance the logs to indicate whether an API call was for a regular model or an ensemble model.

  1. Are we going to add new metrics for the ensemble case?

The pipeline will be treated as a logical model and the existing metrics mechanism will be reused: it will return metric data for the ensemble model as a whole, as well as for the individual models in the pipeline. Note: the flow will be broken into sequential inference jobs by TorchServe.
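
For example, these per-model series would then be visible through the existing metrics endpoint alongside the regular models (port 8082 is TorchServe's default metrics port; the ensemble and model names below are illustrative):

```python
# Scrape TorchServe's Prometheus-format metrics endpoint and keep only the series
# that mention the ensemble or one of its member models (names are illustrative).
import requests

metrics = requests.get("http://localhost:8082/metrics").text
for line in metrics.splitlines():
    if "my_ensemble" in line or "M1" in line:
        print(line)
```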

harshbafna commented 4 years ago

@chauhang Please ignore the previous comment. We had an internal discussion today, there may be a different way to approach this. We will update this ticket in a day or two.

dhaniram-kshirsagar commented 4 years ago

@chauhang @lokeshgupta1975 @dhanainme @maaquib Based on the internal discussion, here are the updated approaches for ensemble support in TorchServe

Scope

Ensembles of models will support parallel inferencing with multiple models, followed by an ensemble function (expressed in post-processing code) to return the inferred output.

[Diagram: in-scope ensemble flow]

Out of scope

Design considerations

Proposed Approach(s)

We are proposing the following design approaches to support ensemble in TorchServe.

Approach-1 (Recommended approach)

In this approach, ensemble model orchestration will be handled within a system-provided (default) ensemble handler. This is our recommended approach.

[Diagrams: design, registration flow, and inference flow for this approach]
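
A minimal sketch of one way such a default ensemble handler could work, assuming a purely sequential pipeline whose member models are packaged as TorchScript files inside the ensemble archive. The class name, file names, and JSON input format are illustrative assumptions, not the design described in the diagrams above.

```python
# Illustrative only: a default-handler-style class that runs a fixed pipeline of
# TorchScript models inside one worker. File names and input format are assumed.
import json
import os

import torch


class EnsembleHandler:
    def __init__(self):
        self.models = []
        self.initialized = False

    def initialize(self, context):
        # Load every model of the pipeline from the (assumed) archive directory.
        model_dir = context.system_properties.get("model_dir")
        for file_name in ("m1.pt", "m2.pt", "m3.pt"):  # assumed packaging
            self.models.append(torch.jit.load(os.path.join(model_dir, file_name)))
        self.initialized = True

    def handle(self, data, context):
        # Assume each request body is a JSON-encoded list of floats.
        batch = torch.tensor([json.loads(row.get("body")) for row in data])
        out = batch
        for model in self.models:
            out = model(out)  # chain M1 -> M2 -> M3
        return out.tolist()
```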

Approach-2

In this approach, the TorchServe frontend layer will act as the orchestrator for ensemble model lifecycle management and inference. It will use the existing handler framework for loading the models present in the ensemble.

[Diagrams: design, registration flow, and inference flow for this approach]

lxning commented 4 years ago

  1. Should the ensemble model require that each internal model's batch size be 1 if the client side batches multiple requests together and sends them to the ensemble model?
  2. Can ensemble model requests be mixed with regular requests for the same model in each model's request batch?
harshbafna commented 4 years ago

@lxning

  1. Should the ensemble model require that each internal model's batch size be 1 if the client side batches multiple requests together and sends them to the ensemble model?

Batching for any model is done at TorchServe's frontend layer. Every internal model of the ensemble will use the same batch size that is specified for the ensemble model at registration time.

2. Can ensemble model requests be mixed with regular requests for the same model in each model's request batch?

The workers assigned to the ensemble model will be independent of the regular workers and will only serve ensemble model requests.

Any model used in the ensemble can have its own independent workers for serving regular inference requests.

dhaniram-kshirsagar commented 3 years ago

As discussed with @chauhang @lokeshgupta1975 @dhanainme @maaquib @harshbafna, here is the final plan for adding workflow support to TorchServe.

Goal - Support ensembles of models.

Assumptions/notes

Based on the above design constraints/notes, we will introduce a new component for creating workflow archives, in addition to internal modules such as WFManager and WFExecutor.

Workflow Archiver [WAR] - This will be an independent CLI utility, similar to the TorchServe model archiver [MAR] utility. The main purpose of this component is to create a WAR file from a supplied workflow specification [YAML] file, .py files [or Python modules], and a requirements.txt file.

- _workflow specification [yaml]_ - A sample workflow specification defining the models in the ensemble and the DAG connecting them:

```yaml
models:
    # global model params
    min-workers: 1
    max-workers: 4
    batch-size: 8

    m1:
       url: model1.mar   # local or S3 path
       min-workers: 1    # override the global params
       max-workers: 2
       batch-size: 4

    m2:
       url: model2.mar

    m3:
       url: model3.mar
       batch-size: 2

    m4:
       url: model4.mar

dag:  # can have only one start node and one end node
  pre_processing: [m1, m3]
  m1: [m2, m4]
  m2: [post_processing]
  m4: [post_processing]
  m3: [post_processing]
```

- _workflow handler [.py]_ - A sample handler module with the pre/post-processing and ensemble-reducer entry points (a DAG execution sketch follows this list):

```python
def pre_processing(data, context): pass    # prepare the raw request input for the first models

def post_processing(data, context): pass   # shape the final output returned to the client

def ensemble_reducer(data, context): pass  # combine the outputs of parallel branches into one result
```


- _requirements.txt_ - This is to support custom Python packages required by the workflow handler [if any]. This is an optional file.
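
For illustration only, a minimal sketch of how the `dag` section above could be resolved into execution stages, where nodes in the same stage can run in parallel. The dict literal simply mirrors the sample YAML; this is not the WFExecutor design, just a way to visualize the ordering constraints.

```python
# Minimal topological-staging sketch for the sample DAG above; purely illustrative.
from collections import defaultdict

dag = {
    "pre_processing": ["m1", "m3"],
    "m1": ["m2", "m4"],
    "m2": ["post_processing"],
    "m4": ["post_processing"],
    "m3": ["post_processing"],
}

# Count incoming edges for every node.
indegree = defaultdict(int)
nodes = set(dag) | {n for targets in dag.values() for n in targets}
for targets in dag.values():
    for n in targets:
        indegree[n] += 1

# Peel off nodes with no remaining dependencies, stage by stage.
ready = [n for n in nodes if indegree[n] == 0]  # single start node: pre_processing
stage = 0
while ready:
    print(f"stage {stage}: {sorted(ready)}")
    next_ready = []
    for node in ready:
        for child in dag.get(node, []):
            indegree[child] -= 1
            if indegree[child] == 0:
                next_ready.append(child)
    ready = next_ready
    stage += 1

# Expected output:
# stage 0: ['pre_processing']
# stage 1: ['m1', 'm3']
# stage 2: ['m2', 'm4']
# stage 3: ['post_processing']
```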

**High-level flow for workflow APIs**

![high level design - workflow-apis](https://user-images.githubusercontent.com/26479924/97465555-2dc35000-1968-11eb-8c36-76b7b6db8ece.png)

**High-level components view**

![high level design - workflow-High level design](https://user-images.githubusercontent.com/26479924/97467934-b93de080-196a-11eb-895b-d28240791387.png)
harshbafna commented 3 years ago

Different ensemble scenarios to be covered through workflows described above: https://docs.google.com/spreadsheets/d/1x_Rj5xczANznVRJBaMrU0Wkhy-z-uWasrS4bzkQ7s3c/edit?ts=5f882aa5#gid=0