Batch predictions to run on GPU

bfirsh commented 2 years ago

It is common when deploying ML models to queue up requests to run them on the GPU all at once to increase throughput.

This is distinct from users being able to run several predictions in one go with the API. We are calling this a "bulk" API to disambiguate. GPU batching is purely to make running predictions more efficient on a GPU, and should be independent of how many predictions a user submits. The model should determine how many predictions is needs to pull of the queue to maximize throughput.

Requirements

Allow the user to define a predict function that can take multiple inputs. A batch prediction is a superset of a single prediction, so we may as well just allow one or the other. It means you get better performance, at the expense of it being harder to implement. "If you want to use batching, adapt your predict function to take a list of inputs."
Queue up HTTP/Redis requests. There is a trade-off between latency and throughput. Most serving systems seem to let the user define a max batch size and a max latency/timeout. But, that will mean predictions will always take at least the timeout unless the queues are saturated. Maybe there's something more clever we can do there?

Potential design

We would probably need to define the input type out of the function signature, unlike a single prediction.

class InputObj(BaseModel):
    image: File
    another_param: float = Input(default=0.26)

class Predictor(BasePredictor):
    def predict(self, inputs: List[InputObj] = Batch(batch_size=25, timeout=200)):
        processed_inputs = map(inputs, preprocess)
        return self.model(processed_inputs)

Maybe batch size and timeout should be configurable at runtime, and these are considered defaults?

Future work

Pre-/post-processing is often done on CPU so could be done separately.

Prior art

User data

@daanelson says this would make language models much faster.

/cc @DeNeutoy

andreasjansson commented 2 years ago

I think this is an elegant API! There's lots of tuning one could do on batch size and timeout, but this is definitely good as an initial version.

DeNeutoy commented 2 years ago

Hi Ben, sorry for the delay!

Here is the predictor class we used in AllenNLP: https://github.com/allenai/allennlp/blob/main/allennlp/predictors/predictor.py

There are a couple of design decisions that are relevant for Cog's batch predict functionality.

How should you split up the data processing from batch prediction (actually running the model on N instances).

In the AllenNLP predictor these are separated out into _json_to_instance and _batch_json_to_instances (which is implemented, and simply calls _json_to_instance). _json_to_instance is the only thing required when a user implements a new predictor.

_batch_json_to_instances is also overridable because there are some models which generate N model inputs from a single json input. In our case, this was a semantic role labelling model that ran a forward pass of the model per verb in a sentence, so for a single sentence, there were multiple model forward passes that could be run in parallel.

How the model separates the batch predictions into single json responses

https://github.com/allenai/allennlp/blob/main/allennlp/models/model.py#L193 Here you can see our logic for trying to separate up a batch of predictions. Sometimes this is easy, but also, the user may want a single response for a batch of inputs - for example, in the semantic role labelling example, users wanted an aggregate response across all of the verb predictions. I can imagine this being the case with some image generation models also (e.g timesteps in diffusion models or something).

Here you can see a concrete implementation of this make_output_human_readable function which by default does nothing:

https://github.com/allenai/allennlp/blob/a6271a319cb92f73e798cf3a33ac49ee18cd7cc5/allennlp/models/simple_tagger.py#L182

I think in the cog case, it may be possible to have the _json_to_instance function inferred from the type reflection, but I would suggest that you provide users a clear route to overriding both the individual _json_to_instance and _json_to_instances functionality as well as a (possibly automated) way for them to check it is implemented properly, e.g by passing their output to their predict function.

cloneofsimo commented 1 year ago

Something to worth noting, during conversation with andreas

Microbatching is not guranteed to always benefit effective throughput . It ultimately depends on memeory-boundedness of the model operations

To actually have this demonstrated more clearly, consider the case where you have a small model that does single matmul with batchsize of 1. In this case, having batchsize of n can be better because you can do operations proportional to number of batch. On the other hand, if you have near-infinite batchsize. your model is going to consume all the time doing compute, and will plateau. However, as you are taking time doing compute your queues will build up, ultimately providing smaller throughput (of course, assuming that autoscaling can kick in to mitigate this)

In short, I think its a really good choice to have built-in batch controller, that detects the "best batchsize" for microbatching pipeline that works regardless of the model or the hardware. Simple algorithm like closed-feedback control, exponential backoff can do this. There is a similar feature at pytorch lightning that can work as a good reference (its done before training to detect the best batchsize for training, but similar context.) (https://lightning.ai/docs/pytorch/stable/advanced/training_tricks.html#batch-size-finder)

References: https://aws.amazon.com/ko/blogs/architecture/batch-inference-at-scale-with-amazon-sagemaker/ https://lightning.ai/docs/pytorch/stable/advanced/training_tricks.html#batch-size-finder https://en.wikipedia.org/wiki/Exponential_backoff

replicate / cog