bfirsh opened this issue 2 years ago
I think this is an elegant API! There's lots of tuning one could do on batch size and timeout, but this is definitely good as an initial version.
Hi Ben, sorry for the delay!
Here is the predictor class we used in AllenNLP: https://github.com/allenai/allennlp/blob/main/allennlp/predictors/predictor.py
There are a couple of design decisions that are relevant for Cog's batch predict functionality.
In the AllenNLP predictor these are separated out into `_json_to_instance` and `_batch_json_to_instances` (which is implemented for you, and simply calls `_json_to_instance` for each input). `_json_to_instance` is the only thing required when a user implements a new predictor.

`_batch_json_to_instances` is also overridable because there are some models which generate N model inputs from a single JSON input. In our case, this was a semantic role labelling model that ran a forward pass of the model per verb in a sentence, so for a single sentence there were multiple model forward passes that could be run in parallel.
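For reference, the split looks roughly like this (paraphrased and heavily simplified from the predictor.py linked above; the real class returns allennlp.data.Instance objects and has many more methods):

```python
# Paraphrased sketch of the AllenNLP Predictor split, not the actual source.
from typing import Any, Dict, List

JsonDict = Dict[str, Any]
Instance = Any  # stand-in for allennlp.data.Instance


class Predictor:
    def _json_to_instance(self, json_dict: JsonDict) -> Instance:
        # The only hook a new predictor must implement: one JSON input
        # becomes one model input.
        raise NotImplementedError

    def _batch_json_to_instances(self, json_dicts: List[JsonDict]) -> List[Instance]:
        # Default behaviour: one Instance per JSON input, delegating to
        # _json_to_instance. Overridable for models where a single JSON
        # input should expand into N model inputs (e.g. one per verb in SRL).
        return [self._json_to_instance(json_dict) for json_dict in json_dicts]
```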
https://github.com/allenai/allennlp/blob/main/allennlp/models/model.py#L193 Here you can see our logic for trying to split a batch of predictions back up into per-input outputs. Sometimes this is easy, but the user may also want a single response for a batch of inputs - for example, in the semantic role labelling case, users wanted an aggregate response across all of the verb predictions. I can imagine this being the case with some image generation models too (e.g. timesteps in diffusion models or something).
Here you can see a concrete implementation of this: the `make_output_human_readable` function, which by default does nothing.
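If I remember the default correctly, it is essentially an identity function that subclasses override to turn raw model outputs into something user-facing - roughly:

```python
# Rough sketch of the default (identity) behaviour; the real method lives on
# allennlp.models.Model and operates on a dict of torch.Tensors.
from typing import Any, Dict


def make_output_human_readable(output_dict: Dict[str, Any]) -> Dict[str, Any]:
    # Default: return the model outputs unchanged. Models override this to
    # decode raw tensors into labels, strings, spans, etc.
    return output_dict
```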
I think in the Cog case it may be possible to have the `_json_to_instance` function inferred from type reflection, but I would suggest that you provide users a clear route to overriding both the individual `_json_to_instance` and the batch `_batch_json_to_instances` functionality, as well as a (possibly automated) way for them to check it is implemented properly, e.g. by passing its output to their predict function.
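To make that concrete, here is a hypothetical shape such hooks could take in Cog - none of these method names exist in Cog today, they are just an illustration of the two override points plus an automated check:

```python
# Hypothetical sketch only: Cog does not currently expose these hooks.
from typing import Any, Dict, List

from cog import BasePredictor


class MyPredictor(BasePredictor):
    def json_to_input(self, json_dict: Dict[str, Any]) -> Any:
        # Single-input conversion; Cog could infer a default from the
        # predict() type annotations.
        return json_dict["text"]

    def batch_json_to_inputs(self, json_dicts: List[Dict[str, Any]]) -> List[Any]:
        # Batch conversion; the default delegates to json_to_input, but a
        # model could override it to fan one request out into many inputs.
        return [self.json_to_input(d) for d in json_dicts]

    def predict(self, text: str) -> str:
        return text.upper()


def check_batch_hooks(predictor: MyPredictor, examples: List[Dict[str, Any]]) -> None:
    # Automated sanity check: every converted input must be accepted by predict().
    for model_input in predictor.batch_json_to_inputs(examples):
        predictor.predict(model_input)
```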
Something worth noting from a conversation with andreas: batching is not guaranteed to always improve effective throughput. It ultimately depends on the memory-boundedness of the model's operations. To demonstrate this more clearly, consider a small model that does a single matmul with a batch size of 1. In this case, a batch size of n can be better, because while the kernel is memory-bound you can do work proportional to the batch size at little extra cost. On the other hand, with a near-infinite batch size the model spends all its time doing compute and throughput plateaus; meanwhile, because each batch takes longer, queues build up, ultimately reducing effective throughput (though of course autoscaling can kick in to mitigate this).

In short, I think it's a really good choice to have a built-in batch controller that detects the "best batch size" for the micro-batching pipeline, so that it works regardless of the model or the hardware. A simple algorithm like closed-loop feedback control or exponential backoff can do this. There is a similar feature in PyTorch Lightning that can work as a good reference (it runs before training to detect the best batch size for training, but it's a similar idea): https://lightning.ai/docs/pytorch/stable/advanced/training_tricks.html#batch-size-finder
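A minimal sketch of what such a controller could do (the names and thresholds below are made up; a real implementation would measure throughput online and handle out-of-memory recovery more carefully):

```python
# Minimal sketch of a batch-size finder: grow the batch size exponentially
# while measured throughput keeps improving, and back off when it regresses
# or the model runs out of memory. All names and thresholds are illustrative.
import time
from typing import Callable


def find_best_batch_size(
    run_batch: Callable[[int], None],  # runs one forward pass at the given batch size
    max_batch_size: int = 256,
    min_gain: float = 1.05,  # require >=5% throughput gain to keep growing
) -> int:
    best_size, best_throughput = 1, 0.0
    size = 1
    while size <= max_batch_size:
        try:
            start = time.perf_counter()
            run_batch(size)
            throughput = size / (time.perf_counter() - start)
        except RuntimeError:  # e.g. CUDA out of memory
            break
        if throughput > best_throughput * min_gain:
            best_size, best_throughput = size, throughput
            size *= 2  # exponential growth; otherwise back off to the last good size
        else:
            break
    return best_size
```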
References:
https://aws.amazon.com/ko/blogs/architecture/batch-inference-at-scale-with-amazon-sagemaker/
https://lightning.ai/docs/pytorch/stable/advanced/training_tricks.html#batch-size-finder
https://en.wikipedia.org/wiki/Exponential_backoff
It is common when deploying ML models to queue up requests and run them on the GPU all at once to increase throughput.

This is distinct from users being able to run several predictions in one go with the API. We are calling that a "bulk" API to disambiguate. GPU batching is purely to make running predictions more efficient on a GPU, and should be independent of how many predictions a user submits. The model should determine how many predictions it needs to pull off the queue to maximize throughput.
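As an illustration of the mechanism (not a proposed Cog implementation), a server-side micro-batching loop typically collects queued requests up to a maximum batch size or until a timeout expires, then runs a single forward pass:

```python
# Illustrative micro-batching loop, not Cog's implementation: collect queued
# requests up to MAX_BATCH_SIZE or until BATCH_TIMEOUT_S expires, then run
# one GPU forward pass for the whole batch.
import queue
import time
from typing import Any, Callable, List

MAX_BATCH_SIZE = 8
BATCH_TIMEOUT_S = 0.1


def batching_loop(
    requests: "queue.Queue[Any]",
    predict_batch: Callable[[List[Any]], List[Any]],
) -> None:
    while True:
        batch = [requests.get()]  # block until at least one request arrives
        deadline = time.monotonic() + BATCH_TIMEOUT_S
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(requests.get(timeout=remaining))
            except queue.Empty:
                break
        results = predict_batch(batch)  # one forward pass for the whole batch
        # ...dispatch each result in `results` back to its original request
```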
Requirements
Potential design
We would probably need to define the input type outside of the function signature, unlike a single prediction.
Maybe batch size and timeout should be configurable at runtime, and these are considered defaults?
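For example (a purely hypothetical sketch - none of these attribute or method names exist in Cog today), the design could look something like this, with the input type declared outside the function signature and batch size / timeout as overridable defaults:

```python
# Hypothetical API sketch only; not part of Cog.
from typing import List

from cog import BasePredictor


class Predictor(BasePredictor):
    batch_input_type = str          # declared outside the predict signature
    default_batch_size = 8          # defaults; could be overridden at runtime
    default_batch_timeout_s = 0.1

    def predict_batch(self, inputs: List[str]) -> List[str]:
        # One call handles a whole micro-batch pulled off the queue.
        return [text.upper() for text in inputs]
```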
Future work
Prior art
User data
/cc @DeNeutoy