Hi, thanks for the fantastic framework; it helps me greatly in my projects. I want to ask whether this framework supports micro batching like BentoML's (https://docs.bentoml.org/en/0.13-lts/guides/micro_batching.html). I think it improves the server side a lot.
Thanks a lot for your interest!
That's a pity; micro batching has not been supported yet. We agree that clip-as-service would also benefit from micro-batching. The biggest challenge here is to develop a scheduling algorithm to balance the maximum throughput and latency. We need to dig into this algorithm to achieve decent performance.
BTW, may I know the target throughput you expect in your project? Maybe we have another way to meet your case.
@numb3r3 Thanks for replying to me quickly.
The biggest challenge here is to develop a scheduling algorithm to balance the maximum throughput and latency

In this aspect, I think we just need options such as max_latency and max_batch_size, then let users tune those parameters to balance their latency and throughput. We can begin with a simple scheduling algorithm assuming that the number of requests sent to the server is stable.
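To make that idea concrete, here is a minimal sketch of such a scheduler (this is not Jina's or clip-as-service's actual API; `MicroBatcher`, `model_fn`, and the parameter names are purely illustrative): a batch is flushed either when max_batch_size requests have accumulated or when the first request has waited max_latency seconds, whichever comes first.

```python
import asyncio
import time


class MicroBatcher:
    """Collect incoming requests and flush them as one batch when either
    max_batch_size items have accumulated or the oldest item has waited
    max_latency seconds, whichever comes first."""

    def __init__(self, model_fn, max_batch_size=8, max_latency=0.01):
        self.model_fn = model_fn            # callable that runs inference on a list of inputs
        self.max_batch_size = max_batch_size
        self.max_latency = max_latency
        self.queue = asyncio.Queue()

    async def submit(self, item):
        # Every caller gets its own future, resolved with its own result.
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def run(self):
        while True:
            # Block until the first request arrives, then open a batch window.
            item, fut = await self.queue.get()
            items, futures = [item], [fut]
            deadline = time.monotonic() + self.max_latency
            while len(items) < self.max_batch_size:
                timeout = deadline - time.monotonic()
                if timeout <= 0:
                    break
                try:
                    item, fut = await asyncio.wait_for(self.queue.get(), timeout)
                except asyncio.TimeoutError:
                    break
                items.append(item)
                futures.append(fut)
            # One batched forward pass serves the whole group of requests.
            for fut, result in zip(futures, self.model_fn(items)):
                fut.set_result(result)
```

A server would start `asyncio.create_task(batcher.run())` once at startup and then `await batcher.submit(x)` for each incoming request; tuning max_latency and max_batch_size trades tail latency against throughput.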
BTW, may I know the target throughput you expect in your project? Maybe we have another way to meet your case.
In my project, I have a model that uses only 5 GB of GPU VRAM with batch_size <= 8. I also tried to simulate a situation where the number of requests is much larger than it would be in reality, to check whether GPU VRAM usage grows. With batch_size = 8, the model only takes about 1.5-2x the inference time of batch_size = 1, without any increase in GPU VRAM.
Yes, I agree that a simple scheduling algorithm would be a good starting point. We will think seriously about how to proceed. Regarding your use case, I cannot see the bottleneck; maybe I have misunderstood something.
@numb3r3 In our case, we start with 1 worker (1 replica), so our server processes requests sequentially; that is why we need micro batching. I see in your code that you use async for the encode function, but I'm not sure whether deep learning models such as PyTorch and TensorFlow can run async?
Yes, async works well with deep learning models. We performed some experiments to confirm that async can benefit PyTorch/TensorFlow. And feel free to use replicas, since they can share a single GPU.
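For anyone landing here later, one common pattern for calling a blocking framework like PyTorch from async code looks like the sketch below (this is not necessarily how clip-as-service implements its encode function; the model and shapes are placeholders): the synchronous forward pass is off-loaded to a thread so the event loop stays free to accept and queue new requests.

```python
import asyncio

import torch

# Stand-in for a real PyTorch model; any blocking framework call works the same way.
model = torch.nn.Linear(512, 10).eval()


def blocking_infer(batch: torch.Tensor) -> torch.Tensor:
    # The forward pass itself is synchronous; PyTorch releases the GIL inside
    # most of its C++ ops, so running it in a worker thread keeps the loop free.
    with torch.no_grad():
        return model(batch)


async def encode(batch: torch.Tensor) -> torch.Tensor:
    # Off-load the blocking call to the default thread-pool executor so the
    # event loop can keep accepting and queueing new requests meanwhile.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, blocking_infer, batch)


async def main():
    # Two concurrent "requests"; the loop stays responsive while they run.
    results = await asyncio.gather(
        encode(torch.randn(8, 512)),
        encode(torch.randn(8, 512)),
    )
    print([r.shape for r in results])


asyncio.run(main())
```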
@numb3r3 Thanks. I will try to play with async for my deep learning model to see the performance boost :D. Anyway, if Jina can support both batching and async, then we can maximize GPU utilization.
We will close this issue for now. If you have some findings, you are welcome to share them with the community. Thanks!