jina-ai / clip-as-service

🏄 Scalable embedding, reasoning, ranking for images and sentences with CLIP
https://clip-as-service.jina.ai

Does clip-as-service support micro batching? #766

Closed dathudeptrai closed 2 years ago

dathudeptrai commented 2 years ago

Hi, thanks for the fantastic framework; it helps me greatly in my projects. I want to ask whether this framework supports micro-batching, like BentoML does here: https://docs.bentoml.org/en/0.13-lts/guides/micro_batching.html. I think it would improve the server side a lot.

numb3r3 commented 2 years ago

Thanks a lot for your interest!

Unfortunately, micro-batching is not supported yet. We agree that clip-as-service would also benefit from it. The biggest challenge here is to develop a scheduling algorithm that balances maximum throughput against latency. We need to dig into this algorithm to achieve decent performance.

BTW, may I know what throughput you are targeting in your project? Maybe there is another way to meet your needs.

dathudeptrai commented 2 years ago

@numb3r3 Thanks for replying to me quickly.

> The biggest challenge here is to develop a scheduling algorithm that balances maximum throughput against latency

In this regard, I think we just need options such as max_latency and max_batch_size, and then let users tune those parameters to balance their latency and throughput. We can begin with a simple scheduling algorithm that assumes the number of requests sent to the server is stable; see the sketch below.
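A minimal sketch of what I mean, assuming an asyncio-based server (the names `max_batch_size`, `max_latency`, `run_batch`, and the queue/future plumbing are illustrative, not clip-as-service APIs):

```python
import asyncio
import time

async def batch_worker(queue: asyncio.Queue, run_batch, max_batch_size=8, max_latency=0.01):
    """Collect requests until max_batch_size items arrive or max_latency
    seconds pass, then run a single batched model call."""
    while True:
        x, fut = await queue.get()                       # block until the first request
        batch, futures = [x], [fut]
        deadline = time.monotonic() + max_latency
        while len(batch) < max_batch_size:
            timeout = deadline - time.monotonic()
            if timeout <= 0:
                break
            try:
                x, fut = await asyncio.wait_for(queue.get(), timeout)
            except asyncio.TimeoutError:
                break
            batch.append(x)
            futures.append(fut)
        for fut, out in zip(futures, run_batch(batch)):  # one batched forward pass
            fut.set_result(out)

async def encode_one(queue: asyncio.Queue, x):
    """What a request handler would do: enqueue the input and await its result."""
    fut = asyncio.get_running_loop().create_future()
    await queue.put((x, fut))
    return await fut
```

With a stable request rate, tuning max_latency against the expected inter-arrival time and max_batch_size against GPU memory should be enough for a first version.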

> BTW, may I know what throughput you are targeting in your project? Maybe there is another way to meet your needs.

In my project, I have a model that uses only 5 GB of GPU VRAM with batch_size <= 8. I also simulated a situation where the number of requests is much larger than it would be in reality, to check whether my GPU VRAM usage grows. With batch_size = 8, our model takes only 1.5-2x the inference time of batch_size = 1, without any increase in GPU VRAM.

numb3r3 commented 2 years ago

Yes, I agree that simple scheduling should be a good starting point. We will think seriously about how to proceed. Regarding your use case, I cannot see where the bottleneck is; maybe I am misunderstanding something.

dathudeptrai commented 2 years ago

@numb3r3 In our case, we start with 1 worker (1 replica), so our server processes requests sequentially; that is why we need micro-batching. I see in your code that you use async for the encode function, but I'm not sure whether deep learning frameworks such as PyTorch or TensorFlow can run asynchronously?

numb3r3 commented 2 years ago

Yes, async works well with deep learning models. We ran some experiments confirming that async benefits PyTorch/TensorFlow. And feel free to use replicas, since they can share a single GPU.
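For reference, one common pattern (only a sketch, not necessarily how clip-as-service implements its encode) is to offload the blocking forward pass to a worker thread so the event loop stays free to accept new requests:

```python
import asyncio

async def encode(model, batch):
    # The forward pass itself is still synchronous; running it via
    # asyncio.to_thread (Python 3.9+) keeps the event loop responsive
    # while the GPU is busy. On older Pythons, loop.run_in_executor
    # achieves the same thing.
    return await asyncio.to_thread(model, batch)
```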

dathudeptrai commented 2 years ago

@numb3r3 Thanks. I will try playing with async for the deep learning model to see the performance boost :D. Anyway, if Jina can support both batching and async, then we can maximize GPU utilization.

numb3r3 commented 2 years ago

We will close this issue for now. If you have any findings, you are welcome to share them with the community. Thanks!