jina-ai / clip-as-service

🏄 Scalable embedding, reasoning, ranking for images and sentences with CLIP
https://clip-as-service.jina.ai

Why ZMQ? #70

Closed applenob closed 5 years ago

applenob commented 5 years ago

Hi, what benefits do we get from building a model service with ZMQ? Thanks.

jayjaywg commented 5 years ago

Same question. What are the benefits of ZMQ compared with TensorFlow Serving?

usakey commented 5 years ago

Same question: is there any intuition behind choosing ZMQ here? Thanks.

hanxiao commented 5 years ago

As much as I like reading "why not A instead of B" debates on the web, I personally find it difficult to convince others on this type of question. People who ask it often have preconceived ideas about A and B, and it's hard to change their minds.

Let's first agree on the client/server structure of this design. By decoupling the main BERT model from the downstream network, you gain scalability, reusability and efficiency. For example, your team can share one well-trained BERT model as a feature extractor and deploy it on a powerful, pay-per-use GPU machine. All they have to do is use BertClient to connect to this shared server and run feedforward prediction, which does not necessarily need a GPU or even deep learning. If feature extraction is the bottleneck, scale up your GPU machines. If the downstream network is the bottleneck, add more CPU machines. If the features are stale or concept-drifted, retrain your BERT and version-control it; all downstream networks immediately enjoy the updated features. Finally, since all feature-extraction requests go to one BertServer, there are fewer idle cycles on the GPU machine and every penny is well spent.
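To make the decoupling concrete, here is a minimal sketch of the client side: the GPU box runs BertServer, while any CPU-only machine pulls fixed-size feature vectors through BertClient and trains a small downstream model locally. The server address and the scikit-learn classifier are illustrative choices, not part of the project.

```python
# Sketch: a CPU-only downstream client of a shared BertServer.
# Assumes the server is already running and reachable at SERVER_IP (placeholder).
from bert_serving.client import BertClient
from sklearn.linear_model import LogisticRegression

SERVER_IP = 'localhost'  # replace with your shared GPU server

bc = BertClient(ip=SERVER_IP)

# Feature extraction happens remotely on the GPU machine.
texts = ['the service is great', 'the service is terrible']
labels = [1, 0]
features = bc.encode(texts)          # shape: (n_samples, 768) for BERT-Base

# The downstream model stays small and runs comfortably on CPU.
clf = LogisticRegression().fit(features, labels)
print(clf.predict(bc.encode(['pretty good overall'])))
```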

Now let me take a wild guess: you are mainly concerned about the communication stack, i.e. why not A instead of ZMQ? Although all three questions are listed under the same issue, @applenob and @usakey didn't name an alternative, whereas @jayjaywg explicitly asked for a comparison with tf-serving. Note that A could also be RabbitMQ, Apache Kafka, DDS, etc. So I wouldn't say you all have the same question.

That being said, I'd like to answer this question in two directions.

Why not A instead of ZMQ, where A is something less popular, backed by a smaller community, or a home-made communication stack?

A: ZMQ allows complex message-exchange patterns with minimal effort. Personally, I really like how simple its API is and how many rich patterns can be implemented with just send() and recv(). In this project, I used PUSH-PULL, PUB-SUB, inter-process communication and intra-process communication. In fact, any messaging pattern I can think of turns into real working code within a few hours; if not, I can always get support from its active community on Stack Overflow or GitHub.
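For readers unfamiliar with ZMQ, here is a minimal sketch of the PUSH-PULL pattern mentioned above using pyzmq. It is not the project's actual code; the port number and message layout are arbitrary.

```python
# Minimal PUSH-PULL sketch with pyzmq: a producer pushes work items,
# one or more workers pull and process them. Port 5557 is arbitrary.
import zmq

def producer():
    ctx = zmq.Context()
    push = ctx.socket(zmq.PUSH)
    push.bind('tcp://*:5557')
    for i in range(10):
        push.send_json({'job_id': i, 'text': 'sentence %d' % i})

def worker():
    ctx = zmq.Context()
    pull = ctx.socket(zmq.PULL)
    pull.connect('tcp://localhost:5557')
    while True:
        job = pull.recv_json()          # blocks until a job arrives
        print('processing job', job['job_id'])
```

Adding more workers that connect to the same endpoint automatically fans the jobs out across them, which is exactly the kind of pattern that takes only a few lines with ZMQ.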

Just FYI, I'm not the only one who marries ZMQ with TensorFlow; here is a more popular example: https://github.com/tensorpack/tensorpack, which uses ZMQ for internal communication. In fact, its author was the one who first pointed this out to me in my blog post.

Why not A instead of ZMQ, where A is something more popular and backed by a bigger community, e.g. tf-serving?

A: I don't have much experience with tf-serving, but a few minutes reading its docs suggests it is overkill for this problem: more dependencies (Docker container, etc.), a longer round trip per request, and less flexible request batching on the server side. To me, tf-serving is a well-abstracted, high-level package that already solves problems I don't see at the moment. I also believe one could implement most (if not all) features of bert-as-service using tf-serving (even PUB-SUB and "async encoding"). Nonetheless, I prefer not to over-engineer the problem and to solve only the problems and requirements I see right now. After all, the value of a piece of software is determined by whether it solves your problem, not by the tech stack it's using. If I want to add a new feature, say more sophisticated server-side batching or scheduling, I can always implement it at the middle level with ZMQ, which gives me a much clearer picture of what happens under the hood.
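To make the "server-side batching" point concrete, here is a rough sketch of how a ZMQ PULL loop could collect incoming requests into a batch before running one forward pass. This is not the project's actual implementation; model_encode, the port, the batch size and the wait window are all placeholders.

```python
# Illustrative sketch of server-side batching over a ZMQ PULL socket:
# drain whatever requests arrived within a short window, then run one
# forward pass on the whole batch. model_encode() is a placeholder.
import zmq

MAX_BATCH = 32          # illustrative values
WAIT_MS = 10

def serve(model_encode, port=5558):
    ctx = zmq.Context()
    pull = ctx.socket(zmq.PULL)
    pull.bind('tcp://*:%d' % port)
    poller = zmq.Poller()
    poller.register(pull, zmq.POLLIN)

    while True:
        batch = [pull.recv_json()]                 # block for the first request
        while len(batch) < MAX_BATCH and poller.poll(WAIT_MS):
            batch.append(pull.recv_json())         # greedily drain the queue
        texts = [req['text'] for req in batch]
        vectors = model_encode(texts)              # one GPU call for the batch
        # ... route each vector back to its requester (omitted)
```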

That said, I'm not stopping anyone from exploring alternatives to the ZMQ communication stack. If you have experience with both A and ZMQ, perfect! Teach me your lessons. Discussions and improvements are the very essence of the OSS community. If you find anything better, or anything wrong with the current repo, you are always welcome to contribute.

usakey commented 5 years ago

@hanxiao Thanks a lot for this detailed explanation.

Since the serving stack I'm most familiar with for TensorFlow is tf-serving, my comparison with ZMQ is tf-serving, for sure ;). I'm writing a tf-serving version of bert-as-service at the moment; maybe we can discuss and compare in detail later, especially the benchmark part.

loretoparisi commented 5 years ago

I can add my two cents on this. Currently I'm using a Tornado application (so not WSGI) to serve models, with model instances kept as singletons across the application. I chose Tornado because it's completely async, whereas with a WSGI application I could not keep a singleton model running on the main thread. So I could either move to ZMQ instead of Tornado, or combine them using @hanxiao's multiprocessing approach via ZMQ queueing. That's pretty powerful!
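As one way to picture that combination (a sketch under assumptions, not loretoparisi's actual setup), a Tornado handler can stay fully async by forwarding each request over a zmq.asyncio socket to a separate worker process that holds the model. The endpoint, port and route here are hypothetical, and a REQ/REP worker is assumed to be listening on the other side.

```python
# Sketch: async Tornado handler forwarding work to a ZMQ-backed model worker.
# Assumes a worker answers REQ/REP on tcp://localhost:5559 (hypothetical).
import json
import tornado.ioloop
import tornado.web
import zmq
import zmq.asyncio

ctx = zmq.asyncio.Context()

class EncodeHandler(tornado.web.RequestHandler):
    async def post(self):
        sock = ctx.socket(zmq.REQ)
        sock.connect('tcp://localhost:5559')
        await sock.send_json(json.loads(self.request.body))
        reply = await sock.recv_json()      # event loop stays free while waiting
        sock.close()
        self.write(reply)

if __name__ == '__main__':
    app = tornado.web.Application([(r'/encode', EncodeHandler)])
    app.listen(8888)
    tornado.ioloop.IOLoop.current().start()
```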

johndpope commented 5 years ago

I used to think gRPC would be the best path forward for future interoperability with trained models and microservices. But with grpclib's Python 2 incompatibility, it's an uphill battle to get some code working. https://github.com/google/sling/issues/210

ZMQ is state of the art out of the box. The YouTube videos may be old (https://www.youtube.com/results?search_query=zmq), but the underlying technology was built for stock trading, with high availability and fault tolerance in mind. There are a lot of client libraries supporting different configurations.

hanxiao commented 5 years ago

Starting from 1.5.5, bert-as-service implements the following pipeline:

  1. load graph
  2. freeze (constant-ize all variables)
  3. optimize (remove inference-irrelevant nodes)
  4. serialize
  5. serve graph

You can think of this as similar to "ExportModel()" or "SavedModelBuilder()" in tf-serving; see the sketch below for a rough picture of steps 2-4.
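As a rough illustration of steps 2-4, freezing and optimizing a graph with the stock TF 1.x utilities can look like the following. This is a sketch, not necessarily the exact calls bert-as-service makes; the node names and the output file name are placeholders.

```python
# Sketch of freezing and optimizing a TF 1.x graph. Node names are placeholders.
import tensorflow as tf
from tensorflow.python.tools.optimize_for_inference_lib import optimize_for_inference

def freeze_and_optimize(sess, input_names, output_names):
    # Step 2: replace variables with constants so the graph is self-contained.
    frozen = tf.graph_util.convert_variables_to_constants(
        sess, sess.graph_def, output_names)
    # Step 3: strip nodes that are irrelevant for inference (training ops, etc.).
    optimized = optimize_for_inference(
        frozen, input_names, output_names, tf.float32.as_datatype_enum)
    # Step 4: serialize to disk so the server can load it later.
    with tf.gfile.GFile('optimized_graph.pb', 'wb') as f:
        f.write(optimized.SerializeToString())
```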

You can do pip install -U bert-serving-server bert-serving-client to upgrade.

Benchmarks show that this feature does not significantly affect inference speed or memory footprint. So if you are hoping that switching the communication stack to tf-serving will improve efficiency, the evidence points the other way.

Nonetheless, I do see that it opens the possibility for more sophisticated graph optimization, e.g. using XLA.
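For reference, the generic TF 1.x switch for XLA JIT compilation looks roughly like this; it is a sketch of the TensorFlow option itself, not a flag that bert-as-service exposes.

```python
# Sketch: turning on XLA JIT compilation for a TF 1.x session.
import tensorflow as tf

config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
sess = tf.Session(config=config)   # eligible ops get JIT-compiled via XLA
```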