jina-ai / serve

☁️ Build multimodal AI applications with cloud-native stack
https://jina.ai/serve
Apache License 2.0
21.13k stars 2.22k forks

Jina's Client is extremely buggy and slow #5856

Closed Murugurugan closed 1 year ago

Murugurugan commented 1 year ago

I've encountered a problem in my pipeline when trying to have certain Executors run at specific moments, like database writes and reads. These Executors can't go inside the Flow DAG, because I don't want them executed sequentially, but only when the stars align, and I need to be able to go back and forth in the execution order. Basically I need a different type of network topology, something with much more freedom and connectivity, like a mesh topology of microservices. At least the backend should work like this; the frontend can stay the classic Flow.

There seem to be only two solutions for this: either Flow conditioning or multiple Deployments (or multiple Flows).

I quickly figured out that Flow conditioning is probably useless for my use case, and multiple Deployments require multiple ports, and therefore multiple clients as well. That wouldn't be such a problem with regular third-party HTTP requests, if they actually worked. Well, they didn't. After hundreds of tests I realized where the problem is: Deployments need the Flow to actually have exposed entrypoints, probably because they have no gateway. So something like this won't work:

dep = Deployment(protocol=['http'], port=[8081], uses="quick_test/config.yml", entrypoint="/test")

Anyway, this is where I actually started to lose my mind. Not only did I now have to rethink the whole architecture and move most of the logic client-side, outside of Flows and Executors, but I also started to discover that Jina's native Client barely works.

I created a simple Deployment like the one above, with an Executor that only outputs simple text. Then I tested it with this code:

import time

from jina import Client, DocumentArray

s = time.perf_counter()
c = Client(host="localhost", port=8081, protocol='http')
da = c.post('/test', DocumentArray.empty(1))
print(da.texts)
e = time.perf_counter()
print(e - s)

And what do I see? The correct result, but in 5 seconds, every single time. After getting furious for a moment I changed "localhost" to "127.0.0.1", which actually worked and lowered it to 250 ms. Probably just a bug, since Deployments are a fairly new thing if I understand correctly. Anyway, I needed to lower that latency.
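A quick way to check whether a gap like that comes from hostname resolution rather than from Jina itself is to time the lookup directly. This is a hypothetical stdlib-only diagnostic; the port number just mirrors the example above:

```python
import socket
import time

# Time how long each host string takes to resolve to an address.
# On some Windows/WSL setups, "localhost" first tries IPv6 (::1) and can
# stall before falling back to IPv4, while "127.0.0.1" needs no lookup.
for host in ("localhost", "127.0.0.1"):
    s = time.perf_counter()
    socket.getaddrinfo(host, 8081)
    e = time.perf_counter()
    print(f"{host}: {(e - s) * 1000:.1f} ms")
```

If "localhost" is also slow here, the delay is in the OS resolver, not in the Client.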


So I tried to actually measure what's going on, with this code:

c = Client(host='http://127.0.0.1:8081')
c.profiling()

...which, by the way, also didn't work every time; sometimes I had to switch to this instead (for some magical reason):

c = Client(host='127.0.0.1:8081', protocol='http')
c.profiling()

...and sometimes even that didn't work, so I had to rearrange the whole constellation so the stars could align, because it mostly works with gRPC only. Anyway...


20% of the time it gave me this error:

...
jina.excepts.BadClient: no such endpoint http://127.0.0.1:8081/post

40% of the time it gave me this error:

Traceback (most recent call last):
  File "e:\Muru\WORK\MAIN\SYSTEM\client_latency_test.py", line 20, in <module>
    c.profiling()
  File "E:\Muru\venv\Lib\site-packages\jina\clients\mixin.py", line 190, in profiling
    return _render_response_table(r[0], st, ed, show_table=show_table)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\Muru\venv\Lib\site-packages\jina\clients\mixin.py", line 159, in _render_response_table
    'Gateway-executors network', server_network, server_network / gateway_time
                                                 ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~
ZeroDivisionError: division by zero

30% of the time it gave me this error:

Traceback (most recent call last):
  File "E:\Muru\WORK\main\SYSTEM\client_latency_test.py", line 19, in <module>
    c.profiling()
  File "E:\Muru\venv\Lib\site-packages\jina\clients\mixin.py", line 190, in profiling
    return _render_response_table(r[0], st, ed, show_table=show_table)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\Muru\venv\Lib\site-packages\jina\clients\mixin.py", line 132, in _render_response_table
    route[0].end_time.ToMilliseconds() - route[0].start_time.ToMilliseconds()
    ~~~~~^^^
IndexError: list index out of range

And finally, 10% of the time it gave me an actual answer, like this:

Roundtrip  79ms  100% 
├──  Client-server network  78ms  99%
└──  Server  1ms  1%
    ├──  Gateway-executors network  1683664004186ms  168366400418600%
    └──  executor  -1683664004185ms  -168366400418500%

Time measured via time.perf_counter(): 104 ms. (Those huge Gateway numbers look like a raw Unix epoch timestamp in milliseconds leaking into the calculation, i.e. a start time of zero.)

This last test was when I tried to make it faster by using gRPC. I couldn't get it under 60 ms. And what I don't understand is how it can be this slow at all. If I understand this correctly, the Executors are basically communicating instantly (1 ms) over gRPC, but when I send the request from the client it takes far longer. Is that normal?

I was confused, so I tested my own Python gRPC server/client with plain protobufs, and it takes just 2 ms to send and receive a message. Why is Client.post so slow then? Even a regular "third-party" HTTP POST request was way faster than the native one (that 1% of the time it actually worked): about 30 ms.
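For what it's worth, here is the kind of stdlib-only harness that separates one-time connection setup from steady-state request cost (a sketch with a hypothetical local echo server on port 8099, nothing Jina-specific). That difference matters when comparing clients that do or don't reuse connections:

```python
import http.client
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

class Echo(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # keep-alive: one TCP connection serves all requests

    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)  # echo the request body back

    def log_message(self, *args):
        pass  # silence per-request logging

server = HTTPServer(("127.0.0.1", 8099), Echo)
threading.Thread(target=server.serve_forever, daemon=True).start()

# One persistent connection: only the first request pays the TCP setup cost.
conn = http.client.HTTPConnection("127.0.0.1", 8099)
timings = []
for _ in range(20):
    s = time.perf_counter()
    conn.request("POST", "/test", body=b"{}")
    resp = conn.getresponse()
    resp.read()
    timings.append(time.perf_counter() - s)

print(f"first request: {timings[0] * 1000:.2f} ms")
print(f"fastest later: {min(timings[1:]) * 1000:.2f} ms")
server.shutdown()
```

If the first request is much slower than the rest, the client is paying a per-connection cost, not a per-request one.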


Environment

JoanFM commented 1 year ago

Hey @Murugurugan ,

Sorry for the experience, I will look into it ASAP.

Can you tell me the exact Deployment you are trying to connect to?

JoanFM commented 1 year ago

My experiment:

I use this in server.py:

import os

os.environ['JINA_LOG_LEVEL'] = 'DEBUG'

from jina import Executor, requests, Deployment

class TextExecutor(Executor):

    @requests
    def post(self, docs, **kwargs):
        for doc in docs:
            doc.text = 'test'

dep = Deployment(protocol='http', port=8081, uses=TextExecutor)

with dep:
    dep.block()

and on client side:

import os
os.environ['JINA_LOG_LEVEL'] = 'DEBUG'

import time

from jina import Client, DocumentArray

s = time.perf_counter()
c = Client(host="localhost", port=8081, protocol='http')
da = c.post('/test', DocumentArray.empty(1))
print(da.texts)
e = time.perf_counter()
print(e-s)

I get 0.07 as the result every time.

Murugurugan commented 1 year ago

I literally took your code, server.py and client.py, and changed nothing except adding if __name__ == '__main__':, otherwise it won't run on Windows. So I changed server.py like this:

import os

os.environ['JINA_LOG_LEVEL'] = 'DEBUG'

from jina import Executor, requests, Deployment

class TextExecutor(Executor):

    @requests
    def post(self, docs, **kwargs):
        for doc in docs:
            doc.text = 'test'

dep = Deployment(protocol='http', port=8081, uses=TextExecutor)

if __name__ == '__main__':

    with dep:
        dep.block()

And when I ran the server and the client, these are my results:

client.py results:

[screenshot: results_client]

server.py logs:

[screenshot: server_log]

And again, when I change "localhost" to "127.0.0.1", it takes about 140-250 ms.

JoanFM commented 1 year ago

Can you run the following?

import os
os.environ['JINA_LOG_LEVEL'] = 'DEBUG'

import time

from jina import Client, DocumentArray

s = time.perf_counter()
c = Client(host="localhost", port=8081, protocol='http')
da = c.post('/test', DocumentArray.empty(1))
print(da.texts)
e = time.perf_counter()
print(e-s)
for _ in range(100):
    s = time.perf_counter()
    da = c.post('/test', DocumentArray.empty(1))
    print(da.texts)
    e = time.perf_counter()
    print(e-s)

I would like to see if it works better after the first one.

Murugurugan commented 1 year ago

Okay, yeah it does get faster, twice as fast.

['test']
4.200061199999936
['test']
2.031936799999812
['test']
2.036665399999947
['test']
2.0299913999999717
['test']
2.038713899999948
['test']
2.0371560999999474
...
JoanFM commented 1 year ago

And what do you get with this?

import os
os.environ['JINA_LOG_LEVEL'] = 'DEBUG'

import time

from jina import Client, DocumentArray

empty_da = DocumentArray.empty(1)
s = time.perf_counter()
c = Client(host="localhost", port=8081, protocol='http')
da = c.post('/test', empty_da)
print(da.texts)
e = time.perf_counter()
print(e-s)

for _ in range(100):
    s = time.perf_counter()
    da = c.post('/test', empty_da)
    print(da.texts)
    e = time.perf_counter()
    print(e-s)

import requests
s = time.perf_counter()

req = {
    "data": [{}],
    "parameters": {}
}
resp = requests.post(url='http://localhost:8081/default', json=req)

print(f' resp {resp.content}')
e = time.perf_counter()
print(e-s)
JoanFM commented 1 year ago

This is what I get in Linux:

['test']
0.10898248699959368
['test']
0.011769082004320808
['test']
0.011453041995991953
['test']
0.010627055002260022
['test']
0.01153672298823949
['test']
0.011217517996556126
['test']
0.011053108013584279
['test']
0.010612120007863268
['test']
0.011045614999602549
['test']
0.011519476989633404
['test']
0.010964082001009956
['test']
0.011012365997885354
['test']
0.011188687989488244
['test']
0.010541165000176989
['test']
0.010776438008178957
['test']
0.011449725992861204
['test']
0.01049561299441848
['test']
0.010651998003595509
['test']
0.011115158995380625
['test']
0.010833978012669832
['test']
0.010872364990063943
['test']
0.011018292992957868
['test']
0.010725624990300275
['test']
0.010813956992933527
['test']
0.011006718996213749
['test']
0.010714828997151926
['test']
0.010924026006250642
['test']
0.010838191010407172
['test']
0.00996307999594137
['test']
0.011684621989843436
['test']
0.011067020997870713
['test']
0.010588671008008532
['test']
0.011267406996921636
['test']
0.011012162009137683
['test']
0.010638407999067567
['test']
0.010643364003044553
['test']
0.01061262100120075
['test']
0.010618576998240314
['test']
0.01041026200982742
['test']
0.011185989002115093
['test']
0.010380391002399847
['test']
0.010836517001735047
['test']
0.01040787500096485
['test']
0.010493641995708458
['test']
0.010610106997773983
['test']
0.010898666005232371
['test']
0.010548381003900431
['test']
0.010603590999380685
['test']
0.01114439099910669
['test']
0.011120484006823972
['test']
0.010301164991687983
['test']
0.010246051009744406
['test']
0.01065213599940762
['test']
0.010105004999786615
['test']
0.01057562899950426
['test']
0.010389155009761453
['test']
0.010957641992717981
['test']
0.010622950998367742
['test']
0.010508103005122393
['test']
0.010567304998403415
['test']
0.010400104001746513
['test']
0.010988701003952883
['test']
0.010670301999198273
['test']
0.010218219002126716
['test']
0.010510945998248644
['test']
0.010379749001003802
['test']
0.009946860998752527
['test']
0.010221970995189622
['test']
0.010163227998418733
['test']
0.011409312995965593
['test']
0.01011558399477508
['test']
0.010344843991333619
['test']
0.010524141995119862
['test']
0.011508722003782168
['test']
0.011767099000280723
['test']
0.010998081008438021
['test']
0.011452699996880256
['test']
0.0114565149997361
['test']
0.010914037004113197
['test']
0.01109352101047989
['test']
0.010872405007830821
['test']
0.01106825600436423
['test']
0.011133032996440306
['test']
0.011137288995087147
['test']
0.011173883991432376
['test']
0.01030526300019119
['test']
0.010830034007085487
['test']
0.010500771008082666
['test']
0.010387678994447924
['test']
0.011549285001819953
['test']
0.010790582004119642
['test']
0.010534425004152581
['test']
0.010570655998890288
['test']
0.010589405006612651
['test']
0.010691502000554465
['test']
0.010578456000075676
['test']
0.010627451003529131
['test']
0.010252741994918324
['test']
0.010317854001186788
['test']
0.011181480003870092
['test']
0.010497372000827454
 resp b'{"data":[{"id":"c3bab1809667ac91e0d6f20fe50bad06","parent_id":null,"granularity":null,"adjacency":null,"blob":null,"tensor":null,"mime_type":null,"text":"test","weight":null,"uri":null,"tags":null,"offset":null,"location":null,"embedding":null,"modality":null,"evaluations":null,"scores":null,"chunks":null,"matches":null}],"parameters":{}}'
0.009073478999198414

The first difference can be explained by the fact that on the first run the client reads the OpenAPI spec to understand the models to send.

After that, the rest is really comparable to using the requests client.

Murugurugan commented 1 year ago

I am still getting this, but it is true that when I change it to 127.0.0.1, I suddenly get results like yours.

['test']
4.160628700000416
['test']
2.0312315000001036
['test']
2.03588179999997
['test']
2.036613799999941
['test']
2.0271904999999606
['test']
2.0422684999998637
['test']
2.041764200000216
['test']
2.0247056000002885
['test']
2.042262099999789
['test']
2.0545544000001428
['test']
2.0365921000002345
['test']
2.0321647000000667
['test']
2.0308255999998437
['test']
2.044621499999721
['test']
2.0370342999999593
['test']
2.0252745000002506
['test']
2.034746599999835
['test']
2.0457510999999613
['test']
2.0439958000001752
['test']
2.0342869000000974
['test']
2.0467282999998133
['test']
2.0443199000001187
['test']
2.03281300000026
['test']
2.035817500000121
['test']
2.028013000000101
['test']
2.0286221999999725
['test']
2.039803600000141
['test']
2.0531185999998343
['test']
2.0319737999998324
['test']
2.0282196999996813
['test']
2.028098300000238
['test']
2.0305358999999044
['test']
2.038774999999987
['test']
2.0402257999999165
['test']
2.037547999999788
['test']
2.0338246000001163
['test']
2.0448294000002534
['test']
2.026146499999868
['test']
2.040774699999929
['test']
2.031962999999905
['test']
2.0317370000002484
['test']
2.0392510000001494
['test']
2.039718300000004
['test']
2.0410488999996232
['test']
2.032601899999918
['test']
2.0373814999998103
['test']
2.0408912999996573
['test']
2.041232499999751
['test']
2.0316265000001295
['test']
2.0297842999998466
['test']
2.037377000000106
['test']
2.032652599999892
['test']
2.037771199999952
['test']
2.029459400000178
['test']
2.047274699999889
['test']
2.034747100000004
['test']
2.0306393999999273
['test']
2.0297521000002234
['test']
2.0268701999998484
['test']
2.0357491999998274
['test']
2.029838899999959
['test']
2.036871400000109
['test']
2.028715999999804
['test']
2.0279598000001897
['test']
2.037033699999938
['test']
2.03019119999999
['test']
2.0393371000000116
['test']
2.033587500000067
['test']
2.031843899999785
['test']
2.0348205000000235
['test']
2.0298431000001074
['test']
2.0536080000001675
['test']
2.0222051999999167
['test']
2.0293901000000005
['test']
2.0318482000002405
['test']
2.031179300000076
['test']
2.0378814000000602
['test']
2.0410765000001447
['test']
2.0335102000003644
['test']
2.0338300000003073
['test']
2.035852999999861
['test']
2.0375968999996985
['test']
2.0319910999996864
['test']
2.0361032999999225
['test']
2.045092400000158
['test']
2.0326396000000386
['test']
2.036903799999891
['test']
2.034719199999927
['test']
2.028596000000107
['test']
2.0333110999999917
['test']
2.0356370999998035
['test']
2.028200200000356
['test']
2.0533878999999615
['test']
2.042381900000237
['test']
2.041690099999869
['test']
2.0484427999999753
['test']
2.0273815999998988
['test']
2.04491519999965
['test']
2.03967590000002
['test']
2.066232800000307
['test']
2.045284900000297
 resp b'{"data":[{"id":"4e96a934fa9274b905b685400702a1e9","parent_id":null,"granularity":null,"adjacency":null,"blob":null,"tensor":null,"mime_type":null,"text":"test","weight":null,"uri":null,"tags":null,"offset":null,"location":null,"embedding":null,"modality":null,"evaluations":null,"scores":null,"chunks":null,"matches":null}],"parameters":{}}'
2.0286191000000144
JoanFM commented 1 year ago

I think there must be some internal name resolution going on inside your Windows machine.

JoanFM commented 1 year ago

Also I see the results are comparable with simply using requests.

Murugurugan commented 1 year ago

It's definitely possible that something is going on with my networking; maybe it's because of WSL.

I am glad the speed is actually much faster when sending multiple requests; I hope it stays that way within one session for as long as the client script runs. I managed to get the gRPC speed to "good enough", around 16 ms. But on my test bench, plain gRPC via protobufs still runs at 2 ms.

Is there any way I could close that gap with Jina, or is the added complexity always going to make it inevitably slower, even if I made my own custom solution without Jina?

JoanFM commented 1 year ago

The usage of WSL may have an impact, for sure.

Can I see the gRPC solution you are using?

Murugurugan commented 1 year ago

It's basically just the tutorial example, very basic gRPC communication like this:

client.py:

import grpc
from communication_pb2_grpc import RecommendationsStub
from communication_pb2 import BookCategory, RecommendationRequest
import time

channel = grpc.insecure_channel("127.0.0.1:50051")

client = RecommendationsStub(channel)

for _ in range(100):
    s = time.perf_counter()
    request = RecommendationRequest(user_id=1, category=BookCategory.MYSTERY, max_results=1)
    dd = client.Recommend(request)
    e = time.perf_counter()
    print(e-s)
    print(dd)
    time.sleep(.1)

server.py:

from concurrent import futures

import random
import grpc

from communication_pb2 import BookCategory, BookRecommendation, RecommendationResponse
import communication_pb2_grpc

books_by_category = {
    BookCategory.MYSTERY: [
        BookRecommendation(id=1, title="The Maltese Falcon"),
        BookRecommendation(id=2, title="Murder on the Orient Express"),
        BookRecommendation(id=3, title="The Hound of the Baskervilles"),
    ],
    BookCategory.SCIENCE_FICTION: [
        BookRecommendation(id=4, title="The Hitchhiker's Guide to the Galaxy"),
        BookRecommendation(id=5, title="Ender's Game"),
        BookRecommendation(id=6, title="The Dune Chronicles"),
    ],
    BookCategory.SELF_HELP: [
        BookRecommendation(id=7, title="The 7 Habits of Highly Effective People"),
        BookRecommendation(id=8, title="How to Win Friends and Influence People"),
        BookRecommendation(id=9, title="Man's Search for Meaning"),
    ],
}

class RecommendationService(communication_pb2_grpc.RecommendationsServicer):

    def Recommend(self, request, context):
        print(request)
        if request.category not in books_by_category:
            context.abort(grpc.StatusCode.NOT_FOUND, "Category not found")
        books_for_category = books_by_category[request.category]
        num_results = min(request.max_results, len(books_for_category))
        books_to_recommend = random.sample(books_for_category, num_results)
        return RecommendationResponse(recommendations=books_to_recommend)

def serve():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    communication_pb2_grpc.add_RecommendationsServicer_to_server(RecommendationService(), server)
    server.add_insecure_port("[::]:50051")
    server.start()
    server.wait_for_termination()

if __name__ == "__main__":
    serve()
JoanFM commented 1 year ago

And what comparison do you get?

Murugurugan commented 1 year ago

The comparison is that Jina is 8x slower: 16 ms vs 2 ms. And I am not sure whether it's due to necessary added complexity or to badly optimized code.

JoanFM commented 1 year ago

Also, can you tell me how to generate communication_pb2 and so on? What is the protobuf definition, or is it an already created one?

JoanFM commented 1 year ago

Can you try with this code?

import os

os.environ['JINA_LOG_LEVEL'] = 'DEBUG'

from jina import Executor, requests, Deployment

class TextExecutor(Executor):

    @requests
    def post(self, docs, **kwargs):
        for doc in docs:
            doc.text = 'test'

dep = Deployment(protocol='grpc', port=8081, uses=TextExecutor, include_gateway=False)

if __name__ == '__main__':

    with dep:
        dep.block()

I can explain later if it shows better results

Murugurugan commented 1 year ago

The communication_pb2 modules are generated from protobufs, for example like this:

protoc -I=$SRC_DIR --python_out=$DST_DIR $SRC_DIR/addressbook.proto
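One caveat: `--python_out` alone only generates the message classes (communication_pb2.py). The service stub imported above (communication_pb2_grpc.py) comes from the gRPC plugin, typically invoked via the grpcio-tools package (a sketch; paths are assumptions):

```shell
# Generates both communication_pb2.py and communication_pb2_grpc.py
python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. communication.proto
```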

The proto file it is generated from:

syntax = "proto3";

enum BookCategory {
    MYSTERY = 0;
    SCIENCE_FICTION = 1;
    SELF_HELP = 2;
}

message RecommendationRequest {
    int32 user_id = 1;
    BookCategory category = 2;
    int32 max_results = 3;
}

message BookRecommendation {
    int32 id = 1;
    string title = 2;
}

message RecommendationResponse {
    repeated BookRecommendation recommendations = 1;
}

service Recommendations {
    rpc Recommend (RecommendationRequest) returns (RecommendationResponse);
}

And I also thought it could be the gateway slowing it down, but running your example shows no difference in latency. Well, it's actually maybe 2 ms faster.

['test']
0.060913799999980256
DEBUG  GRPCClient@22220 connected to 127.0.0.1:8081                                                                                                                                      
['test']
0.013816599999699974
DEBUG  GRPCClient@22220 connected to 127.0.0.1:8081                                                                                                                                      
['test']
0.013448499999867636
DEBUG  GRPCClient@22220 connected to 127.0.0.1:8081                                                                                                                                      
['test']
0.01263950000065961
DEBUG  GRPCClient@22220 connected to 127.0.0.1:8081                                                                                                                                      
['test']
0.01279919999979029
DEBUG  GRPCClient@22220 connected to 127.0.0.1:8081                                                                                                                                      
['test']
0.012876999999207328
DEBUG  GRPCClient@22220 connected to 127.0.0.1:8081                                                                                                                                      
['test']
0.013945500000772881
DEBUG  GRPCClient@22220 connected to 127.0.0.1:8081                                                                                                                                      
['test']
0.011934099999052705
DEBUG  GRPCClient@22220 connected to 127.0.0.1:8081                                                                                                                                      
['test']
0.013134599999830243
DEBUG  GRPCClient@22220 connected to 127.0.0.1:8081                                                                                                                                      
['test']
0.012195699999210774
DEBUG  GRPCClient@22220 connected to 127.0.0.1:8081                                                                                                                                      
['test']
0.014174399999319576
JoanFM commented 1 year ago

Can you try with stream=False in the client.post method?

JoanFM commented 1 year ago

I would also suggest running your experiment with Jina's own protofile.

Murugurugan commented 1 year ago

With stream=False it's even slower:

['test']
0.06436949999988428
DEBUG  GRPCClient@18496 connected to 127.0.0.1:8081                                                                                                                                      
['test']
0.02051469999969413
DEBUG  GRPCClient@18496 connected to 127.0.0.1:8081                                                                                                                                      
['test']
0.017628899999181158
DEBUG  GRPCClient@18496 connected to 127.0.0.1:8081                                                                                                                                      
['test']
0.02200789999915287
DEBUG  GRPCClient@18496 connected to 127.0.0.1:8081                                                                                                                                      
['test']
0.0188428000001295
DEBUG  GRPCClient@18496 connected to 127.0.0.1:8081                                                                                                                                      
['test']
0.01742790000025707
DEBUG  GRPCClient@18496 connected to 127.0.0.1:8081                                                                                                                                      
['test']
0.01647409999895899
DEBUG  GRPCClient@18496 connected to 127.0.0.1:8081                                                                                                                   [05/10/23 13:31:27]
['test']
0.021625999999741907
JoanFM commented 1 year ago

But let me name the things I see that are different and could explain the discrepancies:

I believe the combination of 1 and 3 can explain most of the penalty.

Murugurugan commented 1 year ago

Yeah, I guess that makes sense. It's the unfortunate price we have to pay. I can still try a mix of Jina Executors and protobuf apps, although I am not really sure how that would glue together.

But my biggest hope is for when the Mojo programming language comes out and we can convert all the code to make it several times faster, probably including gRPC itself. I think that will solve all these problems easily, and it should be here pretty soon, so I have no worries.

Thanks for all the help!