megaease / easemesh

A service mesh implementation for connecting, control, and observe services in spring-cloud.
https://megaease.com/easemesh
Apache License 2.0
507 stars 61 forks source link

JavaAgent has higher latency in the Mesh's data plane #9

Closed zhao-kun closed 3 years ago

zhao-kun commented 3 years ago

After the java agent's observability worked, we can observe the latency between two services. In our environment, we have the following invocation diagram.

                                   ┌───────────────────────────┐
                                   │                           │                   ┌──────────────┐
                    (1)            │                           │        (2)        │              │
             ┌─────────────────────►      mesh-app-backend     ├───────────────────►    db-mysql  │
             │                     │         /users/{id}       │                   │              │
             │                     │                           │                   └──────────────┘
             │                     └───────────────────────────┘
             │
┌────────────┴────────────┐
│                         │
│      mesh-app-front     │
│ /front/userFullInfo/{id}│
│                         │
└────────────┬────────────┘
             │                     ┌───────────────────────────┐
             │                     │                           │
             │      (3)            │                           │
             └─────────────────────►    mesh-app-performance   │
                                   │      /userStatus/{id}     │
                                   │                           │
                                   └───────────────────────────┘

I pick a tracing recording, I found the latency between mesh-app-frontend and mesh-app-backend service is higher.

Type Start Time Relative Time Address
Client Start 03/29 11:24:58.167_468 441μs 10.233.111.77 (mesh-app-frontend)
Server Start 03/29 11:24:58.183_069 16.042ms 10.233.67.33 (mesh-app-backend)
Server Finish 03/29 11:24:58.192_037 25.010ms 10.233.67.33 (mesh-app-backend)
Client Finish 03/29 11:24:58.193_820 26.793ms 10.233.111.77 (mesh-app-frontend)

image

The first section between the two white spots is the communication latency client service (mesh-app-frontend) to server service (mesh-app-backend) . It's apparently too high, about occupies 50% latency of the request.

zhao-kun commented 3 years ago

trace ID: 0afe403643632be1

Reproduce: Enter the our portal: https://km.megaease.cn/megacloud/app (XXXXX) Select [Service -> Records ->Trace ID (Input 0afe403643632be1) -> Select correct timerange (2021/03/29 10:00 - 2021/03/29 11:30) ->click search] Click the second span to check the detail annotations information

zhao-kun commented 3 years ago

According to @xxx7xxxx inputs, he tested /user/1 interface of ease-app-backend in the container loopback network. Got following results:

curl -w "@curl-format.txt" http://127.0.0.1:39527/users/1
{"id":1,"username":"zhao-kun2","type":null,"email":"zhao-kun2@mysql.com","status":null,"department":"IT"}     
     time_namelookup:  0.000083s
        time_connect:  0.000732s
     time_appconnect:  0.000000s
    time_pretransfer:  0.000832s
       time_redirect:  0.000000s
  time_starttransfer:  0.021994s
                     ----------
          time_total:  0.024681s
zhao-kun commented 3 years ago

For helping us diagnose the issue, I pick a none database accessing /userStatus/1 interface of mesh-app-performance to test. I tested it in two way:

 [41] 🐳  → curl -w "@/root/curl-format.txt" http://127.0.0.1:39527/userStatus/1
user 1 status is normal
            time_namelookup:  95
               time_connect:  424
            time_appconnect:  0
           time_pretransfer:  552
              time_redirect:  0
         time_starttransfer:  12654
                            ----------
                 time_total:  12757

> curl-format.txt

\n time_namelookup: %{time_namelookup}\n time_connect: %{time_connect}\n time_appconnect: %{time_appconnect}\n time_pretransfer: %{time_pretransfer}\n time_redirect: %{time_redirect}\n time_starttransfer: %{time_starttransfer}\n ----------\n time_total: %{time_total}\n \n

akwei commented 3 years ago

first inovke

curl info

curl -w "@/Users/akwei/temp/curl-format.txt" http://127.0.0.1:39526/userStatus/1
user 1 status is normal
            time_namelookup:  0.000111
               time_connect:  0.000490
            time_appconnect:  0.000000
           time_pretransfer:  0.000543
              time_redirect:  0.000000
         time_starttransfer:  0.391746
                            ----------
                 time_total:  0.391778

agent time info:

method-before:22ms, method-after:255ms

second invoke

curl info

curl -w "@/Users/akwei/temp/curl-format.txt" http://127.0.0.1:39526/userStatus/1
user 1 status is normal
            time_namelookup:  0.000119
               time_connect:  0.000662
            time_appconnect:  0.000000
           time_pretransfer:  0.000730
              time_redirect:  0.000000
         time_starttransfer:  0.010690
                            ----------
                 time_total:  0.010746

agent time info:

method-before:1ms, method-after:1ms

When I invoke multi times, I found agent time info:

method-before:4ms, method-after:1ms
method-before:1ms, method-after:1ms
method-before:1ms, method-after:2ms

When I invoke use sql,

curl info:

curl -w "@/Users/akwei/temp/curl-format.txt" http://localhost:39527/front/users/1
{"id":1,"username":"akwei","type":null,"email":"akwei@mail.com","status":null,"department":"ui"}
            time_namelookup:  0.002375
               time_connect:  0.002727
            time_appconnect:  0.000000
           time_pretransfer:  0.002772
              time_redirect:  0.000000
         time_starttransfer:  0.051289
                            ----------
                 time_total:  0.062715

agent info:

HikariDataSource$$EnhancerBySpringCGLIB$$33a61729#getConnection  method-before:0ms, method-after:0ms
HikariProxyPreparedStatement#execute  method-before:2ms, method-after:1ms

It seems that Agent did not consume a lot of time.

zhao-kun commented 3 years ago

method-before:22ms, method-after:255ms

Typos?

zhao-kun commented 3 years ago

What's the meaning of method-before and method-after?

akwei commented 3 years ago

Explain: method-before : before method invoke method-after: after method invoke

I will fix it read friendly

zhao-kun commented 3 years ago

Explain: method-before : before method invoke method-after: after method invoke

I will fix it read friendly

Does it means that the agent consumes time is method-before + method-after

akwei commented 3 years ago

Yes. This is on my computer.

benja-wu commented 3 years ago

Tcpdump pkg

Details

03:37:01.349792 lo  In IP 127.0.0.1.39527 > 127.0.0.1.58166: Flags [P.], seq 465:692, ack 339, win 86, options [nop,nop,TS val 354688258 ecr 354688227], length 227 0x0000: 4500 0117 56dc 4000 4006 e502 7f00 0001 E...V.@.@....... 0x0010: 7f00 0001 9a67 e336 1aa0 91a8 a558 fab7 .....g.6.....X.. 0x0020: 8018 0056 ff0b 0000 0101 080a 1524 1d02 ...V.........$.. 0x0030: 1524 1ce3 4854 5450 2f31 2e31 2032 3030 .$..HTTP/1.1.200 0x0040: 200d 0a43 6f6e 7465 6e74 2d54 7970 653a ...Content-Type: 0x0050: 2061 7070 6c69 6361 7469 6f6e 2f6a 736f .application/jso 0x0060: 6e0d 0a54 7261 6e73 6665 722d 456e 636f n..Transfer-Enco 0x0070: 6469 6e67 3a20 6368 756e 6b65 640d 0a44 ding:.chunked..D 0x0080: 6174 653a 2057 6564 2c20 3331 204d 6172 ate:.Wed,.31.Mar 0x0090: 2032 3032 3120 3033 3a33 373a 3031 2047 .2021.03:37:01.G 0x00a0: 4d54 0d0a 0d0a 3662 0d0a 5b7b 2269 6422 MT....6b..[{“id” 0x00b0: 3a31 2c22 7573 6572 6e61 6d65 223a 227a :1,“username”:“z 0x00c0: 6861 6f2d 6b75 6e32 222c 2274 7970 6522 hao-kun2”,“type” 0x00d0: 3a6e 756c 6c2c 2265 6d61 696c 223a 227a :null,“email”:“z 0x00e0: 6861 6f2d 6b75 6e32 406d 7973 716c 2e63 hao-kun2@mysql.c 0x00f0: 6f6d 222c 2273 7461 7475 7322 3a6e 756c om”,“status”:nul 0x0100: 6c2c 2264 6570 6172 746d 656e 7422 3a22 l,“department”:”

04:09:40.512059 lo In IP 127.0.0.1.50222 > 127.0.0.1.39527: Flags [P.], seq 474:596, ack 963, win 86, options [nop,nop,TS val 356647403 ecr 356646443], length 122

04:09:40.538563 lo In IP 127.0.0.1.39527 > 127.0.0.1.50222: Flags [P.], seq 963:1209, ack 596, win 86, options [nop,nop,TS val 356647429 ecr 356647403], length 246


In `mesh-app-backend-v1-5656c6b4f4-pjq5m       2/2     Running   0          43m   10.233.111.105   megaease-km4   <none>   ` instance, it seems that local Java process consumes nearly 10 ~ 30 ms to reply the package to backend's Ingress. Is this a normal time consuming for logic processing? 

## Analysis cmd:
`megaease-km4:➜  ~  |>sudo docker run -it  --net container:1ded7018ce94 nicolaka/netshoot `
zhao-kun commented 3 years ago

”Where Has the Time Gone“

:sweat_smile: