status-im / infra-nim-waku

Infrastructure for Nim Waku
https://github.com/status-im/nim-waku

Prod fleet slows down #50

Closed: jm-clius closed this issue 1 year ago

jm-clius commented 2 years ago

The following was observed on the wakuv2.prod fleet (2022-05-16 08:00 UTC):

jm-clius commented 2 years ago

cc @jakubgs Moved from nwaku repo

jakubgs commented 2 years ago

Something happened at midnight:

image

We can see a major drop in free ram:

image

This is on node-01.ac-cn-hongkong-c.wakuv2.prod.

jakubgs commented 2 years ago

We can see the same thing on node-01.gc-us-central1-a.wakuv2.prod:

image

image

But which came first, the chicken or the egg?

jakubgs commented 2 years ago

If we look at 7 days we can see some other spikes, so this is not new:

image

But we can see a steady growth in allocated TCP sockets:

image

That's on node-01.gc-us-central1-a.wakuv2.prod.
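
For reference, a minimal sketch of how to check the same counters by hand on the host (the "allocated TCP sockets" graph most likely corresponds to the TCP "alloc" value from /proc/net/sockstat):

# Minimal sketch: inspect TCP socket allocation on the host by hand.
# "alloc" in /proc/net/sockstat is likely the counter behind the graph above.
cat /proc/net/sockstat
# Summary view, including sockets in TIME-WAIT and orphaned states:
ss -s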

jakubgs commented 2 years ago

In comparison, the number of sockets used by node-01.do-ams3.wakuv2.prod - which is apparently fine - is lower than 100:

image

Could this be the reason @jm-clius ?

jakubgs commented 2 years ago

Also, on node-01.do-ams3.wakuv2.prod we can see that most connections are normally closed by users:

image

But on node-01.gc-us-central1-a.wakuv2.prod we can see that connections stopped being normally closed on 2022-05-12:

image

jm-clius commented 2 years ago

In comparison, the number of sockets used by node-01.do-ams3.wakuv2.prod - which is apparently fine - is lower than 100:

Thanks, this seems like a good place to start investigating! Since websockets, websockify and "normal" libp2p connections all use TCP, it's possible that one of these is creating and not releasing sockets. Seems like Kibana is working again, so I'll try to determine from the logs which it is.
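
One way to narrow that down from the host side (a rough sketch; the container name nim-waku-v2 is taken from the docker ps output later in this thread) is to count established connections per local port inside the container's network namespace, since the libp2p, websocket and websockify listeners use different ports:

# Rough sketch: count established TCP connections per local port inside the
# node container's network namespace (host-side ss mostly sees docker-proxy/DNAT traffic).
CPID=$(docker inspect --format '{{.State.Pid}}' nim-waku-v2)
sudo nsenter -t "$CPID" -n ss -tan | grep ESTAB \
  | awk '{n=split($4, a, ":"); print a[n]}' | sort | uniq -c | sort -rn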

jm-clius commented 2 years ago

Although this may be another chicken/egg situation, there are many logs like these:

From Consul: Check socket connection failed: check=nim-waku-v2 error="dial tcp [::1]:30303: connect: connection refused"

From websockify: Failed to connect to node:30303: [Errno 110] Operation timed out

But it may be that users fail to close connections because the node is unresponsive and not that the node is unresponsive because of the open TCP connections. Or both aggravating the other.

jakubgs commented 2 years ago

Yeah, that Consul healthcheck is just a regular TCP handshake check on port 30303:

image

So it seems the process is running, the port is "open", but it's not responding to TCP handshakes:

admin@node-01.gc-us-central1-a.wakuv2.prod:~ % sudo netstat -lpnt | grep 30303
tcp        0      0 0.0.0.0:30303           0.0.0.0:*               LISTEN      1236/dockerd        
admin@node-01.gc-us-central1-a.wakuv2.prod:~ % sudo nmap -Pn -p30303 localhost
Starting Nmap 7.80 ( https://nmap.org ) at 2022-05-16 09:54 UTC
Nmap scan report for localhost (127.0.0.1)
Host is up.

PORT      STATE    SERVICE
30303/tcp filtered unknown

Nmap done: 1 IP address (1 host up) scanned in 15.33 seconds
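
The handshake check Consul performs can also be approximated by hand; a minimal sketch (assuming the OpenBSD netcat that ships with the host):

# Minimal sketch: mimic the Consul TCP check with a plain handshake attempt.
# -z: just connect and close, -w 2: two-second timeout.
nc -vz -w 2 localhost 30303 && echo "handshake OK" || echo "handshake failed or timed out"
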
jakubgs commented 2 years ago

The tcpdump looks like this:

admin@node-01.gc-us-central1-a.wakuv2.prod:~ % sudo tcpdump dst port 30303    
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on br-0e941356a4ee, link-type EN10MB (Ethernet), capture size 262144 bytes
09:56:51.707138 IP 172.17.1.2.33630 > 172.17.1.3.30303: Flags [S], seq 448629536, win 64240, options [mss 1460,sackOK,TS val 824209383 ecr 0,nop,wscale 7], length 0
09:56:52.737485 IP 172.17.1.2.33630 > 172.17.1.3.30303: Flags [S], seq 448629536, win 64240, options [mss 1460,sackOK,TS val 824210414 ecr 0,nop,wscale 7], length 0
09:56:53.121525 IP 172.17.1.2.33626 > 172.17.1.3.30303: Flags [S], seq 4242652308, win 64240, options [mss 1460,sackOK,TS val 824210798 ecr 0,nop,wscale 7], length 0
09:56:54.783757 IP 172.17.1.2.33630 > 172.17.1.3.30303: Flags [S], seq 448629536, win 64240, options [mss 1460,sackOK,TS val 824212460 ecr 0,nop,wscale 7], length 0
09:56:55.425508 IP 172.17.1.2.33618 > 172.17.1.3.30303: Flags [S], seq 1190165544, win 64240, options [mss 1460,sackOK,TS val 824213102 ecr 0,nop,wscale 7], length 0
09:56:57.533765 IP 172.17.1.2.33620 > 172.17.1.3.30303: Flags [S], seq 2408056701, win 64240, options [mss 1460,sackOK,TS val 824215210 ecr 0,nop,wscale 7], length 0
09:56:57.729531 IP 172.17.1.2.33628 > 172.17.1.3.30303: Flags [S], seq 2844005209, win 64240, options [mss 1460,sackOK,TS val 824215406 ecr 0,nop,wscale 7], length 0
09:56:59.033753 IP 172.17.1.2.33630 > 172.17.1.3.30303: Flags [S], seq 448629536, win 64240, options [mss 1460,sackOK,TS val 824216710 ecr 0,nop,wscale 7], length 0
...

So it looks like it's receiving SYN packets but not responding to them in any way.
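
Unanswered SYNs can be a sign that the listener's accept queue is full, in which case the kernel drops incoming SYNs. A rough sketch of how to check that inside the container's network namespace (container name nim-waku-v2 assumed from the docker ps output later in this thread):

# Rough sketch: look for accept-queue pressure and listen-queue drops for the
# listener inside the container's network namespace.
CPID=$(docker inspect --format '{{.State.Pid}}' nim-waku-v2)
# For listening sockets, Recv-Q is the current accept-queue length, Send-Q the backlog limit.
sudo nsenter -t "$CPID" -n ss -ltn
# Cumulative counters for overflowed listen queues / dropped SYNs.
sudo nsenter -t "$CPID" -n netstat -s | grep -i listen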

jm-clius commented 2 years ago

CPU usage on node-01.gc-us-central1-a.wakuv2.prod jumped on the 12th at around 21:20 UTC.

image

This is also visible in the log rate reduction over the same period:

image

No real clues from the actual logs as to what might have happened:

They just show a gap of 20 seconds. The last "Unexpected..." logs can be investigated, but they are a general occurrence.

image

jakubgs commented 2 years ago

That indicates to me that the node process is stuck in some kind of loop, and not handling incoming connections.

jakubgs commented 2 years ago

If I strace the node process it does appear to be doing something:

admin@node-01.gc-us-central1-a.wakuv2.prod:~ % sudo strace -f -p $(pgrep wakunode)
strace: Process 1991222 attached with 3 threads
[pid 1991358] epoll_pwait(14,  <unfinished ...>
[pid 1991269] futex(0x7f9b908d6e34, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 1991358] <... epoll_pwait resumed>[], 64, 5936, NULL, 8) = 0
[pid 1991358] epoll_pwait(14, [], 64, 0, NULL, 8) = 0
[pid 1991358] epoll_pwait(14, [], 64, 0, NULL, 8) = 0
[pid 1991358] epoll_pwait(14, [], 64, 0, NULL, 8) = 0
[pid 1991358] epoll_pwait(14, [], 64, 0, NULL, 8) = 0
[pid 1991358] epoll_pwait(14, [], 64, 0, NULL, 8) = 0
[pid 1991358] sendto(29, "HTTP/1.1 408 Request Timeout\r\nda"..., 132, MSG_NOSIGNAL, NULL, 0) = 132
[pid 1991358] epoll_pwait(14, [], 64, 0, NULL, 8) = 0
[pid 1991358] epoll_ctl(14, EPOLL_CTL_DEL, 29, 0x7f9b90246a50) = 0
[pid 1991358] close(29)                 = 0
[pid 1991358] epoll_pwait(14, [], 64, 0, NULL, 8) = 0
[pid 1991358] epoll_pwait(14, [], 64, 0, NULL, 8) = 0
[pid 1991358] epoll_pwait(14, [], 64, 0, NULL, 8) = 0
[pid 1991358] epoll_pwait(14, [], 64, 0, NULL, 8) = 0
[pid 1991358] epoll_pwait(14, [{EPOLLIN, {u32=13, u64=13}}], 64, -1, NULL, 8) = 1
[pid 1991358] accept(13, 0x7f9b90246b50, [0->16]) = 29
[pid 1991358] fcntl(29, F_GETFL)        = 0x2 (flags O_RDWR)
[pid 1991358] fcntl(29, F_SETFL, O_RDWR|O_NONBLOCK|O_LARGEFILE) = 0
[pid 1991358] epoll_ctl(14, EPOLL_CTL_DEL, 13, 0x7f9b902468d0) = 0
[pid 1991358] epoll_pwait(14, [], 64, 0, NULL, 8) = 0
...
jakubgs commented 2 years ago

There are some calls related to handling of the metrics endpoint:

[pid 1991358] sendto(29, "HTTP/1.1 200 OK\r\nContent-Length:"..., 159, MSG_NOSIGNAL, NULL, 0) = 159
[pid 1991358] sendto(29, "# HELP process_info CPU and memo"..., 22214, MSG_NOSIGNAL, NULL, 0) = 22214

So I can't tell whether the epoll_pwait activity is related to that or to the LibP2P port.
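
One way to tell them apart (sketch): map the descriptors seen in strace (13, 29, ...) to socket inodes and then to listening ports:

# Sketch: resolve strace file descriptors to sockets and ports.
WPID=$(pgrep wakunode)
sudo ls -l /proc/"$WPID"/fd | grep socket        # fd -> socket:[inode]
sudo nsenter -t "$WPID" -n ss -tlnpe             # listener ports with their inodes (ino:...)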

Menduist commented 2 years ago

There is an open issue in nim-websock (used by libp2p for native WS) for a livelock: https://github.com/status-im/nim-websock/issues/94

Though in Nimbus's case it seemed to lock up the node completely. Maybe you can attach with gdb and try to get a stack trace?
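
For example, a non-interactive one-shot (sketch, assuming gdb is installed on the host and symbols resolve through /proc) that keeps the pause short:

# Sketch: one-shot backtrace of all threads, keeping the gdb pause short.
sudo gdb -p "$(pgrep wakunode)" -batch \
  -ex 'set pagination off' \
  -ex 'thread apply all bt' \
  -ex 'info threads'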

jakubgs commented 2 years ago

I don't think this issue is related to the nim-websock library. In this case the LibP2P port is not responding, and websockets in this fleet are provided by websockify.

jm-clius commented 2 years ago

In this case the LibP2P port is not responding

As far as I understand, the livelock would also cause the libp2p ports to not respond. It seems to me the main processing thread may be completely locked.

and Websockets in this fleet are provided by websockify.

The native websockets are also being used by several platforms.

@jakubgs, I have the following suggestion to proceed. LMK what you think.

  1. Attach gdb to the wakunode2 processes on both locked nodes and see if we can get a backtrace (and perhaps some other information, such as variable values and the list of threads).
  2. Afterwards, (temporarily) disable websockify on one of the two locked nodes and restart both - this way we can see whether the node without websockify behaves differently.

This is unless we can think of another way to get more information about the status quo? Logs and metrics don't seem to show anything.

jakubgs commented 2 years ago

That sounds like a good plan. Do you need help attaching gdb to the process? Do we build the node with debugging symbols?

I don't see any issue with stopping Websockify on one host to see what the difference is.

jm-clius commented 2 years ago

That sounds like a good plan. Do you need help attaching gdb to the process?

I don't have ssh access to the prod fleet (only test), so would appreciate your help.

Do we build the node with debugging symbols?

Yes, afaict, in a roundabout manner. I can also see line numbers and files while running gdb on local instances.

I don't see any issue with stopping Websockify on one host to see what the difference is.

Thanks. Agreed.

jakubgs commented 2 years ago

@jm-clius I've granted you SSH access to the wakuv2.prod fleet: https://github.com/status-im/infra-nim-waku/commit/fcd8dc94

jm-clius commented 2 years ago

Doesn't seem to be related to the (known) websock livelock, even though at least one of the "stuck" hosts was receiving in a websock session:

node-01.ac-cn-hongkong-c.wakuv2.prod:

(gdb) bt
#0  0x0000561c18bad212 in nimFrame (s=s@entry=0x7ffea68d2930) at /app/vendor/nimbus-build-system/vendor/Nim/lib/system/excpt.nim:537
#1  0x0000561c18bb45de in rawAlloc__mE4QEVyMvGRVliDWDngZCQ (a=a@entry=0x7fee0165b0b8, requestedSize=requestedSize@entry=176) at /app/vendor/nimbus-build-system/vendor/Nim/lib/system/alloc.nim:743
#2  0x0000561c18bb74f7 in rawNewObj__ehkAaLROrd0Hc9aLROWt1nQ (typ=0x561c198a25e0 <NTI__OfootV6LbsUVkJ5rASpslQ_>, size=size@entry=160, gch=0x7fee0165b050)
    at /app/vendor/nimbus-build-system/vendor/Nim/lib/system/gc.nim:412
#3  0x0000561c18bb8039 in newObj (typ=typ@entry=0x561c198a25e0 <NTI__OfootV6LbsUVkJ5rASpslQ_>, size=size@entry=160) at /app/vendor/nimbus-build-system/vendor/Nim/lib/system/gc.nim:440
#4  0x0000561c18df5068 in readOnce__q7PvVeYRu1Lqbk3mptp43w (rstream=0x7fedffc48a20, pbytes=0x7fedfc399021, nbytes=nbytes@entry=32798) at /app/vendor/nimbus-build-system/vendor/Nim/lib/system.nim:230
#5  0x0000561c1923ff73 in read__DUL0fM2E8GVxH2HekPs9acQ (chronosInternalRetFuture=0x7fedfdf754c8, ClE_0=0x7fedff938b28) at /app/vendor/nim-websock/websock/frame.nim:76
#6  0x0000561c18d37ae4 in futureContinue__a9attitHJ4jxLpQcBdbtgug (fut=fut@entry=0x7fedfdf754c8) at /app/vendor/nim-chronos/chronos/asyncfutures2.nim:365
#7  0x0000561c19240130 in read__tvPyWYPsxsd5IdgsDRQC6Q (frame=0x7fedfe45eac8, reader=0x7fedffc48a20, pbytes=<optimized out>, nbytes=<optimized out>) at /app/vendor/nim-websock/websock/frame.nim:63
#8  0x0000561c19293cae in recv__9cH3INEhxKFNZGvl9asXiZbA (chronosInternalRetFuture=<optimized out>, ClE_0=<optimized out>) at /app/vendor/nim-websock/websock/session.nim:348
#9  0x0000561c18d37ae4 in futureContinue__a9attitHJ4jxLpQcBdbtgug (fut=fut@entry=0x7fedfdf75448) at /app/vendor/nim-chronos/chronos/asyncfutures2.nim:365
#10 0x0000561c192965d3 in recv__eSM1fUZn9aaCTJfqvK9cHhUw (ws=0x7fedffc48cc0, data_0=<optimized out>, size=<optimized out>) at /app/vendor/nim-websock/websock/session.nim:299
#11 0x0000561c192b303c in readOnce__0prVjOrxAtZ1A5D81q9bn9aQ (chronosInternalRetFuture=<optimized out>, ClE_0=<optimized out>) at /app/vendor/nim-libp2p/libp2p/transports/wstransport.nim:70
#12 0x0000561c18d37ae4 in futureContinue__a9attitHJ4jxLpQcBdbtgug (fut=fut@entry=0x7fedfdf753c8) at /app/vendor/nim-chronos/chronos/asyncfutures2.nim:365
#13 0x0000561c192c2313 in readOnce__I0Bt9bpTU5X68zeVjassBuQ (s=0x7fedffaaf868, pbytes=<optimized out>, nbytes=<optimized out>) at /app/vendor/nim-libp2p/libp2p/transports/wstransport.nim:69
#14 0x0000000000ca1e86 in ?? ()
#15 0x0000561c18fbadc4 in readExactly__zR8MqXY9c8DZXGgIsfszb9cw_3 (chronosInternalRetFuture=<optimized out>, ClE_0=<optimized out>) at /app/vendor/nim-libp2p/libp2p/stream/lpstream.nim:169
#16 0x0000561c18d398b8 in futureContinue__XuNTB7fHwBI8KII0qEQaCw (fut=<optimized out>) at /app/vendor/nim-chronos/chronos/asyncfutures2.nim:368
#17 0x0000561c18d713b2 in poll__YNjd8fE6xG8CRNwfLnrx0g_2 () at /app/vendor/nimbus-build-system/vendor/Nim/lib/system/excpt.nim:541
#18 0x0000561c18d8d58b in runForever__YNjd8fE6xG8CRNwfLnrx0g_3 () at /app/vendor/nim-chronos/chronos/asyncloop.nim:1085
#19 0x0000561c196a395b in NimMainModule () at /app/waku/v2/node/wakunode2.nim:1388
#20 0x0000561c196aadc2 in NimMain () at /app/vendor/dnsclient.nim/src/dnsclientpkg/types.nim:402
#21 0x0000561c18a1e64d in main (argc=<optimized out>, args=<optimized out>, env=<optimized out>) at /app/vendor/dnsclient.nim/src/dnsclientpkg/types.nim:409
(gdb) info threads
  Id   Target Id             Frame
* 1    LWP 349217 "wakunode" srcLocImpl__oOJ0tj0dKOJ2xfORWMByvA () at /app/vendor/nim-chronos/chronos/srcloc.nim:35
  2    LWP 349267 "wakunode" 0x00007fee016ce413 in ?? () from target:/lib/ld-musl-x86_64.so.1
  3    LWP 349335 "wakunode" 0x00007fee016ce413 in ?? () from target:/lib/ld-musl-x86_64.so.1

node-01.gc-us-central1-a.wakuv2.prod:

0x00007f9b90a4137b in setjmp () from target:/lib/ld-musl-x86_64.so.1
(gdb) bt
#0  0x00007f9b90a4137b in setjmp () from target:/lib/ld-musl-x86_64.so.1
#1  0x00005579c0f17e6e in recv__9cH3INEhxKFNZGvl9asXiZbA (chronosInternalRetFuture=<optimized out>,
    ClE_0=<optimized out>) at /app/vendor/nimbus-build-system/vendor/Nim/lib/system/excpt.nim:128
#2  0x00005579c09bcae4 in futureContinue__a9attitHJ4jxLpQcBdbtgug (fut=<optimized out>)
    at /app/vendor/nim-chronos/chronos/asyncfutures2.nim:365
#3  0x00005579c09f63b2 in poll__YNjd8fE6xG8CRNwfLnrx0g_2 ()
    at /app/vendor/nimbus-build-system/vendor/Nim/lib/system/excpt.nim:541
#4  0x00005579c0a1258b in runForever__YNjd8fE6xG8CRNwfLnrx0g_3 () at /app/vendor/nim-chronos/chronos/asyncloop.nim:1085
#5  0x00005579c132895b in NimMainModule () at /app/waku/v2/node/wakunode2.nim:1388
#6  0x00005579c132fdc2 in NimMain () at /app/vendor/dnsclient.nim/src/dnsclientpkg/types.nim:402
#7  0x00005579c06a364d in main (argc=<optimized out>, args=<optimized out>, env=<optimized out>)
    at /app/vendor/dnsclient.nim/src/dnsclientpkg/types.nim:409
(gdb) info threads
  Id   Target Id              Frame
* 1    LWP 1991222 "wakunode" 0x00007f9b90a413a3 in setjmp () from target:/lib/ld-musl-x86_64.so.1
  2    LWP 1991269 "wakunode" 0x00007f9b90a50413 in ?? () from target:/lib/ld-musl-x86_64.so.1
  3    LWP 1991358 "wakunode" 0x00007f9b90a50413 in ?? () from target:/lib/ld-musl-x86_64.so.1
jm-clius commented 2 years ago

Bizarrely, CPU usage went down just from attaching and detaching gdb. However, it shot back up.

image

Menduist commented 2 years ago

gdb pauses the program while you're running your commands, so it's expected

It seems that a livelock does come from nim-websock. Would it be possible to capture multiple traces to help me find the issue? Maybe 5 or 10 would be ideal. If it's too much trouble for you, I can ask for access to the server.

jm-clius commented 2 years ago

gdb pauses the program while you're running your commands, so it's expected

Ah, of course. Clueless moment. CPU usage is down while I'm attached, not after detaching. 🤦

It seems that a livelock does come from nim-websock

Not so sure. Only one of the hosts was inside a nim-websock session, and it does seem to progress out of this state when continuing. May just be slow for some reason. Will add more traces.
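
Something like this (sketch) to grab them at loose intervals:

# Sketch: capture a handful of backtraces at loosely random intervals.
for i in $(seq 1 8); do
  sudo gdb -p "$(pgrep wakunode)" -batch -ex 'thread apply all bt' > "bt-$i.txt" 2>&1
  sleep $(( (RANDOM % 60) + 30 ))
done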

jm-clius commented 2 years ago

@Menduist https://gist.github.com/jm-clius/02f5826d143c25ba16c0ed3794cb4c0d <-- 8 stacktraces (at somewhat random intervals) for one of the nodes. Not very scientific, but it seems to be in a websocket session "often" (perhaps slow), not stuck.

jm-clius commented 2 years ago

I've restarted the "stuck" containers on node-01.ac-cn-hongkong-c.wakuv2.prod and node-01.gc-us-central1-a.wakuv2.prod. I've also disabled websockify on node-01.ac-cn-hongkong-c.wakuv2.prod to see if it makes any difference.
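
The roughly equivalent manual steps per host would be (sketch; container names as in the docker ps output later in this thread):

# Sketch: restart the stuck node container; stop websockify on one host only.
docker restart nim-waku-v2
docker stop websockify        # only on node-01.ac-cn-hongkong-c.wakuv2.prod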

Menduist commented 2 years ago

Ok, it seems stuck in readOnce, I'll investigate this PM. Thanks a lot, very helpful!

Menduist commented 2 years ago

I've found an issue in nim-websock and have a reliable repro; I can confirm it's linked to https://github.com/status-im/nim-websock/issues/94. I'll follow up there with more details, and keep this issue posted when it's resolved.

jm-clius commented 2 years ago

I've found an issue in nim-websock and have a reliable repro; I can confirm it's linked to status-im/nim-websock#94. I'll follow up there with more details, and keep this issue posted when it's resolved.

Excellent news! Thanks.

Menduist commented 2 years ago

Opened possible fix: https://github.com/status-im/nim-websock/pull/109

jm-clius commented 2 years ago

Issue occurred again on node-01.ac-cn-hongkong-c.wakuv2.prod - have restarted the container for now. Will patch with the fix when it's available.

jm-clius commented 2 years ago

Issue also now occurred on wakuv2.test fleet (it was only a matter of time). Same action taken - restarted containers.

jm-clius commented 2 years ago

nwaku with fix deployed to wakuv2.test: https://ci.status.im/job/nim-waku/job/deploy-v2-test/339/

Will let it run over the weekend and then consider either a fast release or patch for wakuv2.prod (likely the latter).

jm-clius commented 2 years ago

Although the deploy job finished successfully, it seems as if the Docker containers on wakuv2.test weren't recreated:

$ docker ps -a
CONTAINER ID   IMAGE                                COMMAND                  CREATED      STATUS                  PORTS                                                                                                                                                              NAMES
fab1f47f971c   statusteam/nim-waku:deploy-v2-test   "/usr/bin/wakunode -…"   8 days ago   Up 11 hours (healthy)   0.0.0.0:8000->8000/tcp, 0.0.0.0:8008->8008/tcp, 0.0.0.0:9000->9000/udp, 0.0.0.0:30303->30303/tcp, 127.0.0.1:8545->8545/tcp, 0.0.0.0:30303->30303/udp, 60000/tcp   nim-waku-v2
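
A quick way to confirm whether the running container actually picked up a freshly pulled image (sketch):

# Sketch: compare the image ID the container runs against locally pulled images for that tag.
docker inspect --format '{{.Image}}' nim-waku-v2
docker images --no-trunc --format '{{.ID}}  {{.Repository}}:{{.Tag}}  {{.CreatedAt}}' statusteam/nim-waku
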
jakubgs commented 2 years ago

Looks updated now. The parameter for the Docker tag was changed for some reason:

admin@node-01.do-ams3.wakuv2.test:~ % d
CONTAINER ID   NAMES         IMAGE                                CREATED             STATUS
452f22cf8232   nim-waku-v2   statusteam/nim-waku:deploy-v2-test   About an hour ago   Up About an hour (healthy)
4dbb107698ad   websockify    statusteam/websockify:v0.10.0        7 days ago          Up 3 days
jakubgs commented 1 year ago

I assume this is no longer relevant.