cc @jakubgs. Moved from the nwaku repo.
Something happened at midnight:
We can see a major drop in free RAM:
This is on node-01.ac-cn-hongkong-c.wakuv2.prod.
We can see the same thing on node-01.gc-us-central1-a.wakuv2.prod:
But which came first, the chicken or the egg?
If we look at 7 days we can see some other spikes, so this is not new:
But we can see a steady growth in allocated TCP sockets:
That's on node-01.gc-us-central1-a.wakuv2.prod.
In comparison, the number of sockets used by node-01.do-ams3.wakuv2.prod - which is apparently fine - is lower than 100:
Could this be the reason, @jm-clius?
Also, on node-01.do-ams3.wakuv2.prod we can see that most connections are normally closed by users:
But on node-01.gc-us-central1-a.wakuv2.prod we can see that connections stopped being normally closed on 2022-05-12:
In comparison, the number of sockets used by node-01.do-ams3.wakuv2.prod - which is apparently fine - is lower than 100:
Thanks, this seems like a good place to start investigating! Since websockets, websockify and "normal" libp2p connections all use TCP, it's possible that one of these is creating and not releasing sockets. Seems like Kibana is working again, so I'll try to determine which is happening from logs.
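To narrow down which listener is leaking, it might also help to count established TCP connections per process and per port directly on the host. A rough sketch (assuming 30303 is the libp2p TCP port and 8000 the websocket port, as the docker port mappings later in this thread suggest):

# Established TCP connections grouped by owning process (needs root for -p)
sudo ss -tnp state established | awk 'NR>1 {print $NF}' | sort | uniq -c | sort -rn | head

# Totals for the ports of interest
sudo ss -tn state established '( sport = :30303 or sport = :8000 )' | tail -n +2 | wc -l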
Although this may be another chicken/egg situation, there are many logs like these:
From consul:
Check socket connection failed: check=nim-waku-v2 error="dial tcp [::1]:30303: connect: connection refused
From websockify:
Failed to connect to node:30303: [Errno 110] Operation timed out
But it may be that users fail to close connections because the node is unresponsive, and not that the node is unresponsive because of the open TCP connections. Or both, each aggravating the other.
Yeah, that Consul healthcheck is just a regular TCP handshake check on port 30303:
So it seems the process is running, the port is "open", but it's not responding to TCP handshakes:
admin@node-01.gc-us-central1-a.wakuv2.prod:~ % sudo netstat -lpnt | grep 30303
tcp 0 0 0.0.0.0:30303 0.0.0.0:* LISTEN 1236/dockerd
admin@node-01.gc-us-central1-a.wakuv2.prod:~ % sudo nmap -Pn -p30303 localhost
Starting Nmap 7.80 ( https://nmap.org ) at 2022-05-16 09:54 UTC
Nmap scan report for localhost (127.0.0.1)
Host is up.
PORT STATE SERVICE
30303/tcp filtered unknown
Nmap done: 1 IP address (1 host up) scanned in 15.33 seconds
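As an aside, the Consul check can be approximated from the shell to confirm the symptom independently (a sketch, assuming the check is just a plain TCP connect to port 30303 with a timeout, as the "dial tcp" error above suggests):

# Attempt a bare TCP handshake with a 5-second timeout
timeout 5 bash -c 'exec 3<>/dev/tcp/localhost/30303' && echo 'handshake OK' || echo 'handshake failed or timed out'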
The tcpdump looks like this:
admin@node-01.gc-us-central1-a.wakuv2.prod:~ % sudo tcpdump dst port 30303
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on br-0e941356a4ee, link-type EN10MB (Ethernet), capture size 262144 bytes
09:56:51.707138 IP 172.17.1.2.33630 > 172.17.1.3.30303: Flags [S], seq 448629536, win 64240, options [mss 1460,sackOK,TS val 824209383 ecr 0,nop,wscale 7], length 0
09:56:52.737485 IP 172.17.1.2.33630 > 172.17.1.3.30303: Flags [S], seq 448629536, win 64240, options [mss 1460,sackOK,TS val 824210414 ecr 0,nop,wscale 7], length 0
09:56:53.121525 IP 172.17.1.2.33626 > 172.17.1.3.30303: Flags [S], seq 4242652308, win 64240, options [mss 1460,sackOK,TS val 824210798 ecr 0,nop,wscale 7], length 0
09:56:54.783757 IP 172.17.1.2.33630 > 172.17.1.3.30303: Flags [S], seq 448629536, win 64240, options [mss 1460,sackOK,TS val 824212460 ecr 0,nop,wscale 7], length 0
09:56:55.425508 IP 172.17.1.2.33618 > 172.17.1.3.30303: Flags [S], seq 1190165544, win 64240, options [mss 1460,sackOK,TS val 824213102 ecr 0,nop,wscale 7], length 0
09:56:57.533765 IP 172.17.1.2.33620 > 172.17.1.3.30303: Flags [S], seq 2408056701, win 64240, options [mss 1460,sackOK,TS val 824215210 ecr 0,nop,wscale 7], length 0
09:56:57.729531 IP 172.17.1.2.33628 > 172.17.1.3.30303: Flags [S], seq 2844005209, win 64240, options [mss 1460,sackOK,TS val 824215406 ecr 0,nop,wscale 7], length 0
09:56:59.033753 IP 172.17.1.2.33630 > 172.17.1.3.30303: Flags [S], seq 448629536, win 64240, options [mss 1460,sackOK,TS val 824216710 ecr 0,nop,wscale 7], length 0
...
So it looks like it's receiving SYN packets but not responding to them in any way.
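Since the socket is in LISTEN but SYNs go unanswered, one thing worth checking is whether the accept queue has filled up. A sketch (note that on the host port 30303 is held by dockerd/docker-proxy, so the interesting socket is the one inside the container's network namespace):

# For LISTEN sockets, Recv-Q is the current accept-queue length and Send-Q its limit
sudo ss -lnt 'sport = :30303'

# Kernel counters for dropped SYNs and accept-queue overflows
netstat -s | grep -iE 'overflow|SYNs to LISTEN'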
CPU usage on node-01.gc-us-central1-a.wakuv2.prod jumped on the 12th at around 21:20 UTC.
This is also visible in the log rate reduction over the same period:
No real clues from the actual logs as to what might have happened:
It just shows a gap of 20 seconds. The last "Unexpected..." logs could be investigated, but they are a general occurrence.
That indicates to me that the node process is stuck in some kind of loop, and not handling incoming connections.
If I strace the node process, it does appear to be doing something:
admin@node-01.gc-us-central1-a.wakuv2.prod:~ % sudo strace -f -p $(pgrep wakunode)
strace: Process 1991222 attached with 3 threads
[pid 1991358] epoll_pwait(14, <unfinished ...>
[pid 1991269] futex(0x7f9b908d6e34, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 1991358] <... epoll_pwait resumed>[], 64, 5936, NULL, 8) = 0
[pid 1991358] epoll_pwait(14, [], 64, 0, NULL, 8) = 0
[pid 1991358] epoll_pwait(14, [], 64, 0, NULL, 8) = 0
[pid 1991358] epoll_pwait(14, [], 64, 0, NULL, 8) = 0
[pid 1991358] epoll_pwait(14, [], 64, 0, NULL, 8) = 0
[pid 1991358] epoll_pwait(14, [], 64, 0, NULL, 8) = 0
[pid 1991358] sendto(29, "HTTP/1.1 408 Request Timeout\r\nda"..., 132, MSG_NOSIGNAL, NULL, 0) = 132
[pid 1991358] epoll_pwait(14, [], 64, 0, NULL, 8) = 0
[pid 1991358] epoll_ctl(14, EPOLL_CTL_DEL, 29, 0x7f9b90246a50) = 0
[pid 1991358] close(29) = 0
[pid 1991358] epoll_pwait(14, [], 64, 0, NULL, 8) = 0
[pid 1991358] epoll_pwait(14, [], 64, 0, NULL, 8) = 0
[pid 1991358] epoll_pwait(14, [], 64, 0, NULL, 8) = 0
[pid 1991358] epoll_pwait(14, [], 64, 0, NULL, 8) = 0
[pid 1991358] epoll_pwait(14, [{EPOLLIN, {u32=13, u64=13}}], 64, -1, NULL, 8) = 1
[pid 1991358] accept(13, 0x7f9b90246b50, [0->16]) = 29
[pid 1991358] fcntl(29, F_GETFL) = 0x2 (flags O_RDWR)
[pid 1991358] fcntl(29, F_SETFL, O_RDWR|O_NONBLOCK|O_LARGEFILE) = 0
[pid 1991358] epoll_ctl(14, EPOLL_CTL_DEL, 13, 0x7f9b902468d0) = 0
[pid 1991358] epoll_pwait(14, [], 64, 0, NULL, 8) = 0
...
There are some calls related to handling of the metrics endpoint:
[pid 1991358] sendto(29, "HTTP/1.1 200 OK\r\nContent-Length:"..., 159, MSG_NOSIGNAL, NULL, 0) = 159
[pid 1991358] sendto(29, "# HELP process_info CPU and memo"..., 22214, MSG_NOSIGNAL, NULL, 0) = 22214
So I can't tell if the epoll_pwait is related to that or the LibP2P port.
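One way to disambiguate could be to resolve the file descriptors from the strace output (13, 14, 29) to their underlying sockets, which would show whether fd 13 is the metrics listener or the libp2p one. A sketch:

PID=$(pgrep wakunode)

# fd 14 should be the epoll instance; 13 and 29 should resolve to socket inodes
sudo ls -l /proc/$PID/fd/13 /proc/$PID/fd/14 /proc/$PID/fd/29

# Map those descriptors to local addresses and ports
sudo lsof -nP -p $PID -a -d 13,29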
There is an open issue in nim-websock (used by libp2p for native WS) for a livelock: https://github.com/status-im/nim-websock/issues/94
Though in Nimbus's case it seemed to lock up the node completely. Maybe you can attach with gdb and try to get a stack trace?
I don't think this issue is related to the nim-websock library. In this case the LibP2P port is not responding, and Websockets in this fleet are provided by websockify.
In this case the LibP2P port is not responding
As far as I understand, the livelock would also cause the libp2p ports to not respond. It seems to me the main processing thread may be completely locked.
and Websockets in this fleet are provided by websockify.
The native websockets are also being used by several platforms.
@jakubgs, I have the following suggestion to proceed. LMK what you think.
Attach gdb to the wakunode2 processes on both the locked nodes and see if we can get a backtrace (and perhaps some other information, such as variable information and a list of threads?). This is unless we can think of another way to get more information about the status quo? Logs and metrics don't seem to show anything.
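For reference, a non-interactive way to grab such a backtrace without keeping the process paused for long could look like this (a sketch; it assumes gdb is installed on the hosts and that the binary carries debug info):

# Attach briefly, dump backtraces for all threads, then detach so the node continues
sudo gdb -p "$(pgrep wakunode)" -batch -ex 'set pagination off' -ex 'thread apply all bt' -ex detach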
That sounds like a good plan. Do you need help attaching gdb to the process? Do we build the node with debugging symbols?
I don't see any issue with stopping Websockify on one host to see what's the difference.
That sounds like a good plan. Do you need help attaching gdb to the process?
I don't have ssh access to the prod fleet (only test), so would appreciate your help.
Do we build the node with debugging symbols?
Yes, afaict in a roundabout manner. Can also see line numbers and files while running gdb on local instances.
I don't see any issue with stopping Websockify on one host to see what's the difference.
Thanks. Agreed.
@jm-clius I've granted you SSH access to the wakuv2.prod fleet: https://github.com/status-im/infra-nim-waku/commit/fcd8dc94
Doesn't seem to be related to the (known) websock livelock, even though at least one of the "stuck" hosts was receiving in a websock session:
node-01.ac-cn-hongkong-c.wakuv2.prod:
(gdb) bt
#0 0x0000561c18bad212 in nimFrame (s=s@entry=0x7ffea68d2930) at /app/vendor/nimbus-build-system/vendor/Nim/lib/system/excpt.nim:537
#1 0x0000561c18bb45de in rawAlloc__mE4QEVyMvGRVliDWDngZCQ (a=a@entry=0x7fee0165b0b8, requestedSize=requestedSize@entry=176) at /app/vendor/nimbus-build-system/vendor/Nim/lib/system/alloc.nim:743
#2 0x0000561c18bb74f7 in rawNewObj__ehkAaLROrd0Hc9aLROWt1nQ (typ=0x561c198a25e0 <NTI__OfootV6LbsUVkJ5rASpslQ_>, size=size@entry=160, gch=0x7fee0165b050)
at /app/vendor/nimbus-build-system/vendor/Nim/lib/system/gc.nim:412
#3 0x0000561c18bb8039 in newObj (typ=typ@entry=0x561c198a25e0 <NTI__OfootV6LbsUVkJ5rASpslQ_>, size=size@entry=160) at /app/vendor/nimbus-build-system/vendor/Nim/lib/system/gc.nim:440
#4 0x0000561c18df5068 in readOnce__q7PvVeYRu1Lqbk3mptp43w (rstream=0x7fedffc48a20, pbytes=0x7fedfc399021, nbytes=nbytes@entry=32798) at /app/vendor/nimbus-build-system/vendor/Nim/lib/system.nim:230
#5 0x0000561c1923ff73 in read__DUL0fM2E8GVxH2HekPs9acQ (chronosInternalRetFuture=0x7fedfdf754c8, ClE_0=0x7fedff938b28) at /app/vendor/nim-websock/websock/frame.nim:76
#6 0x0000561c18d37ae4 in futureContinue__a9attitHJ4jxLpQcBdbtgug (fut=fut@entry=0x7fedfdf754c8) at /app/vendor/nim-chronos/chronos/asyncfutures2.nim:365
#7 0x0000561c19240130 in read__tvPyWYPsxsd5IdgsDRQC6Q (frame=0x7fedfe45eac8, reader=0x7fedffc48a20, pbytes=<optimized out>, nbytes=<optimized out>) at /app/vendor/nim-websock/websock/frame.nim:63
#8 0x0000561c19293cae in recv__9cH3INEhxKFNZGvl9asXiZbA (chronosInternalRetFuture=<optimized out>, ClE_0=<optimized out>) at /app/vendor/nim-websock/websock/session.nim:348
#9 0x0000561c18d37ae4 in futureContinue__a9attitHJ4jxLpQcBdbtgug (fut=fut@entry=0x7fedfdf75448) at /app/vendor/nim-chronos/chronos/asyncfutures2.nim:365
#10 0x0000561c192965d3 in recv__eSM1fUZn9aaCTJfqvK9cHhUw (ws=0x7fedffc48cc0, data_0=<optimized out>, size=<optimized out>) at /app/vendor/nim-websock/websock/session.nim:299
#11 0x0000561c192b303c in readOnce__0prVjOrxAtZ1A5D81q9bn9aQ (chronosInternalRetFuture=<optimized out>, ClE_0=<optimized out>) at /app/vendor/nim-libp2p/libp2p/transports/wstransport.nim:70
#12 0x0000561c18d37ae4 in futureContinue__a9attitHJ4jxLpQcBdbtgug (fut=fut@entry=0x7fedfdf753c8) at /app/vendor/nim-chronos/chronos/asyncfutures2.nim:365
#13 0x0000561c192c2313 in readOnce__I0Bt9bpTU5X68zeVjassBuQ (s=0x7fedffaaf868, pbytes=<optimized out>, nbytes=<optimized out>) at /app/vendor/nim-libp2p/libp2p/transports/wstransport.nim:69
#14 0x0000000000ca1e86 in ?? ()
#15 0x0000561c18fbadc4 in readExactly__zR8MqXY9c8DZXGgIsfszb9cw_3 (chronosInternalRetFuture=<optimized out>, ClE_0=<optimized out>) at /app/vendor/nim-libp2p/libp2p/stream/lpstream.nim:169
#16 0x0000561c18d398b8 in futureContinue__XuNTB7fHwBI8KII0qEQaCw (fut=<optimized out>) at /app/vendor/nim-chronos/chronos/asyncfutures2.nim:368
#17 0x0000561c18d713b2 in poll__YNjd8fE6xG8CRNwfLnrx0g_2 () at /app/vendor/nimbus-build-system/vendor/Nim/lib/system/excpt.nim:541
#18 0x0000561c18d8d58b in runForever__YNjd8fE6xG8CRNwfLnrx0g_3 () at /app/vendor/nim-chronos/chronos/asyncloop.nim:1085
#19 0x0000561c196a395b in NimMainModule () at /app/waku/v2/node/wakunode2.nim:1388
#20 0x0000561c196aadc2 in NimMain () at /app/vendor/dnsclient.nim/src/dnsclientpkg/types.nim:402
#21 0x0000561c18a1e64d in main (argc=<optimized out>, args=<optimized out>, env=<optimized out>) at /app/vendor/dnsclient.nim/src/dnsclientpkg/types.nim:409
(gdb) info threads
Id Target Id Frame
* 1 LWP 349217 "wakunode" srcLocImpl__oOJ0tj0dKOJ2xfORWMByvA () at /app/vendor/nim-chronos/chronos/srcloc.nim:35
2 LWP 349267 "wakunode" 0x00007fee016ce413 in ?? () from target:/lib/ld-musl-x86_64.so.1
3 LWP 349335 "wakunode" 0x00007fee016ce413 in ?? () from target:/lib/ld-musl-x86_64.so.1
node-01.gc-us-central1-a.wakuv2.prod:
0x00007f9b90a4137b in setjmp () from target:/lib/ld-musl-x86_64.so.1
(gdb) bt
#0 0x00007f9b90a4137b in setjmp () from target:/lib/ld-musl-x86_64.so.1
#1 0x00005579c0f17e6e in recv__9cH3INEhxKFNZGvl9asXiZbA (chronosInternalRetFuture=<optimized out>,
ClE_0=<optimized out>) at /app/vendor/nimbus-build-system/vendor/Nim/lib/system/excpt.nim:128
#2 0x00005579c09bcae4 in futureContinue__a9attitHJ4jxLpQcBdbtgug (fut=<optimized out>)
at /app/vendor/nim-chronos/chronos/asyncfutures2.nim:365
#3 0x00005579c09f63b2 in poll__YNjd8fE6xG8CRNwfLnrx0g_2 ()
at /app/vendor/nimbus-build-system/vendor/Nim/lib/system/excpt.nim:541
#4 0x00005579c0a1258b in runForever__YNjd8fE6xG8CRNwfLnrx0g_3 () at /app/vendor/nim-chronos/chronos/asyncloop.nim:1085
#5 0x00005579c132895b in NimMainModule () at /app/waku/v2/node/wakunode2.nim:1388
#6 0x00005579c132fdc2 in NimMain () at /app/vendor/dnsclient.nim/src/dnsclientpkg/types.nim:402
#7 0x00005579c06a364d in main (argc=<optimized out>, args=<optimized out>, env=<optimized out>)
at /app/vendor/dnsclient.nim/src/dnsclientpkg/types.nim:409
(gdb) info threads
Id Target Id Frame
* 1 LWP 1991222 "wakunode" 0x00007f9b90a413a3 in setjmp () from target:/lib/ld-musl-x86_64.so.1
2 LWP 1991269 "wakunode" 0x00007f9b90a50413 in ?? () from target:/lib/ld-musl-x86_64.so.1
3 LWP 1991358 "wakunode" 0x00007f9b90a50413 in ?? () from target:/lib/ld-musl-x86_64.so.1
Bizarrely, CPU usage went down just by attaching and detaching gdb. However, it shot back up.
gdb pauses the program while you're running your commands, so it's expected
It seems that a livelock does come from nim-websock. Would it be possible to capture multiple traces to help me find the issue? Maybe 5 or 10 would be ideal. If it's too much trouble for you, I can ask for access to the server.
gdb pauses the program while you're running your commands, so it's expected
Ah, of course. Clueless moment. CPU usage is down while I'm attached, not after detaching. 🤦
It seems that a livelock does come from nim-websock
Not so sure. Only one of the hosts was inside a nim-websock session, and it does seem to progress out of this state when continuing. May just be slow for some reason. Will add more traces.
@Menduist https://gist.github.com/jm-clius/02f5826d143c25ba16c0ed3794cb4c0d <-- 8 stacktraces (at somewhat random intervals) for one of the nodes. Not very scientific, but it seems to be in a websocket session "often" (perhaps slow), not stuck.
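If more samples are needed later, a loop along these lines could automate the capture (hypothetical interval and count; each attach briefly pauses the node, as noted above):

# Take 10 backtrace samples roughly 30 seconds apart
for i in $(seq 1 10); do
  sudo gdb -p "$(pgrep wakunode)" -batch -ex 'thread apply all bt' > "bt-$(date +%s).txt" 2>&1
  sleep 30
done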
I've restarted the "stuck" containers on node-01.ac-cn-hongkong-c.wakuv2.prod and node-01.gc-us-central1-a.wakuv2.prod.
I've also disabled websockify on node-01.ac-cn-hongkong-c.wakuv2.prod to see if it makes any difference.
Ok, it seems stuck in readOnce, I'll investigate this PM.
Thanks a lot, very helpful!
I've found an issue in nim-websock and have a reliable repro, can confirm it's linked to https://github.com/status-im/nim-websock/issues/94 I'll follow up there with more details, and keep this issue posted when it's resolved
Excellent news! Thanks.
Opened possible fix: https://github.com/status-im/nim-websock/pull/109
Issue occurred again on node-01.ac-cn-hongkong-c.wakuv2.prod - have restarted the container for now. Will patch with the fix when it's available.
Issue has now also occurred on the wakuv2.test fleet (it was only a matter of time). Same action taken - restarted containers.
nwaku with the fix deployed to wakuv2.test: https://ci.status.im/job/nim-waku/job/deploy-v2-test/339/
Will let it run over the weekend and then consider either a fast release or a patch for wakuv2.prod (likely the latter).
Although the deploy job finished successfully, it seems as if the docker containers on wakuv2.test weren't recreated:
$ docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
fab1f47f971c statusteam/nim-waku:deploy-v2-test "/usr/bin/wakunode -…" 8 days ago Up 11 hours (healthy) 0.0.0.0:8000->8000/tcp, 0.0.0.0:8008->8008/tcp, 0.0.0.0:9000->9000/udp, 0.0.0.0:30303->30303/tcp, 127.0.0.1:8545->8545/tcp, 0.0.0.0:30303->30303/udp, 60000/tcp nim-waku-v2
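A quick way to confirm whether a running container was actually recreated from the freshly pulled image is to compare image IDs (a sketch, using the container and tag names shown above):

# Image the container is running vs. the image the tag currently points to;
# if they differ, the container was not recreated after the pull
docker inspect --format '{{.Image}}' nim-waku-v2
docker inspect --format '{{.Id}}' statusteam/nim-waku:deploy-v2-test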
Looks updated now; the parameter for the docker tag was changed for some reason:
admin@node-01.do-ams3.wakuv2.test:~ % d
CONTAINER ID NAMES IMAGE CREATED STATUS
452f22cf8232 nim-waku-v2 statusteam/nim-waku:deploy-v2-test About an hour ago Up About an hour (healthy)
4dbb107698ad websockify statusteam/websockify:v0.10.0 7 days ago Up 3 days
I assume this is no longer relevant.
The following is observed on the wakuv2.prod fleet (2022-05-16 8 UTC):
CPU usage on node-01.ac-cn-hongkong-c.wakuv2.prod and node-01.gc-us-central1-a.wakuv2.prod has increased to 30% and 60% respectively.
node-01.do-ams3.wakuv2.prod remains seemingly unaffected (for now).