Closed: jm-clius closed this issue 1 year ago.
Posting my original investigation here:
16:52:28.586629 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 20 <0.000031>
16:52:35.008421 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 70 <0.000034>
16:52:58.624601 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 20 <0.000030>
16:53:05.009432 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 70 <0.000034>
16:53:09.844583 accept(11, 0x7ffe0be9b9e0, [0 => 28]) = 70 <0.000032> # bye bye 70
16:53:28.662283 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 20 <0.000037>
16:53:35.010786 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 71 <0.000062>
16:53:58.706336 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 20 <0.000036>
16:54:05.011209 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 71 <0.000035>
16:54:09.881759 accept(11, 0x7ffe0be9b9e0, [0 => 28]) = 71 <0.000033> # bye bye 71
16:54:28.742303 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 20 <0.000029>
16:54:35.019056 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 72 <0.000040>
16:54:58.785436 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 20 <0.000029>
16:55:05.020150 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 72 <0.000036>
16:55:09.941550 accept(11, 0x7ffe0be9b9e0, [0 => 28]) = 72 <0.000033> # bye bye 72
16:55:30.210264 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 20 <0.000036>
16:55:35.021139 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 73 <0.000038>
16:55:58.904547 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 20 <0.000030>
16:56:05.022469 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 73 <0.000032>
16:56:09.997573 accept(11, 0x7ffe0be9b9e0, [0 => 28]) = 73 <0.000047> # bye bye 73
16:56:28.941819 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 20 <0.000031>
16:56:35.023149 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 74 <0.000035>
Every time we accept on fd number 11 (fd 11 = the websocket listener), the returned fd gets stuck. But when I connect manually, using nc:
17:01:41.659418 accept(11, 0x7ffe0be9b9e0, [0 => 28]) = 77 <0.000040>
17:01:45.720612 accept(11, 0x7ffe0be9b9e0, [0 => 28]) = 77 <0.000037>
17:01:49.843990 accept(11, 0x7ffe0be9b9e0, [0 => 28]) = 77 <0.000046>
17:01:53.740448 accept(11, 0x7ffe0be9b9e0, [0 => 28]) = 77 <0.000079>
the fd gets released correctly.
16:54:09.901187 sendto(71, "\25\3\3\0\32\0\0\0\0\0\0\0\3\245OR\242\227\260\234]g\237\210T_\342d\226w\260", 31, MSG_NOSIGNAL, NULL, 0) = -1 EPIPE (Broken pipe)
16:55:09.990054 sendto(72, "\25\3\3\0\32\0\0\0\0\0\0\0\3\17\263\371+\205!\355\6*\275\351\340\225\313\253\232\30\353", 31, MSG_NOSIGNAL, NULL, 0) = -1 EPIPE (Broken pipe)
16:56:10.019808 sendto(73, "\25\3\3\0\32\0\0\0\0\0\0\0\3Yr\320q|nnt\376\371\276h\246n\370;\205\372", 31, MSG_NOSIGNAL, NULL, 0) = -1 EPIPE (Broken pipe) <0.000028>
This is the last sign of life we get for each stuck fd. So somewhere in the chronos / websock / libp2p stack we don't handle EPIPE correctly.
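For illustration only, a minimal Python sketch (not the actual Nim code in chronos/websock/libp2p) of the handling that appears to be missing: when a send fails with EPIPE, the only correct reaction is to close the fd, otherwise it stays open forever, exactly as in the strace above.

```python
import socket

def send_or_close(sock: socket.socket, data: bytes) -> bool:
    """Try to send; if the peer is gone, close our fd instead of leaking it."""
    try:
        sock.sendall(data)
        return True
    except (BrokenPipeError, ConnectionResetError):
        # EPIPE/ECONNRESET: the peer disappeared. If the fd is not closed
        # here, it leaks -- the "bye bye 70/71/72" pattern in the strace.
        sock.close()
        return False

# Simulate the traced situation: the peer vanishes, then we try to send.
a, b = socket.socketpair()
b.close()
assert send_or_close(a, b"\x15\x03\x03\x00\x1a") is False
assert a.fileno() == -1  # fd actually released
```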
Added restarts every 6 hours with a random 3-hour stagger to mitigate this temporarily:
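For reference, one way to express "restart every 6 hours with a random stagger of up to 3 hours", assuming systemd and a hypothetical `nwaku-restart` unit name (the actual infra config may differ):

```ini
# nwaku-restart.timer -- hypothetical unit name, shown as a sketch only
[Unit]
Description=Periodic nwaku restart to mitigate the fd leak

[Timer]
# Fire 6h after the last activation...
OnUnitActiveSec=6h
# ...plus a random delay of up to 3h, so fleet nodes don't restart in lockstep
RandomizedDelaySec=3h

[Install]
WantedBy=timers.target
```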
Due to periodic restarts the issue might be less noticeable, but it is still there:
Any updates?
With many of the nim-libp2p/websock team participating at devcon, the fix is targeted for the beginning of December (and release 0.14.0).
Okay, then if there is no fix in sight I will simply disable the Consul Websocket healthchecks, because I don't want to get pinged for a known issue that is not currently being researched: https://github.com/status-im/infra-role-nim-waku/commit/b9868446
Status update:
The two PRs above fix about 5 leaks, but there are still more to go!
It's now exhibiting a new symptom, where the last sign of life in strace of a leaked FD is:
recvfrom(22, "", 4096, 0, NULL, NULL) = 0
after a connection has been set up correctly.
So again, somewhere in the stack this is not handled properly. Investigation will continue next week.
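Again as illustration only (Python, not the actual Nim code): `recv()` returning an empty buffer is a clean EOF from the peer, and the reading side must close its own fd in response, or the fd leaks just like in the EPIPE case.

```python
import socket

def read_loop(sock: socket.socket) -> None:
    """Drain a socket; recv() returning b'' is EOF: the peer closed,
    so our side must be closed too or the fd leaks."""
    while True:
        data = sock.recv(4096)
        if not data:
            # recvfrom(fd, "", 4096, 0, NULL, NULL) = 0 in strace terms:
            # the peer is done, release our fd.
            sock.close()
            return
        # ... hand `data` to the protocol layer here ...

a, b = socket.socketpair()
b.sendall(b"hello")
b.close()        # clean shutdown by the peer
read_loop(a)     # reads b"hello", hits EOF, closes the fd
assert a.fileno() == -1
```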
@Menduist: Would it be worth merging and releasing the PRs above so that we can improve the status quo? We could then integrate the changes into nwaku and deploy them on status.prod and status.test.
@jakubgs what is the restart frequency at the moment for status.prod and status.test, please? I tried to check the config but couldn't work it out.
@fryorcraken It's literally in the second comment on this issue, with commits: https://github.com/status-im/nwaku/issues/1186#issuecomment-1269854340
Update: it has now been running for 4 hours without any leaks, thanks to further patches (and https://github.com/status-im/nim-chronos/issues/329). Will leave it running over the weekend to be sure, then clean up the various PRs and merge them.
Indeed, it looks very good:
Great work!
Merged libp2p & websock PR, now waiting on https://github.com/status-im/nim-chronos/pull/330
@Menduist are you aware of any progress in merging the underlying fixes?
No :/
Problem
An underlying bug is causing some incoming websocket connections to "leak", resulting in an increase in open file descriptors. It's not yet clear whether this is related to the nim-libp2p, nim-chronos or nim-websock libraries, so this tracking issue will be linked to the underlying issue and relevant repo once it's created.
Impact
Eventually a node exhibiting this leak will reach the maximum number of open file descriptors, at which point it can no longer accept new connections or open files.
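A quick way to watch this number on a running node, sketched in Python against Linux's /proc (Linux-only; equivalent to `ls /proc/<pid>/fd | wc -l`):

```python
import os

def open_fd_count(pid: str = "self") -> int:
    """Number of open file descriptors for a process, via /proc (Linux)."""
    return len(os.listdir(f"/proc/{pid}/fd"))

before = open_fd_count()
extra = [os.pipe() for _ in range(10)]   # deliberately open 20 extra fds
assert open_fd_count() >= before + 20    # a leak shows up as steady growth
for r, w in extra:
    os.close(r)
    os.close(w)
```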
To reproduce
This has been observed on the status.prod and wakuv2.test fleets. The status.test fleet, which runs the same version of nwaku, has not shown the same leak. This is likely because the leak is triggered by very specific (possibly erratic) websocket behaviour exhibited by client(s) connected only to the fleets where the leak was detected.
Expected behavior
Connections should close properly and the corresponding file descriptor be released in all cases.
Screenshots/logs
7 days of open file descriptor count on status.prod
nwaku version/commit hash
latest master on 2022/09/21: commit 11832411
cc @Menduist