waku-org / nwaku

Waku node and protocol.

bug: underlying websocket connection leak #1186

Closed jm-clius closed 1 year ago

jm-clius commented 2 years ago

Problem

An underlying bug is causing some incoming websocket connections to "leak", resulting in an increase in open file descriptors. It's not yet clear whether this is related to the nim-libp2p, nim-chronos or nim-websock libraries, so this tracking issue will be linked to the underlying issue and relevant repo once it's created.

Impact

Eventually a node exhibiting this leak will reach the maximum number of open file descriptors. This could lead to:

To reproduce

This has been observed on the status.prod and wakuv2.test fleets. The status.test fleet, which runs the same version of nwaku, has not shown the same leak. This is likely because the leak is triggered by very specific (possibly erratic) websocket behaviour exhibited by client(s) connected only to the fleets where the leak has been detected.

Expected behavior

Connections should close properly and the corresponding file descriptor released in all cases.

Screenshots/logs

[screenshot: 7 days of open file descriptor count on status.prod]
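
For reference, a minimal sketch of how the fd count shown above can be sampled directly on the host (Linux only; the process name "wakunode2" is an assumption and may differ per fleet):

import os
import subprocess
import time

def open_fd_count(pid: int) -> int:
    # Each entry in /proc/<pid>/fd is one open file descriptor.
    return len(os.listdir(f"/proc/{pid}/fd"))

pid = int(subprocess.check_output(["pgrep", "-f", "wakunode2"]).split()[0])
while True:
    print(f"{time.strftime('%H:%M:%S')} open fds: {open_fd_count(pid)}")
    time.sleep(60)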

nwaku version/commit hash

latest master on 2022/09/21: commit 11832411

cc @Menduist

Menduist commented 2 years ago

Posting my original investigation here:

16:52:28.586629 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 20 <0.000031>
16:52:35.008421 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 70 <0.000034>
16:52:58.624601 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 20 <0.000030>
16:53:05.009432 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 70 <0.000034>
16:53:09.844583 accept(11, 0x7ffe0be9b9e0, [0 => 28]) = 70 <0.000032> # bye bye 70
16:53:28.662283 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 20 <0.000037>
16:53:35.010786 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 71 <0.000062>
16:53:58.706336 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 20 <0.000036>
16:54:05.011209 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 71 <0.000035>
16:54:09.881759 accept(11, 0x7ffe0be9b9e0, [0 => 28]) = 71 <0.000033> # bye bye 71
16:54:28.742303 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 20 <0.000029>
16:54:35.019056 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 72 <0.000040>
16:54:58.785436 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 20 <0.000029>
16:55:05.020150 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 72 <0.000036>
16:55:09.941550 accept(11, 0x7ffe0be9b9e0, [0 => 28]) = 72 <0.000033> # bye bye 72
16:55:30.210264 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 20 <0.000036>
16:55:35.021139 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 73 <0.000038>
16:55:58.904547 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 20 <0.000030>
16:56:05.022469 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 73 <0.000032>
16:56:09.997573 accept(11, 0x7ffe0be9b9e0, [0 => 28]) = 73 <0.000047> # bye bye 73
16:56:28.941819 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 20 <0.000031>
16:56:35.023149 accept(10, 0x7ffe0be9b9e0, [0 => 28]) = 74 <0.000035>

Every time we accept on fd number 11 (fd 11 = the websocket listener), the returned fd gets stuck. But when I connect manually, using nc:

17:01:41.659418 accept(11, 0x7ffe0be9b9e0, [0 => 28]) = 77 <0.000040>
17:01:45.720612 accept(11, 0x7ffe0be9b9e0, [0 => 28]) = 77 <0.000037>
17:01:49.843990 accept(11, 0x7ffe0be9b9e0, [0 => 28]) = 77 <0.000046>
17:01:53.740448 accept(11, 0x7ffe0be9b9e0, [0 => 28]) = 77 <0.000079>

The fd gets released correctly.

16:54:09.901187 sendto(71, "\25\3\3\0\32\0\0\0\0\0\0\0\3\245OR\242\227\260\234]g\237\210T_\342d\226w\260", 31, MSG_NOSIGNAL, NULL, 0) = -1 EPIPE (Broken pipe)

16:55:09.990054 sendto(72, "\25\3\3\0\32\0\0\0\0\0\0\0\3\17\263\371+\205!\355\6*\275\351\340\225\313\253\232\30\353", 31, MSG_NOSIGNAL, NULL, 0) = -1 EPIPE (Broken pipe)

16:56:10.019808 sendto(73, "\25\3\3\0\32\0\0\0\0\0\0\0\3Yr\320q|nnt\376\371\276h\246n\370;\205\372", 31, MSG_NOSIGNAL, NULL, 0) = -1 EPIPE (Broken pipe) <0.000028>

That's the last sign of life we get for each stuck fd. So somewhere in the chronos - websock - libp2p stack we don't handle EPIPE correctly.
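
To illustrate the kind of client behaviour that could produce traces like these, here is a hypothetical sketch (host, port and payload are placeholders, not the actual fleet setup): a peer opens the TLS/websocket port and then closes with an RST, so that the server's next send on that connection fails.

import socket
import struct

HOST, PORT = "127.0.0.1", 8000  # assumed local test node and wss port

s = socket.create_connection((HOST, PORT))
s.sendall(b"\x16\x03\x01\x00\x01\x00")  # a few bytes so the server starts talking TLS
# SO_LINGER with l_onoff=1, l_linger=0 makes close() send an RST instead of a FIN.
s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
s.close()
# The server's next write on its side of this connection (e.g. the TLS alert
# seen in the sendto() traces above) now fails with EPIPE/ECONNRESET, and if
# that error path never closes the socket, the fd stays open.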

jakubgs commented 2 years ago

Added restarts every 6 hours with a random 3-hour stagger to mitigate this temporarily:

jakubgs commented 2 years ago

Due to periodic restarts the issue might be less noticeable, but it is still there:

[screenshot]

Any updates?

jm-clius commented 2 years ago

With many of the nim-libp2p/websock team participating at Devcon, the fix is targeted for the beginning of December (and release 0.14.0).

jakubgs commented 2 years ago

Okay, then if there is no fix in sight I will simply disable the Consul Websocket healthchecks, because I don't want to get pinged for a known issue that is not currently being researched: https://github.com/status-im/infra-role-nim-waku/commit/b9868446

Menduist commented 1 year ago

Status update: the two PRs above fix about 5 leaks, but there are still more to go! It's now exhibiting a new symptom, where the last sign of life in strace for a leaked fd is recvfrom(22, "", 4096, 0, NULL, NULL) = 0 after a connection has been set up correctly.

So again, somewhere in the stack this is not handled properly. Investigation will continue next week.
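
For clarity, this is the kind of handling that appears to be missing somewhere in the stack, sketched here with plain sockets rather than the actual chronos/websock/libp2p code: when recv() returns an empty buffer (the recvfrom(...) = 0 above), the local end must also be closed, otherwise its fd is never released.

import socket

def handle_connection(conn: socket.socket) -> None:
    try:
        while True:
            data = conn.recv(4096)
            if not data:        # recv returning b"" means the peer closed its end
                break           # fall through to close() so the fd is released
            conn.sendall(data)  # echo, just to have something to do with the data
    except (BrokenPipeError, ConnectionResetError):
        pass                    # EPIPE / ECONNRESET: peer is gone, still close below
    finally:
        conn.close()            # skipping this is exactly the leak described above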

fryorcraken commented 1 year ago

@Menduist: Would it be worth merging and releasing the PRs above so that we can improve the status quo? We could then integrate the changes into nwaku and deploy them on status.prod and status.test.

fryorcraken commented 1 year ago

@jakubgs what is the current restart frequency for status.prod and status.test, please? I tried to check the config but couldn't work it out from there.

jakubgs commented 1 year ago

@fryorcraken It's literally in the second comment on this issue, with commits: https://github.com/status-im/nwaku/issues/1186#issuecomment-1269854340

Menduist commented 1 year ago

Update: it has now been running for 4h without any leaks thanks to further patches (and https://github.com/status-im/nim-chronos/issues/329). I'll leave it running over the weekend to be sure, then clean up the various PRs & merge them.

jakubgs commented 1 year ago

Indeed, it looks very good:

[screenshot]

Great work!

Menduist commented 1 year ago

Merged the libp2p & websock PRs; now waiting on https://github.com/status-im/nim-chronos/pull/330

jm-clius commented 1 year ago

@Menduist are you aware of any progress in merging the underlying fixes?

Menduist commented 1 year ago

No :/