waku-org / nwaku

Waku node and protocol.
Other
183 stars 46 forks source link

bug: running testwaku can hang in some cases of UPnP or nat-pmp networking #2628

Open NagyZoltanPeter opened 3 weeks ago

NagyZoltanPeter commented 3 weeks ago

Problem

I was wondering why I cannot run testwaku (all_test_waku binary) on my local machine. I noticed that itt still running after an hour or two. While analyzing the issue I found this is not the case always. CI job will normally finish in ~20min also for @Ivansete-status he checked it is running well and finishes in 15min. While me and @gabrielmer having the same issue, the process seems to hang in a specific place. Also noticed 20 min gaps appearing between some actions.

TRC 2024-04-23 **21:10**:48.603+02:00 Switch stopped                             topics="libp2p switch" tid=2684446 file=switch.nim:322
.INF 2024-04-23 **21:30**:46.259+02:00 UPnP: added port mapping                   topics="eth net nat" tid=2693539 file=nat.nim:174 externalPort=60000 internalPort=60000 protocol=TCP
INF 2024-04-23 21:30:46.275+02:00 UPnP: added port mapping                   topics="eth net nat" tid=2693539 file=nat.nim:174 externalPort=60000 internalPort=60000 protocol=UDP
INF 2024-04-23 21:30:47.576+02:00 UPnP: added port mapping                   topics="eth net nat" tid=2693568 file=nat.nim:174 externalPort=60000 internalPort=60000 protocol=TCP
INF 2024-04-23 21:30:47.589+02:00 UPnP: added port mapping                   topics="eth net nat" tid=2693568 file=nat.nim:174 externalPort=60000 internalPort=60000 protocol=UDP
INF 2024-04-23 21:30:48.949+02:00 UPnP: added port mapping                   topics="eth net nat" tid=2693569 file=nat.nim:174 externalPort=60000 internalPort=60000 protocol=TCP
INF 2024-04-23 **21:30**:48.964+02:00 UPnP: added port mapping                   topics="eth net nat" tid=2693569 file=nat.nim:174 externalPort=60000 internalPort=60000 protocol=UDP
INF 2024-04-23 **21:50**:46.477+02:00 UPnP: added port mapping                   topics="eth net nat" tid=2693539 file=nat.nim:174 externalPort=60000 internalPort=60000 protocol=TCP
INF 2024-04-23 21:50:46.490+02:00 UPnP: added port mapping                   topics="eth net nat" tid=2693539 file=nat.nim:174 externalPort=60000 internalPort=60000 protocol=UDP

Comparing logs from CI run and from @Ivansete-status I found that in those cases even if nim-eth nat.nim tries to establish either UPnP or nat-pmp port forwarding it does not succeed for some reason (some router / networking setup probably). This it will give up trying to create nat port forwarding. Luckily it will enable all test to run properly. That's a coincidence that such failure case is not handled as error later on.

But for my case UPnP NAT seems working... for a while. At some point it start hanging.

The reason is: Initiated from here: https://github.c om/waku-org/nwaku/blob/c41828150b03dd0dd42ba8004945876f2b94741c/waku/common/utils/nat.nim#L38 to here: https://github.com/status-im/nim-eth/blob/d66a29db7ca4372dba116928f979e92cb7f7661f/eth/net/nat.nim#L289 and will stuck in this loop here: https://github.com/status-im/nim-eth/blob/d66a29db7ca4372dba116928f979e92cb7f7661f/eth/net/nat.nim#L224

In some circumstances of all_test_waku test run it will be in deadlock with joinThread here: https://github.com/status-im/nim-eth/blob/d66a29db7ca4372dba116928f979e92cb7f7661f/eth/net/nat.nim#L240

There is the explicit 20 minutes retrial duration which explains the gaps in the log.

I have just hypothesis for now, as it maybe another thread is started this creating a race condition on the channel that is said to be not to shared between threads.

Further investigation needed!

Side note: I tried to set our test cases to run with --nat:none but that implies to use --extIp flag. So it cannot simulate the situation of CI run where nat network is failed to set up.

Impact

For some of us it is critical that we cannot check our changes before issuing a PR if there is no regression.

Expected behavior

All tests should be able to run with the same configuration everywhere

Logs:

Logs of exiting shows the hang in jointThread infinite wait for the thread to finish.

INF 2024-04-23 22:50:49.795+02:00 UPnP: added port mapping                   topics="eth net nat" tid=2693569 file=nat.nim:174 externalPort=60000 internalPort=60000 protocol=UDP
^CTraceback (most recent call last, using override)
../csu/libc-start.c(392)
../sysdeps/nptl/libc_start_call_main.h(74)
./stdlib/exit.c(143)
./stdlib/exit.c(113)
/home/nzp/dev/status/nwaku/vendor/nim-eth/eth/net/nat.nim(240) stopNatThread
/home/nzp/dev/status/nwaku/vendor/nimbus-build-system/vendor/Nim/lib/system/threads.nim(259) joinThread
./nptl/pthread_join_common.c(105)
./nptl/futex-internal.c(139)
./nptl/futex-internal.c(87)
./nptl/futex-internal.c(57)
/home/nzp/dev/status/nwaku/vendor/nimbus-build-system/vendor/Nim/lib/system/excpt.nim(631) signalHandler

NAT-PMP has also an isse on my machine, even if not hangin' but tests may fail.

TRC 2024-04-24 00:26:17.279+02:00 Switch stopped                             topics="libp2p switch" tid=2817139 file=switch.nim:322
.DBG 2024-04-24 00:26:17.293+02:00 NAT-PMP: deleted port mapping              topics="eth net nat" tid=2817139 file=nat.nim:275 externalPort=60000 internalPort=60000 protocol=TCP
DBG 2024-04-24 00:26:17.303+02:00 NAT-PMP: deleted port mapping              topics="eth net nat" tid=2817139 file=nat.nim:275 externalPort=60000 internalPort=60000 protocol=UDP
WRN 2024-04-24 00:26:17.304+02:00 Failed to stop NAT port mapping renewal thread topics="eth net nat" tid=2817139 file=nat.nim:243 exc="cannot send message; thread died"
DBG 2024-04-24 00:26:17.314+02:00 NAT-PMP: deleted port mapping              topics="eth net nat" tid=2817139 file=nat.nim:275 externalPort=60000 internalPort=60000 protocol=TCP
DBG 2024-04-24 00:26:17.321+02:00 NAT-PMP: deleted port mapping              topics="eth net nat" tid=2817139 file=nat.nim:275 externalPort=60000 internalPort=60000 protocol=UDP
WRN 2024-04-24 00:26:17.322+02:00 Failed to stop NAT port mapping renewal thread topics="eth net nat" tid=2817139 file=nat.nim:243 exc="cannot send message; thread died"
DBG 2024-04-24 00:26:17.332+02:00 NAT-PMP: deleted port mapping              topics="eth net nat" tid=2817139 file=nat.nim:275 externalPort=60000 internalPort=60000 protocol=TCP
DBG 2024-04-24 00:26:17.339+02:00 NAT-PMP: deleted port mapping              topics="eth net nat" tid=2817139 file=nat.nim:275 externalPort=60000 internalPort=60000 protocol=UDP
WRN 2024-04-24 00:26:17.339+02:00 Failed to stop NAT port mapping renewal thread topics="eth net nat" tid=2817139 file=nat.nim:243 exc="cannot send message; thread died"
DBG 2024-04-24 00:26:17.350+02:00 NAT-PMP: deleted port mapping              topics="eth net nat" tid=2817139 file=nat.nim:275 externalPort=60000 internalPort=60000 protocol=TCP
DBG 2024-04-24 00:26:17.358+02:00 NAT-PMP: deleted port mapping              topics="eth net nat" tid=2817139 file=nat.nim:275 externalPort=60000 internalPort=60000 protocol=UDP
(0.06s)
[Summary] 565 tests run (828.3s): 558 OK, 1 FAILED, 6 SKIPPED

CI logs:

2024-04-23T17:01:17.4597794Z DBG 2024-04-23 17:01:17.459+00:00 UPnP                                       topics="eth net nat" tid=10497 file=nat.nim:76 msg="Internet Gateway Device not found. Giving up."
2024-04-23T17:01:17.6897701Z CC: waku_store/test_wakunode_store.nim
2024-04-23T17:01:21.2132450Z DBG 2024-04-23 17:01:21.211+00:00 NAT-PMP                                    topics="eth net nat" tid=10497 file=nat.nim:100 msg="the gateway does not support nat-pmp"