waku-org / nwaku

Waku node and protocol.
Other
201 stars 53 forks source link

Fix Windows CI / lightpub tests #515

Closed oskarth closed 3 years ago

oskarth commented 3 years ago

For some reason Windows CI is failing with lightpub wakunode tests. Unclear why. Timing issue?

https://github.com/status-im/nim-waku/pull/509

oskarth commented 3 years ago

Alright, I tried again and can't figure it out. I don't have access to a Windows machine so the current feedback loop for me is Windows CI, which takes 50m to build. Looking at the code I don't see anything obviously wrong, and it works for Mac and Linux. The only odd thing I notice is DBG 2021-05-24 13:05:01.000+08:00 Exception in accept loop, exiting topics="libp2p switch" tid=593080 exc="Transport closed, no more connections!" but we see this in a lot of tests, so I think this is expected.

See https://github.com/status-im/nim-waku/pull/522/checks?check_run_id=2653040184

DBG 2021-05-03 09:11:24.681+00:00 no handlers                                topics="libp2p multistream" tid=1244 conn=16U*qobXj7:608fbe3cc958000000004769 protocol=/vac/waku/relay/2.0.0-beta2
DBG 2021-05-03 09:11:24.681+00:00 decodeMsg: decoded message                 topics="libp2p identify" tid=1244 pubkey=secp25...22]))) addresses=@[/ip4/0.0.0.0/tcp/60010] protocols="@[\"/ipfs/id/1.0.0\", \"/vac/waku/lightpush/2.0.0-alpha1\"]" observable_address=Some(/ip4/127.0.0.1/tcp/60012) proto_version="Some(\"ipfs/0.1.0\")" agent_version="Some(\"nim-libp2p/0.0.1\")"
DBG 2021-05-03 09:11:24.682+00:00 decodeMsg: decoded message                 topics="libp2p identify" tid=1244 pubkey=secp25...36]))) addresses=@[/ip4/0.0.0.0/tcp/60012] protocols="@[\"/ipfs/id/1.0.0\", \"/vac/waku/relay/2.0.0-beta2\", \"/vac/waku/lightpush/2.0.0-alpha1\"]" observable_address=Some(/ip4/127.0.0.1/tcp/50571) proto_version="Some(\"ipfs/0.1.0\")" agent_version="Some(\"nim-libp2p/0.0.1\")"
DBG 2021-05-03 09:11:24.682+00:00 no handlers                                topics="libp2p multistream" tid=1244 conn=16U*qobXj7:608fbe3cc958000000004769 protocol=/vac/waku/relay/2.0.0-beta2
DBG 2021-05-03 09:11:24.683+00:00 Dial successful                            topics="libp2p switch" tid=1244 conn=16U*qobXj7:608fbe3cc958000000004763 peerInfo="(peerId: \"16Uiu2HAmC7qWZFpHL6bZoKPa1869xpTYFtQZ3dWm2pexegqobXj7\", addrs: @[\"/ip4/0.0.0.0/tcp/60012\"], protocols: @[\"/ipfs/id/1.0.0\", \"/vac/waku/relay/2.0.0-beta2\", \"/vac/waku/lightpush/2.0.0-alpha1\"], protoVersion: \"ipfs/0.1.0\", agentVersion: \"nim-libp2p/0.0.1\")"
DBG 2021-05-03 09:11:24.684+00:00 Could not establish send connection        topics="libp2p pubsubpeer" tid=1244 msg="Unable to select sub-protocol @[\"/vac/waku/relay/2.0.0-beta2\", \"/vac/waku/relay/2.0.0-beta2\"]"
DBG 2021-05-03 09:11:24.684+00:00 unsubscribing pubsub peer                  topics="libp2p pubsub" tid=1244 peerId=16U*S5Qrtv
D:/a/nim-waku/nim-waku/vendor/nimbus-build-system/vendor/Nim/lib/pure/unittest.nim(460) main
D:/a/nim-waku/nim-waku/vendor/nimbus-build-system/vendor/Nim/lib/pure/unittest.nim(450) NimMain
D:/a/nim-waku/nim-waku/vendor/nimbus-build-system/vendor/Nim/lib/pure/unittest.nim(441) PreMain
D:/a/nim-waku/nim-waku/vendor/nimbus-build-system/vendor/Nim/lib/pure/unittest.nim(227) PreMainInner
D:/a/nim-waku/nim-waku/vendor/nim-unittest2/unittest2.nim(704) suitePayload
D:/a/nim-waku/nim-waku/vendor/nim-testutils/testutils/unittests.nim(16) runTestX60gensym176890001
D:/a/nim-waku/nim-waku/vendor/nim-chronos/chronos/asyncloop.nim(1092) waitFor
D:/a/nim-waku/nim-waku/vendor/nim-chronos/chronos/asyncloop.nim(281) poll
D:/a/nim-waku/nim-waku/vendor/nim-chronos/chronos/asyncmacro2.nim(74) colonanonymous
D:/a/nim-waku/nim-waku/waku/v2/protocol/waku_lightpush/waku_lightpush.nim(172) request
D:/a/nim-waku/nim-waku/vendor/nim-libp2p/libp2p/stream/lpstream.nim(244) writeLp
D:/a/nim-waku/nim-waku/vendor/nim-chronos/chronos/asyncmacro2.nim(101) write
D:/a/nim-waku/nim-waku/vendor/nim-chronos/chronos/asyncmacro2.nim(74) colonanonymous
D:/a/nim-waku/nim-waku/vendor/nim-libp2p/libp2p/muxers/mplex/lpchannel.nim(171) write
[[reraised from:
D:/a/nim-waku/nim-waku/vendor/nimbus-build-system/vendor/Nim/lib/pure/unittest.nim(460) main
D:/a/nim-waku/nim-waku/vendor/nimbus-build-system/vendor/Nim/lib/pure/unittest.nim(450) NimMain
D:/a/nim-waku/nim-waku/vendor/nimbus-build-system/vendor/Nim/lib/pure/unittest.nim(441) PreMain
D:/a/nim-waku/nim-waku/vendor/nimbus-build-system/vendor/Nim/lib/pure/unittest.nim(227) PreMainInner
D:/a/nim-waku/nim-waku/vendor/nim-unittest2/unittest2.nim(704) suitePayload
D:/a/nim-waku/nim-waku/vendor/nim-testutils/testutils/unittests.nim(16) runTestX60gensym176890001
D:/a/nim-waku/nim-waku/vendor/nim-chronos/chronos/asyncloop.nim(1092) waitFor
D:/a/nim-waku/nim-waku/vendor/nim-chronos/chronos/asyncloop.nim(281) poll
D:/a/nim-waku/nim-waku/vendor/nim-chronos/chronos/asyncmacro2.nim(77) colonanonymous
D:/a/nim-waku/nim-waku/vendor/nim-chronos/chronos/asyncmacro2.nim(367) request
D:/a/nim-waku/nim-waku/vendor/nim-chronos/chronos/asyncfutures2.nim(472) internalCheckComplete
]]
[[reraised from:
D:/a/nim-waku/nim-waku/vendor/nimbus-build-system/vendor/Nim/lib/pure/unittest.nim(460) main
D:/a/nim-waku/nim-waku/vendor/nimbus-build-system/vendor/Nim/lib/pure/unittest.nim(450) NimMain
D:/a/nim-waku/nim-waku/vendor/nimbus-build-system/vendor/Nim/lib/pure/unittest.nim(441) PreMain
D:/a/nim-waku/nim-waku/vendor/nimbus-build-system/vendor/Nim/lib/pure/unittest.nim(227) PreMainInner
D:/a/nim-waku/nim-waku/vendor/nim-unittest2/unittest2.nim(704) suitePayload
D:/a/nim-waku/nim-waku/vendor/nim-testutils/testutils/unittests.nim(16) runTestX60gensym176890001
D:/a/nim-waku/nim-waku/vendor/nim-chronos/chronos/asyncloop.nim(1092) waitFor
D:/a/nim-waku/nim-waku/vendor/nim-chronos/chronos/asyncloop.nim(281) poll
D:/a/nim-waku/nim-waku/vendor/nim-chronos/chronos/asyncmacro2.nim(74) colonanonymous
D:/a/nim-waku/nim-waku/vendor/nim-chronos/chronos/asyncmacro2.nim(367) lightpush
D:/a/nim-waku/nim-waku/vendor/nim-chronos/chronos/asyncfutures2.nim(472) internalCheckComplete
]]
[[reraised from:
D:/a/nim-waku/nim-waku/vendor/nimbus-build-system/vendor/Nim/lib/pure/unittest.nim(460) main
D:/a/nim-waku/nim-waku/vendor/nimbus-build-system/vendor/Nim/lib/pure/unittest.nim(450) NimMain
D:/a/nim-waku/nim-waku/vendor/nimbus-build-system/vendor/Nim/lib/pure/unittest.nim(441) PreMain
D:/a/nim-waku/nim-waku/vendor/nimbus-build-system/vendor/Nim/lib/pure/unittest.nim(227) PreMainInner
D:/a/nim-waku/nim-waku/vendor/nim-unittest2/unittest2.nim(704) suitePayload
D:/a/nim-waku/nim-waku/vendor/nim-testutils/testutils/unittests.nim(16) runTestX60gensym176890001
D:/a/nim-waku/nim-waku/vendor/nim-chronos/chronos/asyncloop.nim(1092) waitFor
D:/a/nim-waku/nim-waku/vendor/nim-chronos/chronos/asyncloop.nim(281) poll
D:/a/nim-waku/nim-waku/vendor/nim-chronos/chronos/asyncmacro2.nim(74) colonanonymous
D:/a/nim-waku/nim-waku/vendor/nim-chronos/chronos/asyncmacro2.nim(367) scenario
D:/a/nim-waku/nim-waku/vendor/nim-chronos/chronos/asyncfutures2.nim(472) internalCheckComplete
]]
[[reraised from:
D:/a/nim-waku/nim-waku/vendor/nimbus-build-system/vendor/Nim/lib/pure/unittest.nim(460) main
D:/a/nim-waku/nim-waku/vendor/nimbus-build-system/vendor/Nim/lib/pure/unittest.nim(450) NimMain
D:/a/nim-waku/nim-waku/vendor/nimbus-build-system/vendor/Nim/lib/pure/unittest.nim(441) PreMain
D:/a/nim-waku/nim-waku/vendor/nimbus-build-system/vendor/Nim/lib/pure/unittest.nim(227) PreMainInner
D:/a/nim-waku/nim-waku/vendor/nim-unittest2/unittest2.nim(704) suitePayload
D:/a/nim-waku/nim-waku/vendor/nim-testutils/testutils/unittests.nim(16) runTestX60gensym176890001
D:/a/nim-waku/nim-waku/vendor/nim-chronos/chronos/asyncfutures2.nim(485) read
D:/a/nim-waku/nim-waku/vendor/nim-chronos/chronos/asyncfutures2.nim(472) internalCheckComplete
]]

    Unhandled exception: Stream Closed! [LPStreamClosedError]
  [FAILED] Lightpush message return success
...
D:\a\nim-waku\nim-waku\vendor\nimbus-build-system\vendor\Nim\lib\system\nimscript.nim(260, 7) Error: unhandled exception: FAILED: build/all_tests_v2 [OSError]
mingw32-make: *** [Makefile:125: test2] Error 1
Error: Process completed with exit code 1.

Suggest we put this in icebox for now, and when it makes sense to revisit probably good to have a Windows environment for better debugging.

@jm-clius let me know if you have any obvious ideas of what you think is going wrong.

jm-clius commented 3 years ago

CI is failing because even a non-relay capable node requires relay mounted to use lightpush. See https://github.com/status-im/nim-waku/pull/518. The way to fix this is to add:

node1.mountRelay(relayMessage=false)

before mounting lightpush.

Why should relay be mounted?

In short: if a non-relay node tries to dial a relay peer (say to perform a lightpush), the relay peer will attempt to add that node as a relay/pubsub peer, which will fail in the protocol negotiation phase:

DBG 2021-05-24 05:35:43.636+00:00 Could not establish send connection        topics="libp2p pubsubpeer" tid=6200 msg="Unable to select sub-protocol @[\"/vac/waku/relay/2.0.0-beta2\", \"/vac/waku/relay/2.0.0-beta2\"]"

This in turn causes the relay peer to remove that peer and all its connections, including the one established for lightpush, from the switch.

DBG 2021-05-24 05:35:43.637+00:00 unsubscribing pubsub peer                  topics="libp2p pubsub" tid=6200 peerId=16U*kyNBQr

When the lightpush peer subsequently tries to send a PushRequest it raises an Unhandled exception: Stream Closed! [LPStreamClosedError]

Why doesn't CI fail on macos/ubuntu then?

A bit more tricky, but comes down to timing differences. On both macos and ubuntu the order in which the identify is received by the two peers involved are different from Windows. The effect is that they manage to send the PushRequest just before the connection gets closed by the failing relay negotiation described above.

DBG 2021-05-24 05:13:10.248+00:00 PushRequest                                topics="wakulightpush" tid=76420 pubSubTopic=test msg="(payload: @[104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100], contentTopic: \"/waku/2/default-content/proto\", version: 0, timestamp: 0.0, proof: @[])"
DBG 2021-05-24 05:13:10.248+00:00 publish                                    topics="wakurelay" tid=76420 pubSubTopic=test message="@[10, 11, 104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 18, 29, 47, 119, 97, 107, 117, 47, 50, 47, 100, 101, 102, 97, 117, 108, 116, 45, 99, 111, 110, 116, 101, 110, 116, 47, 112, 114, 111, 116, 111, 24, 0, 33, 0, 0, 0, 0, 0, 0, 0, 0, 170, 1, 0]"
DBG 2021-05-24 05:13:10.248+00:00 Could not establish send connection        topics="libp2p pubsubpeer" tid=76420 msg="Unable to select sub-protocol @[\"/vac/waku/relay/2.0.0-beta2\", \"/vac/waku/relay/2.0.0-beta2\"]"

Subsequent messages would fail. Windows is just too slow to get that lucky message through.

EbubeUd commented 3 years ago

After merging with master in feature/swap-account-logs I now get these set of errors when the CI runs:

DBG 2021-05-24 16:40:29.256+00:00 Processing message                         tid=33954 address=127.0.0.1:55026 line="{\"jsonrpc\":\"2.0\",\"method\":\"waku_hasKeyPair\",\"params\":[\"67c0e2c8de43e9a4a1e2362040a3310f560317ba9cee29e7cea7e4feb46f0a06\"],\"id\":17}"
DBG 2021-05-24 16:40:29.256+00:00 Processing message                         tid=33954 address=127.0.0.1:55026 line="{\"jsonrpc\":\"2.0\",\"method\":\"waku_deleteKeyPair\",\"params\":[\"67c0e2c8de43e9a4a1e2362040a3310f560317ba9cee29e7cea7e4feb46f0a06\"],\"id\":18}"
DBG 2021-05-24 16:40:29.256+00:00 Processing message                         tid=33954 address=127.0.0.1:55026 line="{\"jsonrpc\":\"2.0\",\"method\":\"waku_hasKeyPair\",\"params\":[\"67c0e2c8de43e9a4a1e2362040a3310f560317ba9cee29e7cea7e4feb46f0a06\"],\"id\":19}"
DBG 2021-05-24 16:40:29.256+00:00 Processing message                         tid=33954 address=127.0.0.1:55026 line="{\"jsonrpc\":\"2.0\",\"method\":\"waku_deleteKeyPair\",\"params\":[\"67c0e2c8de43e9a4a1e2362040a3310f560317ba9cee29e7cea7e4feb46f0a06\"],\"id\":20}"
DBG 2021-05-24 16:40:29.256+00:00 Error occurred within RPC                  tid=33954 methodName=waku_deleteKeyPair err="Invalid key id"
  [OK] waku asymKey tests
DBG 2021-05-24 16:40:29.256+00:00 Processing message                         tid=33954 address=127.0.0.1:55026 line="{\"jsonrpc\":\"2.0\",\"method\":\"waku_newSymKey\",\"params\":[],\"id\":21}"
DBG 2021-05-24 16:40:29.256+00:00 Processing message                         tid=33954 address=127.0.0.1:55026 line="{\"jsonrpc\":\"2.0\",\"method\":\"waku_hasSymKey\",\"params\":[\"d68e93c9e1966d569feebe0991face4ac91da803a685eb1f95ba2898cd87147d\"],\"id\":22}"
DBG 2021-05-24 16:40:29.256+00:00 Processing message                         tid=33954 address=127.0.0.1:55026 line="{\"jsonrpc\":\"2.0\",\"method\":\"waku_deleteSymKey\",\"params\":[\"d68e93c9e1966d569feebe0991face4ac91da803a685eb1f95ba2898cd87147d\"],\"id\":23}"
DBG 2021-05-24 16:40:29.257+00:00 Processing message                         tid=33954 address=127.0.0.1:55026 line="{\"jsonrpc\":\"2.0\",\"method\":\"waku_hasSymKey\",\"params\":[\"d68e93c9e1966d569feebe0991face4ac91da803a685eb1f95ba2898cd87147d\"],\"id\":24}"
DBG 2021-05-24 16:40:29.257+00:00 Processing message                         tid=33954 address=127.0.0.1:55026 line="{\"jsonrpc\":\"2.0\",\"method\":\"waku_deleteSymKey\",\"params\":[\"d68e93c9e1966d569feebe0991face4ac91da803a685eb1f95ba2898cd87147d\"],\"id\":25}"
DBG 2021-05-24 16:40:29.257+00:00 Error occurred within RPC                  tid=33954 methodName=waku_deleteSymKey err="Invalid key id"

Could it be related to this issue?

oskarth commented 3 years ago

Thanks @jm-clius, nice catch! I should've been able to figure that out looking at the logs a bit more carefully. The timing issue makes sense too.

jm-clius commented 3 years ago

Could it be related to this issue?

@EbubeUd doesn't seem to be (directly) related. Might have been a temporary resource issue, so you could try to re-run those jobs if you haven't already. In any case, let's move this discussion to that PR then.