Open 0pcom opened 2 months ago
i've devised a simple way to reproduce the issue here.
create a transport to the same visor via dmsg
skywire cli tp add -t dmsg $(skywire-cli visor pk)
Copy the transport ID in place of the existing transport id in the following shell command
_mypk="$(skywire cli visor pk)"; for ((i=1; i<=10000; i++)); do skywire cli tp add -t dmsg $_mypk ; skywire cli tp rm -i 5bacb16f-79c4-0d39-aaff-9a82ced5192a ; done
Before all the loops have completed, the visor is no longer able to accept transports. It may take over an hour to reach this state.
in one such test, after creating and removing the transport 2834 times, it was no longer possible for the visor to accept transports. on another test, it happened in fewer than 300 loops. In another test, < 700 loops. It does not appear to happen consistently at the same number of loops but it does always happen eventually
_mypk="$(skywire cli visor pk)"; for ((i=1; i<=10000; i++)); do skywire cli tp add -t dmsg $_mypk ; skywire cli tp rm -i 5bacb16f-79c4-0d39-aaff-9a82ced5192a ; done
Running the same test again after having increased the accept buffer size to 50 from 20.
The listener accept chan maxed
error no longer occurs but the visor does indeed fail to accept transports around 334 cycles of creating and deleting the same transport.
$ skywire cli tp add -t dmsg 0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c
[2024-04-11T09:36:49.515404242-05:00] FATAL [skywire-cli]: Failed to establish dmsg transport: save transport: settlement handshake failed: context deadline exceeded
After re-building to develop branch, the error is now consistently context deadline exceeded
here is the visor debug logging of that
[2024-04-11T10:09:02.326362998-05:00] WARN [tp:032327]: Failed to read packet, closing transport error="EOF" src="read_loop"
[2024-04-11T10:09:02.455373695-05:00] DEBUG readPacket [tp:032327]: Failed to read packet header. error="EOF"
[2024-04-11T10:09:02.455475815-05:00] WARN [tp:032327]: Failed to read packet, closing transport error="EOF" src="read_loop"
[2024-04-11T10:09:02.455538763-05:00] DEBUG [tp:032327]: Stopped log loop
[2024-04-11T10:09:02.455574722-05:00] DEBUG [tp:032327]: Stopped serving. remaining_tps=0 remote_pk=0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c tp_id=5bacb16f-79c4-0d39-aaff-9a82ced5192a tp_index=1
[2024-04-11T10:09:02.79966624-05:00] DEBUG [tp:032327]: Error deleting transport error=<nil> tp-id=5bacb16f-79c4-0d39-aaff-9a82ced5192a
[2024-04-11T10:09:02.98732514-05:00] DEBUG [visor]: Saving transport to 0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c via dmsg
[2024-04-11T10:09:02.987384171-05:00] DEBUG [transport_manager]: Initializing TP with ID 5bacb16f-79c4-0d39-aaff-9a82ced5192a
[2024-04-11T10:09:02.987511665-05:00] DEBUG [transport_manager]: Dialing transport to 0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c via dmsg
[2024-04-11T10:09:03.284296581-05:00] DEBUG [tp:032327]: Error deleting transport error="404 Not Found: {"error":"transport not found"}" tp-id=5bacb16f-79c4-0d39-aaff-9a82ced5192a
[2024-04-11T10:09:03.634831139-05:00] DEBUG [transport_manager]: recv transport request: type(dmsg) remote(0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c)
[2024-04-11T10:09:03.634883247-05:00] DEBUG [transport_manager]: No TP found, creating new one
[2024-04-11T10:09:03.635016452-05:00] DEBUG [tp:032327]: Performing settlement handshake...
[2024-04-11T10:09:03.63507237-05:00] DEBUG [tp:032327]: Serving. remote_pk=0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c tp_id=5bacb16f-79c4-0d39-aaff-9a82ced5192a tp_index=1
[2024-04-11T10:09:23.635252364-05:00] WARN [transport_manager]: Failed to accept transport error="settlement handshake failed: context deadline exceeded"
[2024-04-11T10:09:23.635387961-05:00] ERROR [tp:032327]: Failed to register transport. error="Post "http://tpd.skywire.skycoin.com/transports/": context deadline exceeded"
[2024-04-11T10:09:23.76726204-05:00] DEBUG [transport_manager]: Error dialing transport to 0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c via dmsg: settlement handshake failed: context deadline exceeded
[2024-04-11T10:09:24.261465648-05:00] DEBUG [tp:032327]: Error deleting transport error="404 Not Found: {"error":"transport not found"}" tp-id=5bacb16f-79c4-0d39-aaff-9a82ced5192a
[2024-04-11T10:09:24.409361031-05:00] WARN [tp:032327]: Failed to read packet, closing transport error="transport is no longer being served" src="read_loop"
[2024-04-11T10:09:24.409436683-05:00] DEBUG [tp:032327]: Stopped log loop
[2024-04-11T10:09:24.409472768-05:00] DEBUG [tp:032327]: Stopped serving. remaining_tps=0 remote_pk=0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c tp_id=5bacb16f-79c4-0d39-aaff-9a82ced5192a tp_index=1
[2024-04-11T10:09:24.884763603-05:00] DEBUG [tp:032327]: Error deleting transport error="404 Not Found: {"error":"transport not found"}" tp-id=5bacb16f-79c4-0d39-aaff-9a82ced5192a
[2024-04-11T10:09:25.075319993-05:00] DEBUG [visor]: Saving transport to 0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c via dmsg
[2024-04-11T10:09:25.075377552-05:00] DEBUG [transport_manager]: Initializing TP with ID 5bacb16f-79c4-0d39-aaff-9a82ced5192a
[2024-04-11T10:09:25.075463594-05:00] DEBUG [transport_manager]: Dialing transport to 0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c via dmsg
[2024-04-11T10:09:45.856649649-05:00] DEBUG [transport_manager]: Error dialing transport to 0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c via dmsg: settlement handshake failed: context deadline exceeded
[2024-04-11T10:09:46.324248648-05:00] DEBUG [tp:032327]: Error deleting transport error="404 Not Found: {"error":"transport not found"}" tp-id=5bacb16f-79c4-0d39-aaff-9a82ced5192a
[2024-04-11T10:09:46.616319338-05:00] DEBUG [visor]: Saving transport to 0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c via dmsg
[2024-04-11T10:09:46.6163568-05:00] DEBUG [transport_manager]: Initializing TP with ID 5bacb16f-79c4-0d39-aaff-9a82ced5192a
[2024-04-11T10:09:46.616447276-05:00] DEBUG [transport_manager]: Dialing transport to 0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c via dmsg
[2024-04-11T10:10:07.37973046-05:00] DEBUG [transport_manager]: Error dialing transport to 0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c via dmsg: settlement handshake failed: context deadline exceeded
[2024-04-11T10:10:07.870581546-05:00] DEBUG [tp:032327]: Error deleting transport error="404 Not Found: {"error":"transport not found"}" tp-id=5bacb16f-79c4-0d39-aaff-9a82ced5192a
[2024-04-11T10:10:08.158613432-05:00] DEBUG [visor]: Saving transport to 0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c via dmsg
[2024-04-11T10:10:08.158660373-05:00] DEBUG [transport_manager]: Initializing TP with ID 5bacb16f-79c4-0d39-aaff-9a82ced5192a
[2024-04-11T10:10:08.158756083-05:00] DEBUG [transport_manager]: Dialing transport to 0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c via dmsg
[2024-04-11T10:10:28.937420822-05:00] DEBUG [transport_manager]: Error dialing transport to 0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c via dmsg: settlement handshake failed: context deadline exceeded
[2024-04-11T10:10:29.411792439-05:00] DEBUG [tp:032327]: Error deleting transport error="404 Not Found: {"error":"transport not found"}" tp-id=5bacb16f-79c4-0d39-aaff-9a82ced5192a
It seems in this instance the errors begin with failure to register the transport with transport discovery, instead of failure to accept the transport
[2024-04-11T10:09:23.635387961-05:00] ERROR [tp:032327]: Failed to register transport. error="Post "http://tpd.skywire.skycoin.com/transports/": context deadline exceeded"
will recommence the tests with a brief sleep between transport creation attempts
When a 2 second sleep is added to the loop, it seems the issue no longer manifests - the test is ongoing. ~3560 loops in without the error.
However; this does not prevent any given visor from being inundated with transport requests which might trigger the listener accept chan maxed
error from which there is no apparent recovery. Further testing is needed.
The visor did stop being able to accept transports after the 4427th loop with listener accept chan maxed
[2024-04-11T16:25:34.967980343-05:00] INFO [skywire-cli]: Established dmsg transport to 0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c
type id remote_pk mode label
dmsg 5bacb16f-79c4-0d39-aaff-9a82ced5192a 0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c regular user
OK
4426
[2024-04-11T16:25:39.298518738-05:00] INFO [skywire-cli]: Established dmsg transport to 0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c
type id remote_pk mode label
dmsg 5bacb16f-79c4-0d39-aaff-9a82ced5192a 0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c regular user
OK
4427
[2024-04-11T16:26:02.88069327-05:00] INFO [skywire-cli]: Established dmsg transport to 0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c
type id remote_pk mode label
dmsg 5bacb16f-79c4-0d39-aaff-9a82ced5192a 0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c regular user
OK
4428
[2024-04-11T16:26:26.945148546-05:00] FATAL [skywire-cli]: Failed to establish dmsg transport: save transport: settlement handshake failed: context deadline exceeded
OK
4429
[2024-04-11T16:26:50.509788975-05:00] FATAL [skywire-cli]: Failed to establish dmsg transport: save transport: settlement handshake failed: context deadline exceeded
OK
4430
[2024-04-11T16:27:14.073870926-05:00] FATAL [skywire-cli]: Failed to establish dmsg transport: save transport: settlement handshake failed: context deadline exceeded
OK
4431
i don't have the visor debug logging of the exact moment this occurred because it's beyond the scroll back history of the terminal.
The visor is currently unable to accept transports, though the test ended hours ago
The non-transportability aspect of this issue was addressed in a temporary way in #1807
When performing the following test of the transport setup-node functionality (which also performs a
/health
check of the visor over dmsghttp usingdmsgweb
):Long before the conclusion of the test, the visor enters a state where it can no longer accept transports.
Oddly, it is still possible to create transports to other visors; it's simply not possible to accept them or for remote visors to create a transport to the local visor
Here I check a public key from the end of the above test to see if it's online (it is)
I add and remove a transport to that visor from the local visor as a test - it works
Then I attempt to create a transport from the remote visor to my local visor with the transport setup-node
it fails with the following visor debug logging
Conclusion
It's clear there is some issue with either too many open connections or connections not getting closed properly.