skycoin / skywire

Skywire Node implementation

dmsg client fails to accept stream with error="dmsg error 401 - listener accept chan maxed" #1803

Open 0pcom opened 2 months ago

0pcom commented 2 months ago

When performing the following test of the transport setup-node functionality (which also performs a /health check of the visor over dmsghttp using dmsgweb):

_mypk="$(skywire cli visor pk)"
skywire cli ut -o | while read _pk ; do
    echo $_pk
    # /health check of the visor over dmsghttp, via the dmsgweb socks5 proxy
    curl -sLx socks5h://127.0.0.1:4445 http://${_pk}.dmsg/health ; echo
    # create a dmsg transport from the remote visor to this one via the setup-node
    skywire svc tps add -t dmsg -1 $_pk -2 ${_mypk}
    # look up the resulting transport ID, then remove the transport
    _tpid="$(skywire svc tps list -1 $_pk | jq --arg mypk "${_mypk}" '.[] | select(.Remote == $mypk) | .ID' | tr -d '"')"
    [[ ${_tpid} != "" ]] && skywire svc tps rm -1 $_pk -i $_tpid
done

Long before the conclusion of the test, the visor enters a state where it can no longer accept transports.

[2024-04-08T22:43:11.528770696-05:00] DEBUG [dmsgC]: Failed to accept stream. error="dmsg error 401 - listener accept chan maxed" session=0326978f5a53aff537dbb47fed58b1f123af3b00132d365f1309a14db4168dcff7
[2024-04-08T22:44:47.398367757-05:00] DEBUG [dmsgC]: Failed to accept stream. error="dmsg error 401 - listener accept chan maxed" session=03d5b55d1133b26485c664cf8b95cff6746d1e321c34e48c9fed293eff0d6d49e5
[2024-04-08T22:46:23.231961842-05:00] DEBUG [dmsgC]: Failed to accept stream. error="dmsg error 401 - listener accept chan maxed" session=0326978f5a53aff537dbb47fed58b1f123af3b00132d365f1309a14db4168dcff7
[2024-04-08T22:47:14.428792897-05:00] DEBUG [dmsgC]: Failed to accept stream. error="dmsg error 401 - listener accept chan maxed" session=03d5b55d1133b26485c664cf8b95cff6746d1e321c34e48c9fed293eff0d6d49e5
[2024-04-08T22:47:19.494423284-05:00] DEBUG [dmsgC]: Failed to accept stream. error="dmsg error 401 - listener accept chan maxed" session=0326978f5a53aff537dbb47fed58b1f123af3b00132d365f1309a14db4168dcff7
[2024-04-08T22:47:30.993828517-05:00] DEBUG [dmsgC]: Failed to accept stream. error="dmsg error 401 - listener accept chan maxed" session=0326978f5a53aff537dbb47fed58b1f123af3b00132d365f1309a14db4168dcff7
[2024-04-08T22:47:37.453425644-05:00] DEBUG [dmsgC]: Failed to accept stream. error="dmsg error 401 - listener accept chan maxed" session=0326978f5a53aff537dbb47fed58b1f123af3b00132d365f1309a14db4168dcff7
[2024-04-08T22:48:04.689101587-05:00] DEBUG [dmsgC]: Failed to accept stream. error="dmsg error 401 - listener accept chan maxed" session=03d5b55d1133b26485c664cf8b95cff6746d1e321c34e48c9fed293eff0d6d49e5
[2024-04-08T22:48:47.932088037-05:00] DEBUG [dmsgC]: Failed to accept stream. error="dmsg error 401 - listener accept chan maxed" session=03d5b55d1133b26485c664cf8b95cff6746d1e321c34e48c9fed293eff0d6d49e5
[2024-04-08T22:48:54.133752329-05:00] DEBUG [dmsgC]: Failed to accept stream. error="dmsg error 401 - listener accept chan maxed" session=03d5b55d1133b26485c664cf8b95cff6746d1e321c34e48c9fed293eff0d6d49e5
[2024-04-08T22:49:01.558147019-05:00] DEBUG [dmsgC]: Failed to accept stream. error="dmsg error 401 - listener accept chan maxed" session=0326978f5a53aff537dbb47fed58b1f123af3b00132d365f1309a14db4168dcff7
[2024-04-08T22:50:49.603865602-05:00] DEBUG [dmsgC]: Failed to accept stream. error="dmsg error 401 - listener accept chan maxed" session=0326978f5a53aff537dbb47fed58b1f123af3b00132d365f1309a14db4168dcff7

Oddly, it is still possible to create transports to other visors; it is simply not possible for the local visor to accept them, or for remote visors to create a transport to the local visor.

Here I check a public key from the end of the above test to see if it's online (it is):

$ skywire-cli ut -ok 03fecb78a0c963ea09ce5729b8660adfe2f5126d88e9f8ed8d92e1ed3a9c48ce80
03fecb78a0c963ea09ce5729b8660adfe2f5126d88e9f8ed8d92e1ed3a9c48ce80

I add and remove a transport to that visor from the local visor as a test; it works:

[user@linux ~]$ skywire-cli tp add -t dmsg 03fecb78a0c963ea09ce5729b8660adfe2f5126d88e9f8ed8d92e1ed3a9c48ce80
[2024-04-09T08:31:59.681871589-05:00] INFO [skywire-cli]: Established dmsg transport to 03fecb78a0c963ea09ce5729b8660adfe2f5126d88e9f8ed8d92e1ed3a9c48ce80
type     id                                       remote_pk                                                              mode        label
dmsg     f70fb2bf-eb8d-0686-943a-42f793ebdd5a     03fecb78a0c963ea09ce5729b8660adfe2f5126d88e9f8ed8d92e1ed3a9c48ce80     regular     user
[user@linux ~]$ skywire cli tp rm -i f70fb2bf-eb8d-0686-943a-42f793ebdd5a
OK

Then I attempt to create a transport from the remote visor to my local visor with the transport setup-node:

[user@linux ~]$ skywire svc tps add -t dmsg -1 03fecb78a0c963ea09ce5729b8660adfe2f5126d88e9f8ed8d92e1ed3a9c48ce80 -2 $(skywire cli visor pk)
2024/04/09 08:32:59 Error occurred: unexpected HTTP response status: 500 Internal Server Error

It fails, with the following visor debug logging:

[2024-04-09T08:32:58.008434781-05:00] DEBUG [dmsgC]: Failed to accept stream. error="dmsg error 401 - listener accept chan maxed" session=03d5b55d1133b26485c664cf8b95cff6746d1e321c34e48c9fed293eff0d6d49e5

Conclusion

It's clear there is some issue with either too many open connections or connections not being closed properly.
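One way to gather evidence for the connection-leak theory would be to watch the visor's open file descriptor count across test cycles. A minimal sketch, assuming Linux (`/proc`); the process name `skywire-visor` used in the comment is an assumption:

```shell
# Count the open file descriptors of a process via /proc (Linux only).
# A count that grows steadily across create/remove transport cycles
# would point at connections not being closed.
count_fds() {
  ls "/proc/$1/fd" | wc -l
}

# Against the visor (process name is an assumption):
#   count_fds "$(pgrep -f skywire-visor)"
# Self-contained demo against the current shell:
count_fds "$$"
```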

0pcom commented 2 months ago

Reproducing the issue

I've devised a simple way to reproduce the issue.

Create a transport from the visor to itself via dmsg:

skywire cli tp add -t dmsg $(skywire-cli visor pk)

Copy the resulting transport ID in place of the existing transport ID in the following shell command:

 _mypk="$(skywire cli visor pk)"; for ((i=1; i<=10000; i++)); do skywire cli tp add -t dmsg $_mypk  ; skywire cli tp rm -i 5bacb16f-79c4-0d39-aaff-9a82ced5192a ; done

Before all the loops have completed, the visor is no longer able to accept transports. It may take over an hour to reach this state.

In one such test, the visor could no longer accept transports after creating and removing the transport 2834 times. In another test, it happened in fewer than 300 loops; in yet another, in fewer than 700. It does not appear to happen consistently at the same number of loops, but it does always happen eventually.
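The hardcoded transport ID in the loop above could instead be parsed out of the table that `tp add` prints. A sketch, assuming the table goes to stdout and the data row's first column is the transport type, as in the `tp add` output shown earlier:

```shell
# Pull the transport ID (second column) out of the table printed by
# `skywire cli tp add`, matching the data row by its type column.
extract_tp_id() {
  awk '$1 == "dmsg" { print $2 }'
}

# Demo on the exact table format from the output above:
printf '%s\n' \
  'type     id                                       remote_pk                                                              mode        label' \
  'dmsg     5bacb16f-79c4-0d39-aaff-9a82ced5192a     0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c     regular     user' |
  extract_tp_id
# prints: 5bacb16f-79c4-0d39-aaff-9a82ced5192a
```

In the loop this would look something like `_tpid="$(skywire cli tp add -t dmsg $_mypk | extract_tp_id)" ; skywire cli tp rm -i "$_tpid"`, assuming the INFO log line does not also land on stdout.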

0pcom commented 2 months ago
 _mypk="$(skywire cli visor pk)"; for ((i=1; i<=10000; i++)); do skywire cli tp add -t dmsg $_mypk  ; skywire cli tp rm -i 5bacb16f-79c4-0d39-aaff-9a82ced5192a ; done

Running the same test again after increasing the accept buffer size from 20 to 50.


The listener accept chan maxed error no longer occurs, but the visor does indeed fail to accept transports after around 334 cycles of creating and deleting the same transport.

$ skywire cli tp add -t dmsg 0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c
[2024-04-11T09:36:49.515404242-05:00] FATAL [skywire-cli]: Failed to establish dmsg transport: save transport: settlement handshake failed: context deadline exceeded

After rebuilding from the develop branch, the error is now consistently context deadline exceeded.

Here is the visor debug logging of that:


[2024-04-11T10:09:02.326362998-05:00] WARN [tp:032327]: Failed to read packet, closing transport error="EOF" src="read_loop"
[2024-04-11T10:09:02.455373695-05:00] DEBUG readPacket [tp:032327]: Failed to read packet header. error="EOF"
[2024-04-11T10:09:02.455475815-05:00] WARN [tp:032327]: Failed to read packet, closing transport error="EOF" src="read_loop"
[2024-04-11T10:09:02.455538763-05:00] DEBUG [tp:032327]: Stopped log loop
[2024-04-11T10:09:02.455574722-05:00] DEBUG [tp:032327]: Stopped serving. remaining_tps=0 remote_pk=0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c tp_id=5bacb16f-79c4-0d39-aaff-9a82ced5192a tp_index=1
[2024-04-11T10:09:02.79966624-05:00] DEBUG [tp:032327]: Error deleting transport error=<nil> tp-id=5bacb16f-79c4-0d39-aaff-9a82ced5192a
[2024-04-11T10:09:02.98732514-05:00] DEBUG [visor]: Saving transport to 0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c via dmsg
[2024-04-11T10:09:02.987384171-05:00] DEBUG [transport_manager]: Initializing TP with ID 5bacb16f-79c4-0d39-aaff-9a82ced5192a
[2024-04-11T10:09:02.987511665-05:00] DEBUG [transport_manager]: Dialing transport to 0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c via dmsg
[2024-04-11T10:09:03.284296581-05:00] DEBUG [tp:032327]: Error deleting transport error="404 Not Found: {"error":"transport not found"}" tp-id=5bacb16f-79c4-0d39-aaff-9a82ced5192a
[2024-04-11T10:09:03.634831139-05:00] DEBUG [transport_manager]: recv transport request: type(dmsg) remote(0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c)
[2024-04-11T10:09:03.634883247-05:00] DEBUG [transport_manager]: No TP found, creating new one
[2024-04-11T10:09:03.635016452-05:00] DEBUG [tp:032327]: Performing settlement handshake...
[2024-04-11T10:09:03.63507237-05:00] DEBUG [tp:032327]: Serving. remote_pk=0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c tp_id=5bacb16f-79c4-0d39-aaff-9a82ced5192a tp_index=1
[2024-04-11T10:09:23.635252364-05:00] WARN [transport_manager]: Failed to accept transport error="settlement handshake failed: context deadline exceeded"
[2024-04-11T10:09:23.635387961-05:00] ERROR [tp:032327]: Failed to register transport. error="Post "http://tpd.skywire.skycoin.com/transports/": context deadline exceeded"
[2024-04-11T10:09:23.76726204-05:00] DEBUG [transport_manager]: Error dialing transport to 0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c via dmsg: settlement handshake failed: context deadline exceeded
[2024-04-11T10:09:24.261465648-05:00] DEBUG [tp:032327]: Error deleting transport error="404 Not Found: {"error":"transport not found"}" tp-id=5bacb16f-79c4-0d39-aaff-9a82ced5192a
[2024-04-11T10:09:24.409361031-05:00] WARN [tp:032327]: Failed to read packet, closing transport error="transport is no longer being served" src="read_loop"
[2024-04-11T10:09:24.409436683-05:00] DEBUG [tp:032327]: Stopped log loop
[2024-04-11T10:09:24.409472768-05:00] DEBUG [tp:032327]: Stopped serving. remaining_tps=0 remote_pk=0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c tp_id=5bacb16f-79c4-0d39-aaff-9a82ced5192a tp_index=1
[2024-04-11T10:09:24.884763603-05:00] DEBUG [tp:032327]: Error deleting transport error="404 Not Found: {"error":"transport not found"}" tp-id=5bacb16f-79c4-0d39-aaff-9a82ced5192a
[2024-04-11T10:09:25.075319993-05:00] DEBUG [visor]: Saving transport to 0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c via dmsg
[2024-04-11T10:09:25.075377552-05:00] DEBUG [transport_manager]: Initializing TP with ID 5bacb16f-79c4-0d39-aaff-9a82ced5192a
[2024-04-11T10:09:25.075463594-05:00] DEBUG [transport_manager]: Dialing transport to 0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c via dmsg
[2024-04-11T10:09:45.856649649-05:00] DEBUG [transport_manager]: Error dialing transport to 0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c via dmsg: settlement handshake failed: context deadline exceeded
[2024-04-11T10:09:46.324248648-05:00] DEBUG [tp:032327]: Error deleting transport error="404 Not Found: {"error":"transport not found"}" tp-id=5bacb16f-79c4-0d39-aaff-9a82ced5192a
[2024-04-11T10:09:46.616319338-05:00] DEBUG [visor]: Saving transport to 0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c via dmsg
[2024-04-11T10:09:46.6163568-05:00] DEBUG [transport_manager]: Initializing TP with ID 5bacb16f-79c4-0d39-aaff-9a82ced5192a
[2024-04-11T10:09:46.616447276-05:00] DEBUG [transport_manager]: Dialing transport to 0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c via dmsg
[2024-04-11T10:10:07.37973046-05:00] DEBUG [transport_manager]: Error dialing transport to 0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c via dmsg: settlement handshake failed: context deadline exceeded
[2024-04-11T10:10:07.870581546-05:00] DEBUG [tp:032327]: Error deleting transport error="404 Not Found: {"error":"transport not found"}" tp-id=5bacb16f-79c4-0d39-aaff-9a82ced5192a
[2024-04-11T10:10:08.158613432-05:00] DEBUG [visor]: Saving transport to 0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c via dmsg
[2024-04-11T10:10:08.158660373-05:00] DEBUG [transport_manager]: Initializing TP with ID 5bacb16f-79c4-0d39-aaff-9a82ced5192a
[2024-04-11T10:10:08.158756083-05:00] DEBUG [transport_manager]: Dialing transport to 0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c via dmsg
[2024-04-11T10:10:28.937420822-05:00] DEBUG [transport_manager]: Error dialing transport to 0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c via dmsg: settlement handshake failed: context deadline exceeded
[2024-04-11T10:10:29.411792439-05:00] DEBUG [tp:032327]: Error deleting transport error="404 Not Found: {"error":"transport not found"}" tp-id=5bacb16f-79c4-0d39-aaff-9a82ced5192a

It seems that in this instance the errors begin with a failure to register the transport with transport discovery, rather than a failure to accept the transport:

[2024-04-11T10:09:23.635387961-05:00] ERROR [tp:032327]: Failed to register transport. error="Post "http://tpd.skywire.skycoin.com/transports/": context deadline exceeded"

I will recommence the tests with a brief sleep between transport creation attempts.

0pcom commented 2 months ago

When a 2-second sleep is added to the loop, the issue seems not to manifest; the test is ongoing, ~3560 loops in without the error.

However, this does not prevent any given visor from being inundated with transport requests, which might trigger the listener accept chan maxed error, from which there is no apparent recovery. Further testing is needed.
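For reference, the modified loop is the same as before with a pause appended to the body. A sketch; `run_cycle` is a hypothetical stand-in stubbed with `echo` so the snippet runs standalone, and in the real test its body is the `tp add` / `tp rm` pair from the loop above:

```shell
# Reproduction loop with a 2-second pause between cycles. run_cycle is a
# stand-in; in the real test its body is:
#   skywire cli tp add -t dmsg "$_mypk"
#   skywire cli tp rm -i 5bacb16f-79c4-0d39-aaff-9a82ced5192a
run_cycle() { echo "cycle $1"; }

for ((i=1; i<=3; i++)); do   # i<=10000 in the real test
  run_cycle "$i"
  sleep 2                    # brief pause between transport creation attempts
done
```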

Update

The visor did stop being able to accept transports, with listener accept chan maxed, after the 4427th loop:

[2024-04-11T16:25:34.967980343-05:00] INFO [skywire-cli]: Established dmsg transport to 0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c
type     id                                       remote_pk                                                              mode        label
dmsg     5bacb16f-79c4-0d39-aaff-9a82ced5192a     0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c     regular     user
OK
4426
[2024-04-11T16:25:39.298518738-05:00] INFO [skywire-cli]: Established dmsg transport to 0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c
type     id                                       remote_pk                                                              mode        label
dmsg     5bacb16f-79c4-0d39-aaff-9a82ced5192a     0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c     regular     user
OK
4427
[2024-04-11T16:26:02.88069327-05:00] INFO [skywire-cli]: Established dmsg transport to 0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c
type     id                                       remote_pk                                                              mode        label
dmsg     5bacb16f-79c4-0d39-aaff-9a82ced5192a     0323272a60895f56aad82cb767fb5c413807adcf7c9fb0578b1b1c5807c7f29d4c     regular     user
OK
4428
[2024-04-11T16:26:26.945148546-05:00] FATAL [skywire-cli]: Failed to establish dmsg transport: save transport: settlement handshake failed: context deadline exceeded
OK
4429
[2024-04-11T16:26:50.509788975-05:00] FATAL [skywire-cli]: Failed to establish dmsg transport: save transport: settlement handshake failed: context deadline exceeded
OK
4430
[2024-04-11T16:27:14.073870926-05:00] FATAL [skywire-cli]: Failed to establish dmsg transport: save transport: settlement handshake failed: context deadline exceeded
OK
4431

I don't have the visor debug logging of the exact moment this occurred, because it's beyond the scroll-back history of the terminal.

The visor is currently unable to accept transports, though the test ended hours ago.

0pcom commented 2 months ago

The non-transportability aspect of this issue was addressed in a temporary way in #1807.