keep-network / keep-core

The smart contracts and reference client behind the Keep network
https://keep.network
MIT License
118 stars 73 forks source link

Silent panic - concurrent write to websocket #3699

Closed thevops closed 1 year ago

thevops commented 1 year ago

Hello,

I had a problem with my tBTC node. It failed silently but didn't stop completely. Peers/bootstraps connections worked, but Prometheus monitoring (https://monitoring.threshold.network/grafana/public-dashboards/1a09fa3a621c4837988b36f2d6ae6e24?orgId=0) lost him for some time (until I noticed it by viewing the above Prometheus). I also noticed that endpoint /metrics worked after the failure, but /diagnostics did not (infinite loading). In that case, the app should exit with an error code, then my system would start it again (I use containers).

I had to manually restart the node to be discovered by the monitoring to be qualified for rewards.

Logs

2023-08-17 03:21:21.933 {"level":"info","ts":"2023-08-17T01:21:21.933Z","logger":"keep-libp2p","caller":"libp2p/libp2p.go:242","msg":"number of connected peers: [144]"}
2023-08-17 03:20:21.934 {"level":"info","ts":"2023-08-17T01:20:21.933Z","logger":"keep-libp2p","caller":"libp2p/libp2p.go:242","msg":"number of connected peers: [144]"}
2023-08-17 03:19:21.934 {"level":"info","ts":"2023-08-17T01:19:21.934Z","logger":"keep-libp2p","caller":"libp2p/libp2p.go:242","msg":"number of connected peers: [144]"}
2023-08-17 03:18:21.933 {"level":"info","ts":"2023-08-17T01:18:21.933Z","logger":"keep-libp2p","caller":"libp2p/libp2p.go:242","msg":"number of connected peers: [144]"}
2023-08-17 03:17:33.277     /usr/local/go/src/net/http/server.go:3071 +0x4db
2023-08-17 03:17:33.277 created by net/http.(*Server).Serve
2023-08-17 03:17:33.277     /usr/local/go/src/net/http/server.go:1966 +0x5d7
2023-08-17 03:17:33.277 net/http.(*conn).serve(0xc00151b0e0, {0x1d6fb38, 0xc013641020})
2023-08-17 03:17:33.276     /usr/local/go/src/net/http/server.go:2916 +0x43b
2023-08-17 03:17:33.276 net/http.serverHandler.ServeHTTP({0xc001849200?}, {0x1d6f010, 0xc0017162a0}, 0xc001865b00)
2023-08-17 03:17:33.276     /usr/local/go/src/net/http/server.go:2462 +0x149
2023-08-17 03:17:33.276 net/http.(*ServeMux).ServeHTTP(0x0?, {0x1d6f010, 0xc0017162a0}, 0xc001865b00)
2023-08-17 03:17:33.276     /usr/local/go/src/net/http/server.go:2084 +0x2f
2023-08-17 03:17:33.276 net/http.HandlerFunc.ServeHTTP(0x7f2ab35d92e8?, {0x1d6f010?, 0xc0017162a0?}, 0x410a65?)
2023-08-17 03:17:33.276     /go/pkg/mod/github.com/keep-network/keep-common@v1.7.1-0.20230501122407-5cc3757ccf0d/pkg/clientinfo/clientinfo.go:61 +0x39
2023-08-17 03:17:33.276 github.com/keep-network/keep-common/pkg/clientinfo.(*Registry).EnableServer.func2({0x1d6f010?, 0xc0017162a0}, 0xc243e206bf9a?)
2023-08-17 03:17:33.276     /go/pkg/mod/github.com/keep-network/keep-common@v1.7.1-0.20230501122407-5cc3757ccf0d/pkg/clientinfo/diagnostics.go:26 +0x24e
2023-08-17 03:17:33.276 github.com/keep-network/keep-common/pkg/clientinfo.(*Registry).exposeDiagnostics(0xc0139a8b00)
2023-08-17 03:17:33.276     /go/src/github.com/keep-network/keep-core/pkg/clientinfo/diagnostics.go:135 +0x26
2023-08-17 03:17:33.276 github.com/keep-network/keep-core/pkg/clientinfo.(*Registry).RegisterBtcChainInfoSource.func1()
2023-08-17 03:17:33.276     /go/src/github.com/keep-network/keep-core/pkg/bitcoin/electrum/electrum.go:259 +0x2e
2023-08-17 03:17:33.276 github.com/keep-network/keep-core/pkg/bitcoin/electrum.(*Connection).GetLatestBlockHeight(0x4?)
2023-08-17 03:17:33.276     /go/src/github.com/keep-network/keep-core/pkg/bitcoin/electrum/electrum.go:763 +0x85
2023-08-17 03:17:33.276 github.com/keep-network/keep-core/pkg/bitcoin/electrum.requestWithRetry[...](0x18?, 0x14bd160?)
2023-08-17 03:17:33.276     /go/pkg/mod/github.com/keep-network/keep-common@v1.7.1-0.20230501122407-5cc3757ccf0d/pkg/wrappers/wrappers.go:76
2023-08-17 03:17:33.276 github.com/keep-network/keep-common/pkg/wrappers.DoWithDefaultRetry(...)
2023-08-17 03:17:33.276     /go/pkg/mod/github.com/keep-network/keep-common@v1.7.1-0.20230501122407-5cc3757ccf0d/pkg/wrappers/wrappers.go:34 +0xff
2023-08-17 03:17:33.276 github.com/keep-network/keep-common/pkg/wrappers.DoWithRetry({0x1d6fac8?, 0xc000048028?}, 0x3b9aca00, 0x0?, 0x4?, 0xc0131bd7a8)
2023-08-17 03:17:33.276     /go/src/github.com/keep-network/keep-core/pkg/bitcoin/electrum/electrum.go:775 +0xfb
2023-08-17 03:17:33.276 github.com/keep-network/keep-core/pkg/bitcoin/electrum.requestWithRetry[...].func1()
2023-08-17 03:17:33.276     /go/src/github.com/keep-network/keep-core/pkg/bitcoin/electrum/electrum.go:262 +0x36
2023-08-17 03:17:33.276 github.com/keep-network/keep-core/pkg/bitcoin/electrum.(*Connection).GetLatestBlockHeight.func1({0x1d6fb00?, 0xc001cb5aa0?}, 0xc001e336b0?)
2023-08-17 03:17:33.276     /go/pkg/mod/github.com/keep-network/go-electrum@v0.0.0-20230524074410-befe891c2f8c/electrum/subscribe.go:31 +0x7b
2023-08-17 03:17:33.276 github.com/checksum0/go-electrum/electrum.(*Client).SubscribeHeaders(0xc013a16700, {0x1d6fb00, 0xc001cb5aa0})
2023-08-17 03:17:33.276     /go/pkg/mod/github.com/keep-network/go-electrum@v0.0.0-20230524074410-befe891c2f8c/electrum/network.go:296 +0x1db
2023-08-17 03:17:33.276 github.com/checksum0/go-electrum/electrum.(*Client).request(0xc013a16700, {0x1d6fb00, 0xc001cb5aa0}, {0x19017d1, 0x1c}, {0x2afa898, 0x0, 0x0}, {0x142f0e0, 0xc001b9fd58})
2023-08-17 03:17:33.276     /go/pkg/mod/github.com/keep-network/go-electrum@v0.0.0-20230524074410-befe891c2f8c/electrum/transport_ws.go:91 +0x16a
2023-08-17 03:17:33.276 github.com/checksum0/go-electrum/electrum.(*WebSocketTransport).SendMessage(0x16069e0?, {0xc0135129c0?, 0x459b6d?, 0xc001abc000?})
2023-08-17 03:17:33.276     /go/pkg/mod/github.com/gorilla/websocket@v1.5.0/conn.go:773 +0x152
2023-08-17 03:17:33.276 github.com/gorilla/websocket.(*Conn).WriteMessage(0x2ac6ba0?, 0xc0135129c0?, {0xc0135129c0, 0x40, 0x40})
2023-08-17 03:17:33.276     /go/pkg/mod/github.com/gorilla/websocket@v1.5.0/conn.go:520 +0x45
2023-08-17 03:17:33.276 github.com/gorilla/websocket.(*Conn).NextWriter(0xc0136bde40, 0x1)
2023-08-17 03:17:33.276     /go/pkg/mod/github.com/gorilla/websocket@v1.5.0/conn.go:480 +0x42
2023-08-17 03:17:33.276 github.com/gorilla/websocket.(*Conn).beginMessage(0xc0136bde40, 0xc0138aef90, 0x1)
2023-08-17 03:17:33.276     /go/pkg/mod/github.com/gorilla/websocket@v1.5.0/conn.go:731 +0x45
2023-08-17 03:17:33.276 github.com/gorilla/websocket.(*messageWriter).Close(0x0?)
2023-08-17 03:17:33.276     /go/pkg/mod/github.com/gorilla/websocket@v1.5.0/conn.go:617 +0x52b
2023-08-17 03:17:33.276 github.com/gorilla/websocket.(*messageWriter).flushFrame(0xc013526c90, 0x1, {0x0?, 0x0?, 0x0?})
2023-08-17 03:17:33.276     /usr/local/go/src/runtime/panic.go:844 +0x258
2023-08-17 03:17:33.276 panic({0x1484a00, 0x1d5f640})
2023-08-17 03:17:33.276     /usr/local/go/src/net/http/server.go:1825 +0xbf
2023-08-17 03:17:33.276 net/http.(*conn).serve.func1()
2023-08-17 03:17:33.276 goroutine 1178757 [running]:
2023-08-17 03:17:33.276 2023/08/17 01:17:33 http: panic serving 10.140.1.234:55572: concurrent write to websocket connection
2023-08-17 03:17:21.934 {"level":"info","ts":"2023-08-17T01:17:21.934Z","logger":"keep-libp2p","caller":"libp2p/libp2p.go:242","msg":"number of connected peers: [144]"}
2023-08-17 03:16:21.934 {"level":"info","ts":"2023-08-17T01:16:21.934Z","logger":"keep-libp2p","caller":"libp2p/libp2p.go:242","msg":"number of connected peers: [144]"}
2023-08-17 03:15:21.933 {"level":"info","ts":"2023-08-17T01:15:21.933Z","logger":"keep-libp2p","caller":"libp2p/libp2p.go:242","msg":"number of connected peers: [144]"}
### Tasks
- [ ] https://github.com/keep-network/keep-core/pull/3701
- [ ] https://github.com/keep-network/keep-core/pull/3715
pdyraga commented 1 year ago

Probably related to https://github.com/checksum0/go-electrum/issues/10.

thevops commented 1 year ago

Occurred once more 😞

2023/08/25 15:49:20 http: panic serving 10.121.1.207:35180: concurrent write to websocket connection
panic({0x1484a00, 0x1d5f640})
    /usr/local/go/src/runtime/panic.go:844 +0x258
randyramig commented 1 year ago

I'm seeing the exact same thing. And my uptime went from 99% to 86%. I've left my keep client in this state (because I'm now screwed for monthly rewards) if you need to debug.

blitmore commented 1 year ago

Node 18.118.184.107 is exhibiting the same behaviour - no 'diagnostics' response while 'metrics' does respond. Additionally, note it will lose rewards due to running v2.0.0-m4 and reporting it has no connectivity to the electrum node.