RPC Connectivity issues runnning CB in docker container

ziggie1984 commented 1 year ago

I am not sure why the RPC connection always resets itself and CB needs to be restarted in docker but judging from your previous commit where you added the exitCode I guess this is a known problem? (Happens within a time period of 2-3 hours).

 Unexpected exit {"err": "interceptor error: rpc error: code = Unavailable desc = error reading from server: read tc
p 172.22.0.2:49036->192.168.178.200:10009: read: connection reset by peer"}

Seems to be a problem with the docker virtual machine, so only fix so far for me was to reconnect the lnd-client if that happens but maybe there are other options as well?

joostjager commented 1 year ago

The problem has been reported before, but I am not sure what causes it. I wonder whether lnd is also restarting.

But the restart in docker-compose should be automatic after the exit code fix #66. Isn't it restarting automatically?

ziggie1984 commented 1 year ago

No lnd is not restarting in that case. Yeah restart works fine with docker-compose but was kinda interested to solve because counters get reseted all the time, no big deal but just documenting it in case someone else wonders. I will perform some tests but it looks like the whole lnd connection (cb(lnd) <=> lnd) goes done.

flyerbear commented 1 year ago

I am having a similar issue running circuitbreaker without docker using the latest install script after merge of PR #73.

A few times per day (last interval was approximately 2 hours), the server stops with the following errors:

May 16 12:56:10 prodesk circuitbreaker[1884260]: 2023-05-16T12:56:10.429-0700        INFO        Stopping http server
May 16 12:56:10 prodesk systemd[1]: circuitbreaker.service: Main process exited, code=exited, status=1/FAILURE
May 16 12:56:10 prodesk circuitbreaker[1884260]: 2023-05-16T12:56:10.429-0700        INFO        Stopping grpc server
May 16 12:56:10 prodesk circuitbreaker[1884260]: 2023-05-16T12:56:10.429-0700        ERROR        Unexpected exit        {"err": "rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
May 16 12:56:10 prodesk circuitbreaker[1884260]: main.main
May 16 12:56:10 prodesk circuitbreaker[1884260]:         /home/circuitbreaker/circuitbreaker/main.go:139
May 16 12:56:10 prodesk circuitbreaker[1884260]: runtime.main
May 16 12:56:10 prodesk circuitbreaker[1884260]:         /usr/local/go/src/runtime/proc.go:250
May 16 12:56:10 prodesk systemd[1]: circuitbreaker.service: Failed with result 'exit-code'.
May 16 12:57:10 prodesk systemd[1]: circuitbreaker.service: Scheduled restart job, restart counter is at 1.
May 16 12:57:10 prodesk systemd[1]: Stopped Circuit Breaker.

The service generally tries to restart, as far as I can tell. Sometimes it succeeds and picks up without issue. But other times, it immediately fails again with the same error, or gets through "New peer notification received" steps, and then all logs stop until the service is manually restarted. Should this be a separate issue or is it related?

joostjager commented 1 year ago

@flyerbear can you see what happens in the lnd log at the time the disconnect occurs?

flyerbear commented 1 year ago

This is what lnd shows at the time of the disconnect I pasted in my last comment. I will try to see if it is similar at all disconnects.

May 16 12:56:10 prodesk lnd[1058974]: 2023-05-16 12:56:10.429 [DBG] RRPC: htlc event stream cancelled
May 16 12:56:10 prodesk lnd[1058974]: 2023-05-16 12:56:10.429 [ERR] RPCS: [/routerrpc.Router/HtlcInterceptor]: rpc error: code = Canceled desc = context canceled
May 16 12:56:10 prodesk lnd[1058974]: 2023-05-16 12:56:10.429 [INF] HSWC: Interceptor disconnected, resolving held packets
May 16 12:56:10 prodesk lnd[1058974]: 2023-05-16 12:56:10.429 [ERR] RPCS: [/routerrpc.Router/SubscribeHtlcEvents]: context canceled

This was in the middle of routine connection attempts / ping-pongs to peers.

joostjager commented 1 year ago

Thanks. And for both of these logs, is there anything useful above the copied sections, but still close to 12:56:10.429 ?

flyerbear commented 1 year ago

Not that I can tell. The only possibly relevant thing that is going on is that I have one peer whose node has been down for a very long time (going on 2 weeks). I've kept the channel open because I'm in comms with him and I know he's working on the issue. So the logs have attempts and fails to connect surrounding these events, but those failed attempts also exist throughout the logs. If you think it's relevant I will post. I'm currently collecting the LND logs for all the circuitbreaker failures that have occurred today and will post momentarily. There are slight differences between when CB recovers and when it does not, but I think it may be more CB reporting its status to LND than LND causing a difference to CB.

flyerbear commented 1 year ago

07:58:30 Circuitbreaker failed as before but recovered LND logs:

May 16 07:58:30 prodesk lnd[1058974]: 2023-05-16 07:58:30.179 [DBG] RRPC: htlc event stream cancelled
May 16 07:58:30 prodesk lnd[1058974]: 2023-05-16 07:58:30.179 [ERR] RPCS: [/routerrpc.Router/SubscribeHtlcEvents]: context canceled

09:59:42 Circuitbreaker failed as before, restarted, but only got to "New peer notification received" messages, then logs stop LND logs:

May 16 09:59:42 prodesk lnd[1058974]: 2023-05-16 09:59:42.479 [DBG] RRPC: htlc event stream cancelled
May 16 09:59:42 prodesk lnd[1058974]: 2023-05-16 09:59:42.479 [ERR] RPCS: [/routerrpc.Router/SubscribeHtlcEvents]: context canceled
May 16 09:59:42 prodesk lnd[1058974]: 2023-05-16 09:59:42.479 [ERR] RPCS: [/routerrpc.Router/HtlcInterceptor]: rpc error: code = Canceled desc = context canceled
May 16 09:59:42 prodesk lnd[1058974]: 2023-05-16 09:59:42.479 [INF] HSWC: Interceptor disconnected, resolving held packets

12:56:10 Circuitbreaker failed and attempted restart at 12:57:10, fails again at 12:57:20, attempts restart at 12:58:20, then failed again at 12:58:30. LND logs:

May 16 12:56:10 prodesk lnd[1058974]: 2023-05-16 12:56:10.429 [DBG] RRPC: htlc event stream cancelled
May 16 12:56:10 prodesk lnd[1058974]: 2023-05-16 12:56:10.429 [ERR] RPCS: [/routerrpc.Router/HtlcInterceptor]: rpc error: code = Canceled desc = context canceled
May 16 12:56:10 prodesk lnd[1058974]: 2023-05-16 12:56:10.429 [INF] HSWC: Interceptor disconnected, resolving held packets
May 16 12:56:10 prodesk lnd[1058974]: 2023-05-16 12:56:10.429 [ERR] RPCS: [/routerrpc.Router/SubscribeHtlcEvents]: context canceled
...
May 16 12:57:10 prodesk lnd[1058974]: 2023-05-16 12:57:10.370 [DBG] BTCN: Retrying connection to <failed node>@<failed node ip:port> (reqid xxx) in 1m10s
May 16 12:57:10 prodesk lnd[1058974]: 2023-05-16 12:57:10.580 [DBG] RPCS: [/lnrpc.Lightning/GetChanInfo] requested
May 16 12:57:10 prodesk lnd[1058974]: 2023-05-16 12:57:10.583 [DBG] RPCS: [/lnrpc.Lightning/GetChanInfo] requested
May 16 12:57:10 prodesk lnd[1058974]: 2023-05-16 12:57:10.585 [DBG] RPCS: [/lnrpc.Lightning/GetChanInfo] requested
May 16 12:57:10 prodesk lnd[1058974]: 2023-05-16 12:57:10.587 [DBG] RPCS: [/lnrpc.Lightning/GetChanInfo] requested
May 16 12:57:10 prodesk lnd[1058974]: 2023-05-16 12:57:10.673 [DBG] BTCN: Attempting to connect to <failed node>@<failed node ip:port> (reqid xxx)
May 16 12:57:10 prodesk lnd[1058974]: 2023-05-16 12:57:10.692 [DBG] RPCS: [/lnrpc.Lightning/GetInfo] requested
...

I couldn't find anything relevant in the LND logs at 12:57:20 when CB failed again.

May 16 12:58:20 prodesk lnd[1058974]: 2023-05-16 12:58:20.934 [DBG] RPCS: [/lnrpc.Lightning/GetInfo] requested

Again, I didn't see anything else in the logs at 12:58:30 when CB failed, and only the one RPCS request seemed relevant at 12:58:20. Everything else was channel connections.

14:14:19 CB failed, tried to restart at 14:15:19, failed again, tried to restart at 14:16:29. LND logs: (I included a couple extra lines for context on this one.)

May 16 14:14:19 prodesk lnd[1058974]: 2023-05-16 14:14:19.452 [DBG] DISC: Processing ChannelUpdate: peer=<my_peer>, short_chan_id=<my_chan_id>,
May 16 14:14:19 prodesk lnd[1058974]: 2023-05-16 14:14:19.452 [DBG] DISC: Processing ChannelUpdate: peer=peer=<my_peer>, short_chan_id=<my_chan_id>,
May 16 14:14:19 prodesk lnd[1058974]: 2023-05-16 14:14:19.489 [ERR] RPCS: [/routerrpc.Router/HtlcInterceptor]: rpc error: code = Canceled desc = context canceled
May 16 14:14:19 prodesk lnd[1058974]: 2023-05-16 14:14:19.489 [DBG] RRPC: htlc event stream cancelled
May 16 14:14:19 prodesk lnd[1058974]: 2023-05-16 14:14:19.489 [ERR] RPCS: [/routerrpc.Router/SubscribeHtlcEvents]: context canceled
May 16 14:14:19 prodesk lnd[1058974]: 2023-05-16 14:14:19.489 [INF] HSWC: Interceptor disconnected, resolving held packets
May 16 14:14:19 prodesk lnd[1058974]: 2023-05-16 14:14:19.521 [DBG] BTCN: Attempting to connect to <failed node>@<failed node ip:port> (reqid xxx)

May 16 14:16:29 prodesk lnd[1058974]: 2023-05-16 14:16:29.938 [DBG] RPCS: [/lnrpc.Lightning/GetInfo] requested

I didn't see anything other than "PEER" lines in the LND logs at 14:15:19 / 14:15:29, and only the one line at 14:16:29.

I'm currently running circuitbreaker in a tmux session to see if that works better than this systemd service. Let me know if anything else from me would be helpful - very interested in getting this fixed!

flyerbear commented 1 year ago

Looks like I spoke too soon. It failed in tmux while I was writing up those logs. Here is the output from that session:

2023-05-16T15:08:43.173-0700    INFO    Stopping http server
2023-05-16T15:08:43.173-0700    INFO    Stopping grpc server
2023-05-16T15:08:43.174-0700    ERROR   Unexpected exit {"err": "rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
main.main
        /home/circuitbreaker/circuitbreaker/main.go:139
runtime.main
        /usr/local/go/src/runtime/proc.go:250

Not sure if this is related, but when circuitbreaker fails, I am unable to access lndg temporarily. I can't tell what brings it back - killing the service (which wasn't relevant in this case with tmux), or just giving it time.

joostjager commented 1 year ago

Thanks for the detailed investigation. My feeling is that there is a timeout somewhere on the stream connection. A robust fix would for CB to just reconnect if that happens rather than exit completely and forget the counters. But also I'm quite interested to find out what exactly causes the disconnect.

Will try to reproduce locally and see what I can do.

Interesting that you mention lndg. The htlc interceptor api of lnd can't handle multiple connections, but not sure if lndg uses it too. Is there anything useful in the lndg log?

flyerbear commented 1 year ago

I don't think lndg uses the htlc interceptor api since, as you mention, it can only handle one connection, and I know of many people who run both. I just dug through the lndg logs and there are some odd errors in there, but I couldn't find anything within ~10 minutes of the times I saw the errors in CB.

flyerbear commented 1 year ago

Ngu confirmed lndg reads the stream and not the interceptor.

flyerbear commented 1 year ago

Continuing to investigate this, this morning I increased fees significantly to try to stop all traffic through the node, and enabled circuitbreaker again. It ran without issue for ~12 hours. I'm going to try lowering fees again and see how it does with htlcs flowing through the node.

flyerbear commented 1 year ago

Looking at the logs, it seems like sometimes systemd can recover CB with a restart, and sometimes it cannot. When circuitbreaker fails, I start to get a huge buildup of incoming htlcs (but not outgoing).

I don't really understand why the auto restart=always in my systemd service can't always recover CB, but a manual sudo systemctl restart circuitbreaker can. Is there a difference in how those two restarts operate? On the last one, the circuitbreaker logs show it restarting one minute after failure, getting through "New peer notification received" for all of my peers, and then no further entries until my manual restart over 6 hours later.

joostjager commented 1 year ago

I am not sure why it doesn't restart.

As for the connection failure, you could try to run with this diff for more debug info: https://github.com/lightningequipment/circuitbreaker/pull/75

joostjager commented 1 year ago

On my machine (mac m1) with lnd and cb running locally (no docker), I didn't have a single disconnect in 24hr. So can't reproduce unfortunately to get those detailed logs.

flyerbear commented 1 year ago

I'm really sorry it took so long to reply on this thread. The issue for me turned out to be hardware related. I am running a mirror zpool of two disks as my data store and one of them was extremely poor performing. It wasn't malfunctioning, just cheap. When enough writes built up and needed to be flushed from the very small SLC, write speed slowed to a crawl and disk utilization and iowait skyrocketed for about 20-30 minutes while the disk tried to catch up. Interestingly cb seemed to be there only casualty of the phenomenon.

I've since replaced the ssd with a better one and haven't had a problem since.

joostjager commented 1 year ago

Thanks for the update @flyerbear

lightningequipment / circuitbreaker

RPC Connectivity issues runnning CB in docker container #70