lightningnetwork / lnd

Lightning Network Daemon ⚡️

[bug]: `unable to read message from peer: EOF`, disconnections and routing stops #8125

Closed · KnockOnWoodNode closed this issue 11 months ago

KnockOnWoodNode commented 11 months ago

Background

After some amount of time, my node starts having channels to several peers deactivated one after the other, only to be reactivated within a few minutes at most. At the same time, it becomes unable to route HTLCs: I see no more forwards being notified by the bos bot, and one of my peers who tracks these things reports a spike in pending outgoing HTLCs from their node to mine whenever this happens, which slowly resolve themselves by failing. Restarting lnd solves the issue, until the next time it happens.

I couldn't form solid hypotheses about why this happens, but here are all the details I can provide, so maybe you have some ideas of your own. I run the sqlite backend and increased the busy timeout to 10m to avoid SQLITE_BUSY errors. I don't remember this error happening before, but I am 90% sure it started after I began connecting to more peers beyond my direct channel peers, in order to get gossip updates from the network faster. (This was before I knew about active and passive sync peers, so at first all of those extra peers were passive; I later caught up and increased my active sync peers value, but none of this seems to have had any influence on the issue.) Besides the problem arising after I increased the number of peers I connect to, I also seemed to notice that the more peers I have, the sooner it happens. Using persistent connections or not doesn't appear to change anything. I attached a log for one node, picked among the ones my node detected disconnections from this last time. I had increased the PEER log level to debug and zgrep'd the logs for its pubkey. I have since restored the info log level for everything.
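
For reference, the startup flags matching the setup described above would look roughly like this. This is only a sketch: the SQLite busy-timeout flag name is assumed from memory, and the profile port and sync-peer count are illustrative, so check `lnd --help` for your version.

```sh
# Sketch of lnd startup flags for the setup described above.
# NOTE: the SQLite busy-timeout flag name is assumed; the profile port and
# numgraphsyncpeers value are illustrative. Verify with `lnd --help`.
lnd \
  --db.backend=sqlite \
  --db.sqlite.busytimeout=10m \
  --numgraphsyncpeers=3 \
  --profile=9736
```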

I have disabled, for the time being, my script that connects to extra peers, to be able to report what happens in the upcoming days.

rocket.log

Your environment

Steps to reproduce

Run the sqlite backend (no idea if it's necessary), have an active routing node with 40-something channels, and connect to many peers (above 300 for a faster mishap) with `lncli connect <pubkey>@<address>:<port>`.
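
As an illustration of the "connect to many peers" step, a loop like this is enough (a sketch; `peers.txt` is a hypothetical file with one `<pubkey>@<address>:<port>` entry per line):

```sh
#!/usr/bin/env bash
# Sketch: connect to a list of extra peers, one `lncli connect` call per line.
# peers.txt is a hypothetical file with one <pubkey>@<address>:<port> per line.
while read -r peer; do
  lncli connect "$peer" || echo "failed to connect to $peer"
done < peers.txt
```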

Expected behaviour

lnd continues operating normally, managing forwards like a champ

Actual behaviour

Channels are disconnected at random and HTLCs are not being processed.

bitromortac commented 11 months ago

I would recommend updating to v0.17 (or v0.17.1, which should be out soon, if you want to wait), as issues around SQLite and the p2p code were fixed. Afterwards it will be easier to debug the issue if it still exists.

KnockOnWoodNode commented 11 months ago

> I would recommend updating to v0.17 (or v0.17.1, which should be out soon, if you want to wait), as issues around SQLite and the p2p code were fixed. Afterwards it will be easier to debug the issue if it still exists.

0.17 contains a few things that I hear have not been fixed, so I'd rather wait until I see no more hiccups reported by plebs; upgrading could open a completely new can of worms while not guaranteeing a solution to the previous one :sweat_smile:

bitromortac commented 11 months ago

It would be great if you could encourage other operators to report those problems here as well, if they haven't been reported yet. Could you provide some goroutine dumps as described here? These files would be helpful: `curl http://localhost:PORT/debug/pprof/goroutine?debug=0 > goroutine_0.prof` and `curl http://localhost:PORT/debug/pprof/goroutine?debug=2 > goroutine_2.prof`
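
For example, assuming lnd was started with the profile option enabled (say `--profile=9736`; the port is illustrative), the two dumps can be captured like this:

```sh
# Capture the two goroutine dumps from lnd's pprof endpoint.
# Assumes lnd was started with --profile=9736; the port is illustrative.
curl http://localhost:9736/debug/pprof/goroutine?debug=0 > goroutine_0.prof
curl http://localhost:9736/debug/pprof/goroutine?debug=2 > goroutine_2.prof
```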

KnockOnWoodNode commented 11 months ago

> It would be great if you could encourage other operators to report those problems here as well, if they haven't been reported yet. Could you provide some goroutine dumps as described here? These files would be helpful: `curl http://localhost:PORT/debug/pprof/goroutine?debug=0 > goroutine_0.prof` and `curl http://localhost:PORT/debug/pprof/goroutine?debug=2 > goroutine_2.prof`

I remember people talking about ungraceful shutdowns in 0.17, and possibly other things, in the plebnet groups, but I cannot pinpoint anything precise; I was just left with the impression that "0.17 = unsafe to use yet".

Regarding the profiling, I already have the port enabled, but I guess it would only be useful once the issue presents itself again, right?

saubyk commented 11 months ago

"0.17=unsafe to use yet".

This is an overreaction, IMO. There is nothing inherently "unsafe" about the 0.17 release.

KnockOnWoodNode commented 11 months ago

> It would be great if you could encourage other operators to report those problems here as well, if they haven't been reported yet. Could you provide some goroutine dumps as described here? These files would be helpful: `curl http://localhost:PORT/debug/pprof/goroutine?debug=0 > goroutine_0.prof` and `curl http://localhost:PORT/debug/pprof/goroutine?debug=2 > goroutine_2.prof`

Okay, just got the issue again.

This was after removing the "connecting to many more peers" part but keeping `numgraphsyncpeers=200`, which meant having 51 active sync peers. So this also happens with more active sync peers.

I have run the profiling commands you asked for and just restarted with `numgraphsyncpeers=10` to see if it happens again; let me know if you want profiling reports for that as well.

goroutine_.zip
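
For context, one way to count how many peers are currently active sync peers (a sketch; assumes jq is installed and the `sync_type` field reported by `lncli listpeers`):

```sh
# Count peers whose gossip sync type is ACTIVE_SYNC (requires jq).
lncli listpeers | jq '[.peers[] | select(.sync_type == "ACTIVE_SYNC")] | length'
```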

KnockOnWoodNode commented 11 months ago

"0.17=unsafe to use yet".

This is an overreaction IMO. There is nothing inherently "unsafe" with 0.17 release.

I am sorry if I caused any hard feelings; that's just my MO with lnd versions, and I know now that several plebs are running it. I will consider 0.17.1 later though.

bitromortac commented 11 months ago

#8146 is likely unrelated to this issue, but there are some goroutines blocked on WaitForDisconnect, where some fixes were made in v0.17, see for example https://github.com/lightningnetwork/lnd/pull/7856. So I'd suggest waiting until you're running v0.17.1 and then seeing if the issue still persists.
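
For anyone digging into the dumps, a quick way to count how many goroutines are parked in WaitForDisconnect in the debug=2 dump (a sketch, using the file name from the commands above):

```sh
# Count goroutines whose stacks mention WaitForDisconnect in the full dump.
grep -c 'WaitForDisconnect' goroutine_2.prof
```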

KnockOnWoodNode commented 11 months ago

> #8146 is likely unrelated to this issue, but there are some goroutines blocked on WaitForDisconnect, where some fixes were made in v0.17, see for example #7856. So I'd suggest waiting until you're running v0.17.1 and then seeing if the issue still persists.

Where you couldn't convince me to upgrade, a couple of other plebs succeeded, so I've been running 0.17 for a couple of days now. Nothing weird has happened yet, but I'm back to a normal number of peers and 3 active sync peers, and I believe I'll be increasing those very soon.

It's too early to say it's fixed anyway; in the past I've run for up to a week before noticing this issue.

Roasbeef commented 11 months ago

> This was after removing the "connecting to many more peers" part but keeping `numgraphsyncpeers=200`, which meant having 51 active sync peers. So this also happens with more active sync peers.

You don't need to sync gossip from 200 sync peers; it won't help much, and we use fewer peers to reduce the amount of actual gossip bandwidth. It's unclear what your actual issue is: peers may disconnect for various reasons, so that doesn't appear to be abnormal.

If you have 200 gossip sync peers and they're all sending you data, but processing is slow for whatever reason, then the processing queue may fill up, the peers get blocked on writing, and they eventually disconnect.

Looking at the logs, I don't see anything out of the ordinary, but you may want to move PEER and DISC back to info so the logs have a bit less spam.
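
For reference, log levels can be adjusted at runtime with `lncli debuglevel`; a sketch that sets the global level to info and quiets those two subsystems (the `subsystem=level` syntax is per `lncli debuglevel --help`):

```sh
# Set the global log level to info and quiet the PEER and DISC subsystems
# at runtime, without restarting lnd.
lncli debuglevel --level=info,PEER=info,DISC=info
```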

KnockOnWoodNode commented 11 months ago

> > This was after removing the "connecting to many more peers" part but keeping `numgraphsyncpeers=200`, which meant having 51 active sync peers. So this also happens with more active sync peers.
>
> You don't need to sync gossip from 200 sync peers; it won't help much, and we use fewer peers to reduce the amount of actual gossip bandwidth. It's unclear what your actual issue is: peers may disconnect for various reasons, so that doesn't appear to be abnormal.
>
> If you have 200 gossip sync peers and they're all sending you data, but processing is slow for whatever reason, then the processing queue may fill up, the peers get blocked on writing, and they eventually disconnect.
>
> Looking at the logs, I don't see anything out of the ordinary, but you may want to move PEER and DISC back to info so the logs have a bit less spam.

The 200 active sync peers, or even the 47 I've been on for a while, didn't look like a determining factor in the issue, because after I went back to the default 3 and stopped increasing the number of peers, I encountered the same problem after a couple of days. Since I have a 12-core Xeon, I don't think processing load is a factor either, but I took the plunge into 0.17 and will be able to report on that now.

Roasbeef commented 11 months ago

Processing load may not be a factor, but I/O may very well be.

But can you describe the problem you perceive more precisely? That you have 300 peers, most of which you don't have channels open with, and they disconnect from time to time?

That's totally normal: connections aren't persistent unless you have a channel with them. If you weren't able to keep connections to peers you had a channel with, then that would actually be something to look into. Your channel peers are the only peers that really matter.

Having more connections to other peers you don't have channels with doesn't necessarily help your node, and doesn't help you route any more effectively.
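
A quick way to check exactly that, i.e. whether any channel peer is currently not connected (a sketch; assumes jq is installed and the JSON field names reported by `lncli listchannels` and `lncli listpeers`):

```sh
# List pubkeys of channel peers that are not currently connected (requires jq).
comm -23 \
  <(lncli listchannels | jq -r '.channels[].remote_pubkey' | sort -u) \
  <(lncli listpeers | jq -r '.peers[].pub_key' | sort -u)
```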

KnockOnWoodNode commented 11 months ago

> Processing load may not be a factor, but I/O may very well be.
>
> But can you describe the problem you perceive more precisely? That you have 300 peers, most of which you don't have channels open with, and they disconnect from time to time?
>
> That's totally normal: connections aren't persistent unless you have a channel with them. If you weren't able to keep connections to peers you had a channel with, then that would actually be something to look into. Your channel peers are the only peers that really matter.
>
> Having more connections to other peers you don't have channels with doesn't necessarily help your node, and doesn't help you route any more effectively.

The core of the issue is described in the first paragraph of the OP (I am on 0.17 now and still have to reproduce it; on 0.16.4 it was like clockwork that it would happen after some time). Several channel peers (as opposed to non-channel peers, which it's normal to see disconnect after a while) would get randomly disconnected and reconnect after a few minutes with the error in the OP title, and this "signaled" that the issue had started: all forwarding activity on my node stopped from the first disconnections until I restarted lnd. One of my channel peers, who tracks these stats, reported a sudden high spike of pending outgoing HTLCs towards me that slowly receded, all the while no HTLCs were actually getting through to me. This happened every 2 days in the last week, and at an interval of about 1-2 weeks before then. During these last days I have also had another event repeat itself twice, which I can't bundle with logs, where I had over 100 pending incoming HTLCs (and just a handful of outgoing ones) that wouldn't resolve until I restarted lnd.

In the meantime I have upgraded and have been running 0.17 for 2.5 days now with no issues, and I just restarted after increasing the number of active sync peers, to test whether the new version still has the same issue.
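
For the record, a quick way to watch for the pending-HTLC buildup described above (a sketch; assumes jq is installed and the `pending_htlcs` field reported by `lncli listchannels`):

```sh
# Sum pending HTLCs across all channels (requires jq).
lncli listchannels | jq '[.channels[].pending_htlcs | length] | add'
```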

saubyk commented 11 months ago

Hi @KnockOnWoodNode, is this still an issue after upgrading to 0.17?

KnockOnWoodNode commented 11 months ago

> Hi @KnockOnWoodNode, is this still an issue after upgrading to 0.17?

Looks like it is not, so far. I've been running with many active sync peers and no funny business. I wouldn't say it's completely behind us yet, but it's looking good.

saubyk commented 11 months ago

Thanks for reporting back @KnockOnWoodNode, closing this issue now. Please reopen if you observe the issue again and we can investigate.