Closed KnockOnWoodNode closed 11 months ago
I would recommend updating to v0.17 (or v0.17.1, which should be out soon, if you want to wait), as issues around SQLite and the p2p code were fixed. Afterwards it will be easier to debug the issue if it still exists.
0.17 contains a few things that I hear have not been fixed, so I'd rather wait until I see no more hiccups reported by plebs; upgrading could open a completely new can of worms while not assuring a solution to the previous one :sweat_smile:
It would be great if you could encourage other operators to report those problems here as well, if they haven't been reported yet. Could you provide some goroutine dumps as described here? These files would be helpful:
curl http://localhost:PORT/debug/pprof/goroutine?debug=0 > goroutine_0.prof
curl http://localhost:PORT/debug/pprof/goroutine?debug=2 > goroutine_2.prof
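For reference, the pprof endpoints above are only served if profiling is enabled in lnd. A minimal lnd.conf fragment to do that might look like this (the port number 9736 is just an example; substitute it for PORT in the curl commands):

```ini
[Application Options]
; serve Go pprof debug endpoints on this port (example port)
profile=9736
```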
I remember people talking about ungraceful shutdowns in 0.17 and possibly other things in the plebnet groups, but cannot pinpoint anything precise, I was just left with the knowledge of "0.17=unsafe to use yet".
Regarding the profiling, I have the port enabled already, but I guess it would only be useful when the issue presents itself again, right?
"0.17=unsafe to use yet".
This is an overreaction IMO. There is nothing inherently "unsafe" with the 0.17 release.
Okay, just got the issue again.
This was after removing the "connecting to many more peers" part, but keeping numgraphsyncpeers=200, which meant having 51 active sync peers. So this also happens with more active sync peers.
I have run the profiling commands you asked for and just restarted with numgraphsyncpeers=10 to see if it happens again; let me know if you also want profiling reports for that.
I am sorry if I caused any hard feelings; that's just my MO with lnd versions, and I know now that several plebs are running on it. I will consider 0.17.1 later though.
#8146 is likely unrelated to this issue, but there are some goroutines blocked on WaitForDisconnect, where some fixes have been made with v0.17, see for example https://github.com/lightningnetwork/lnd/pull/7856. So I'd suggest waiting until you are running v0.17.1 and then seeing if the issue still persists.
While you couldn't convince me to upgrade, a couple of other plebs succeeded, so I've been running 0.17 for a couple of days now. Nothing weird has happened yet, but I'm back to a normal number of peers and 3 active sync peers, and I'll be increasing those very soon, I believe.
Too early to say it's fixed anyway; in the past I've run for even a week before noticing this issue.
This was after removing the "connecting to many more peers" part, but keeping numgraphsyncpeers=200 which meant having 51 active sync peers. So this also happens with more active sync peers.
You don't need to sync gossip from 200 sync peers; it won't help much, and we use fewer peers to reduce the amount of actual gossip bandwidth. It's unclear what your actual issue is; peers may disconnect for various reasons, so that doesn't appear to be abnormal.
If you have 200 gossip sync peers, and they're all sending you data, but processing is slow for whatever reason, then the processing queue may fill up, the peers block on writing, and they eventually disconnect.
Looking at the logs, I don't see anything out of the ordinary, but you may want to move PEER and DISC to info so the logs have a bit less spam.
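As a sketch, the subsystem log levels could be set in lnd.conf like this (assuming the PEER and DISC subsystem names match your version):

```ini
[Application Options]
; keep debug logging overall, but quiet the chatty PEER and DISC subsystems
debuglevel=debug,PEER=info,DISC=info
```

The same change can be applied at runtime without a restart via `lncli debuglevel --level=PEER=info,DISC=info`.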
The 200 active sync peers, or even the 47 I've been on for a while, didn't look like a determining factor in the issue, because after I went back to the default 3 and stopped increasing the number of peers, I encountered the same problem after a couple of days. Since I have a 12-core Xeon I don't think processing load is a factor either, but I took the plunge into 0.17 and will be able to report on that now.
Processing load may not be a factor, but I/O may very well be.
But can you describe the problem you perceive more precisely? That you have 300 peers, most of which you don't have channels open with, and they disconnect from time to time?
That's totally normal: connections aren't persistent unless you have a channel with them. If you weren't able to keep connections to peers you had a channel with, then that would actually be something to look into. Your channel peers are the only peers that really matter.
Having more connections to other peers you don't have channels with doesn't necessarily help your node, and doesn't help you route any more effectively.
The core of the issue is described in the first paragraph of the OP (I am on 0.17 now and still have to reproduce it; on 0.16.4 it was like clockwork that it would happen after some time). Several channel peers (as opposed to non-channel peers, for which disconnecting after a while is normal) would get randomly disconnected and reconnect after a few minutes with the error in the OP title, and this "signaled" that the issue had started, as all forwarding activity on my node stopped from the first disconnections until I restarted lnd. One of my channel peers, who tracks these stats, reported a sudden high spike of pending outgoing HTLCs to me that slowly receded, while no HTLCs were actually getting through to me. This happened every 2 days in the last week, and at an interval of about 1-2 weeks before then.
During these last days I have also had another event repeat itself twice, which I can't bundle with logs, where I had over 100 pending incoming HTLCs (and just a handful of outgoing ones) which wouldn't resolve until I restarted lnd.
In the meantime I have upgraded and been running 0.17 for 2.5 days now with no issue, and I just restarted after having increased the number of active sync peers, to test whether the new version still has the same issue.
Hi @KnockOnWoodNode is this still an issue after upgrading to 0.17?
Looks like it is not, so far; I've been running with many active sync peers and no funny business. I wouldn't say it's completely behind us yet, but it's looking good.
Thanks for reporting back @KnockOnWoodNode closing this issue now. Please reopen if you observe the issue again and we can investigate.
Background
It will happen, after some amount of time, that my node starts having channels to several peers deactivated, one after the other, only to be reactivated within a few minutes at most. At the same time, it becomes unable to route HTLCs: I see no more forwards being notified by the bos bot, and one of my peers who tracks these things reports a spike in pending outgoing HTLCs from their node to mine whenever this happens, which slowly resolve themselves by failing. Restarting lnd solves the issue, until the next time this happens.
I couldn't form solid hypotheses about why this happens, but here are all the details I can provide, so that maybe you have some ideas of your own. I run the sqlite backend, and increased the timeout to 10m to avoid SQLITE_BUSY errors.
I don't remember this error happening before, but I am 90% sure it started after I began connecting to more peers than just my direct channel ones, to get gossip updates from the network faster (this was before I knew about active and passive sync peers; at first I was connecting to many peers which were all passive, and I later caught up and increased my active peers value, but none of this seems to have had any influence on the issue). What I seemed to notice, other than this problem arising after I increased the number of peers I connect to, is that the more peers I have, the sooner it happens. Using persistent connections or not doesn't appear to change anything.
I attached a log for one node, picked among the ones my node detected disconnections to this last time. I had increased the PEER loglevel to debug and zgrep'd the logs for its pubkey. I have since restored the info loglevel for everything.
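The sqlite timeout mentioned above can be configured in lnd.conf; a sketch of such a fragment follows. The option names here are assumptions (check the sample-lnd.conf shipped with your release for the exact spelling):

```ini
; assumed option names for the sqlite backend and its busy timeout
db.backend=sqlite
db.sqlite.busytimeout=10m
```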
I have disabled, for the time being, my script that connects to more peers, to be able to report what happens in the upcoming days.
rocket.log
Your environment
lnd version: 0.16.4
uname -a (on *Nix): 5.10.0-26-amd64 #1 SMP Debian 5.10.197-1 (2023-09-29) x86_64 GNU/Linux
btcd, bitcoind, or other backend version: 24.1
Steps to reproduce
Have the sqlite backend (no idea if that's necessary), have an active routing node with 40-something channels, and connect to many peers (above 300 for a faster mishap) with
lncli connect <pubkey>@<address>:<port>
Expected behaviour
lnd continues operating normally, managing forwards like a champ
Actual behaviour
channels are disconnected at random, HTLCs are not being processed