lightningnetwork / lnd

Lightning Network Daemon ⚡️

[bug]: GossipSyncer should remove unreliable Peers from syncing the ChannelGraph #8889

Closed: ziggie1984 closed this issue 1 month ago

ziggie1984 commented 3 months ago

There are peers in the network with a very old ChannelGraph view, and when connected to them LND has to do a lot of unnecessary work to check the validity of their out-of-date graph data. Moreover, for pruned nodes this comes with a bandwidth burden: for channels whose block data the node no longer has, the blocks need to be fetched from peers, which is very bandwidth intensive, especially if you use a VPN service for your node that restricts bandwidth beyond a certain point.

I think we should first introduce some sanity checks: if our channel graph and the channel graph of a peer are too far apart, we should trust ourselves more and remove the peer from syncing the data. Probably we should also ban that peer at some point.
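
To make this concrete, here is a minimal sketch of such a sanity check (illustrative only, not lnd code; the function name and ratio parameter are made up): skip syncing from a peer whose claimed set of missing channels is wildly out of proportion to our own graph.

package gossipsanity

// shouldSkipPeerSync returns true when the number of channels a peer wants
// us to query for is disproportionate to the size of our own channel graph,
// in which case we trust our own view and drop the peer from graph syncing.
func shouldSkipPeerSync(ourChanCount, peerNewChans int, maxRatio float64) bool {
	// A fresh node with an empty graph has nothing to compare against.
	if ourChanCount == 0 {
		return false
	}
	return float64(peerNewChans) > maxRatio*float64(ourChanCount)
}

With any reasonable maxRatio, a peer advertising 164060 "new" channels (as in the log below) would be dropped, while the small deltas normal peers report would pass.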

This is how an out-of-date peer sync could look (from a mainnet node):

 DISC: GossipSyncer(02dfe525d9c5b4bb52a55aa3d67115fa4a6326599c686dbd1083cffe0f45c114f8): starting query for 164060 new chans
2024-06-29 11:20:06.456 [INF] DISC: GossipSyncer(02dfe525d9c5b4bb52a55aa3d67115fa4a6326599c686dbd1083cffe0f45c114f8): querying for 500 new channels
2024-06-29 11:20:06.717 [INF] DISC: GossipSyncer(02dfe525d9c5b4bb52a55aa3d67115fa4a6326599c686dbd1083cffe0f45c114f8): querying for 500 new channels
2024-06-29 11:20:06.894 [ERR] CRTR: process network updates got: unable to fetch utxo for chan_id=556369376388317185, chan_point=8450839336340be66e5e1bf02d6ab046d4e83b44d97f034f0b8da609e4f74df1:1: target output has been spent
2024-06-29 11:20:06.988 [ERR] CRTR: process network updates got: unable to fetch utxo for chan_id=556449640836104192, chan_point=6950a0ad8d4d3751f26afd8ab09c32e28b017e721503adefdf6d1f06ae2b19d4:0: target output has been spent
2024-06-29 11:20:07.210 [ERR] CRTR: process network updates got: unable to fetch utxo for chan_id=556484825143181312, chan_point=8ca07553aff9287badc5a792f85e6785aaccd8c8596b3d7664eb7499e9de67fb:0: target output has been spent
2024-06-29 11:20:07.567 [ERR] CRTR: process network updates got: unable to fetch utxo for chan_id=556816877593165824, chan_point=1451d36ff82bb6faa74eae09486b520eef1702a41273ea3c5e6785daaff2fd40:0: target output has been spent

....
Roasbeef commented 2 months ago

Re sanity checks, today we do have a reject cache here to ensure that we won't continue validating channels that we already deemed to be invalid.

Impa10r commented 2 months ago

Happened to me twice this week: LND quietly becomes unresponsive and the node must be rebooted. The logs are completely full of these errors. How do I find out which peer is causing this, so I can close the channel?

Ok, found the peer by filtering the logs for the "... new chans" lines with a large number (> 100k).

ziggie1984 commented 2 months ago

Try to find this log line:

GossipSyncer(XXX): starting query for ???? new chans

What you can do as a preliminary measure is the following:

Close the channel to this node (the node ID above), and also make sure you don't connect to this peer anymore by blocking its traffic, including inbound connections from this peer.

or

use specific peers for the active peer sync so that you don't run the risk of syncing with that particular node again.

Specify exactly the number of sync peers you have set up for active sync (default: numgraphsyncpeers=3):

[Application Options]

# Always sync from these peers:
gossip.pinned-syncers=<pubkey>
Impa10r commented 2 months ago

I identified a certain peer as the cause, so I limited my sync peers to Boltz, WoS and Bfx. All was fine for a few days, but today the problem returned. The LND log and all three archives are completely full of 'unable to fetch utxo', so I cannot find 'starting query for'...

Edit: Same when disconnecting Boltz or WoS. Does anyone know a reliable gossip peer that won't cause this mess?

Impa10r commented 2 months ago

use specific peers for the active peer sync

Well, this did not work. I pinned the peers, but I still see starting query for <large number> from other peers, some of which cause the above problem. LND 0.18.0...

My lnd.conf:

# The number of peers that we should receive new graph updates from. This option
# can be tuned to save bandwidth for light clients or routing nodes. 
numgraphsyncpeers=3

# Always sync from these peers:
# Boltz
gossip.pinned-syncers=026165850492521f4ac8abd9bd8088123446d126f648ca35e60f88177dc149ceb2
# WoS
gossip.pinned-syncers=035e4ff418fc8b5554c5d9eea66396c227bd429a3251c8cbc711002ba215bfc226
# Bfx
gossip.pinned-syncers=03cde60a6323f7122d5178255766e38114b4722ede08f7c9e0c5df9b912cc201d6

Log output:

$ docker logs lightning_lnd_1 --follow --tail 100000 | grep "starting query for"
2024-07-29 20:39:59.557 [INF] DISC: GossipSyncer(03d2d9b865127074632a6b8f8e67c977e9618b55ba0e7d9c005011549bd6ecc302): starting query for 38479 new chans
2024-07-29 20:40:02.081 [INF] DISC: GossipSyncer(026165850492521f4ac8abd9bd8088123446d126f648ca35e60f88177dc149ceb2): starting query for 2 new chans
2024-07-29 23:39:55.568 [INF] DISC: GossipSyncer(023369b071005d376bfa94b76d2ec52d01d6709f57d6b67bf26250f2ef57a66aea): starting query for 2 new chans
2024-07-30 01:39:55.935 [INF] DISC: GossipSyncer(0293bc8988a035126d3d343cd918c9f336567faf8cbe41d9df7574c8644895aa14): starting query for 78 new chans
2024-07-30 03:39:54.865 [INF] DISC: GossipSyncer(026d0169e8c220d8e789de1e7543f84b9041bbb3e819ab14b9824d37caa94f1eb2): starting query for 4 new chans
2024-07-30 05:39:56.089 [INF] DISC: GossipSyncer(0293bc8988a035126d3d343cd918c9f336567faf8cbe41d9df7574c8644895aa14): starting query for 78 new chans
2024-07-30 06:39:55.333 [INF] DISC: GossipSyncer(0201bfc5d9c14d50ebaf4dfd83d70df314ed2a9f0119f28619f3f000a1079ef580): starting query for 5 new chans
2024-07-30 07:39:53.836 [INF] DISC: GossipSyncer(03e86afe389d298f8f53a2f09fcc4d50cdd34e2fbd8f32cbd55583c596413705c2): starting query for 1 new chans
2024-07-30 09:39:54.518 [INF] DISC: GossipSyncer(03ac432467085ff175b84e39def012f616cdbf9164a6923d1bbfffe710f9c8ad3f): starting query for 1 new chans
Crypt-iQ commented 2 months ago

I identified a certain peer as the cause, so I limited my sync peers to Boltz, WoS and Bfx

Can you share the peer's NodeID? I want to connect to it and see what's happening.

ziggie1984 commented 2 months ago

The peer I identified while analysing the issue was this one: https://amboss.space/node/02dfe525d9c5b4bb52a55aa3d67115fa4a6326599c686dbd1083cffe0f45c114f8

Impa10r commented 2 months ago

Can you share the peer's NodeID? I want to connect to it and see what's happening.

Sure, my bad peers are: 023e09c43b215bd3dbf483bcb409da3322ea5ea3b046f74698b89ee9ea785dd30a 0309bf5fe006bd2b14d61126723481c6c685bda44efe093800908527dfe9cd8db7 02ce7abb5e625990f7898520bca721b7ae316165eef81133dcfe4068fcfbe37ac6 03d2d9b865127074632a6b8f8e67c977e9618b55ba0e7d9c005011549bd6ecc302

Impa10r commented 2 months ago

Two of my bad peers are running 0.18.0; the other two did not respond...

Crypt-iQ commented 2 months ago

@Impa10r what kind of hardware is your node running on?

Impa10r commented 2 months ago

@Impa10r what kind of hardware is your node running on?

Raspi4 8GB. It could be that this is not a new problem, but LND 0.18 is more CPU intensive.

dunkfan commented 2 months ago

I also saw this on a Raspi4 8gb. The pinned peers suggestion seems to have fixed it, running a few days now. I'm running 0.18.2

dunkfan commented 2 months ago

Happened again today. Needed a restart to recover.

2024-08-01 08:07:57.089 [INF] DISC: GossipSyncer(023e09c43b215bd3dbf483bcb409da3322ea5ea3b046f74698b89ee9ea785dd30a): querying for 500 new channels

The query appeared in a seemingly endless loop, 500 channels at a time.

Impa10r commented 2 months ago

I had 1101 from a different peer, but LND did not restart this time.

Crypt-iQ commented 2 months ago

Do any of you have zero-conf channels? @Impa10r @dunkfan

dunkfan commented 2 months ago

lncli listchannels shows zero_conf as false on all channels. I do have one private channel operational.

Impa10r commented 2 months ago

lncli listchannels shows zero_conf as false on all channels. I do have one private channel operational.

Private channels are normally not zero_conf. https://github.com/blckbx/NodeTools/blob/main/private-trusted-zero-conf-channels.md

dunkfan commented 1 month ago

listchannels reports private as true and zero_conf as false for that channel. I have since moved bitcoin core from 24.0.1 to 25.2. So far, one day with no sign of the issue. Will stay vigilant.

dunkfan commented 1 month ago

No, that made no difference; the same problem occurred twice today. It seems to be caused by the same peer node.

ziggie1984 commented 1 month ago

For all people struggling with this problem: please set the following config setting until a proper fix is out.

numgraphsyncpeers=0

This guarantees that only the pinned peers will be used for a historical scan of the channel graph, rather than some other random peers.
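
For reference, combined with the pinned syncers from the earlier comments this would look roughly like the following in lnd.conf (the pubkey is a placeholder):

[Application Options]
# Do not select random peers for active graph sync; rely on the pinned
# syncers only.
numgraphsyncpeers=0

# Always sync from these peers (peers you have a channel with):
gossip.pinned-syncers=<pubkey>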

Will keep you updated while working on a proper fix.

saubyk commented 1 month ago

This guarantees that only the pinned peers will be used for a historical scan of the channel graph, rather than some other random peers

By "pinned peers" you mean peers which a node has active channels with, correct?

ziggie1984 commented 1 month ago

By "pinned peers" you mean peers which a node has active channels with, correct?

Good question. Yes, only peers you have channels with should be used as pinned peers. Pinned peers can basically be selected via the lnd.conf file (theoretically it could be any peer, but the problem is that we wouldn't guarantee connectivity with that peer, so a peer we have a channel with should be used):

https://github.com/lightningnetwork/lnd/blob/2f2efc782436944ce98b1e0e13bde125951b2b36/sample-lnd.conf#L1620-L1637

ziggie1984 commented 1 month ago

What was the cause of the problem:

To end up in the described situation, several factors had to play out.

So one problem is that, in the reply_channel_range, we would reply with a future timestamp for channels we don't have a real ChanUpdate for, which would trigger the querying node to resurrect all its Zombie Channels and query them during the sync. This is because time.Time is not 0 when not initialized here: https://github.com/lightningnetwork/lnd/blob/c0420fe67de89cb28ac0abd8954208567109b5f9/channeldb/graph.go#L2287-L2289

rather it's 0xFFFFFFF1886E0900, or in Go's signed representation: -0xE7791F700.

When converted into uint32 we get 0x886E0900, which is a timestamp far in the future. See here:

https://github.com/lightningnetwork/lnd/blob/c0420fe67de89cb28ac0abd8954208567109b5f9/discovery/syncer.go#L1119-L1125

So basically all Zombie Channels will now be reactivated. It only takes one node in the network serving this kind of gossip msg (for example a neutrino node, which runs with the AssumeChannelValid flag).
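
To make the arithmetic concrete, here is a small standalone Go snippet (my own illustration, not lnd code) showing how the zero time.Time becomes a far-future uint32 timestamp:

package main

import (
	"fmt"
	"time"
)

func main() {
	// The zero value of time.Time is January 1, year 1, UTC.
	// Its Unix timestamp is a large negative number, not 0.
	var zero time.Time
	fmt.Println(zero.Unix())                 // -62135596800
	fmt.Printf("%#x\n", zero.Unix())         // -0xe7791f700
	fmt.Printf("%#x\n", uint64(zero.Unix())) // 0xfffffff1886e0900 (two's complement)

	// Truncating to uint32 keeps only the low 32 bits, which reads as a
	// timestamp in the year 2042, i.e. far in the future, so the querying
	// node treats the channel as having a fresh update.
	ts := uint32(zero.Unix())
	fmt.Printf("%#x -> %v\n", ts, time.Unix(int64(ts), 0).UTC())
}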

Potential Solution / Fix

Problem with the RejectCache:

The reject cache is a rotating in-memory cache with a default size of 50k entries. We saw in the log files that sometimes we receive way over 200k old channels we already deleted or forgot about. This high number of "new" chans makes the rejectCache basically useless because entries rotate away. We could increase its size, but I don't think we should go in that direction.
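
A simplified illustration of the rotation problem (this is not lnd's actual rejectCache implementation; the capacity and eviction policy are only for the example): once more channel IDs pass through than the cache can hold, earlier rejections are forgotten and the same channels get re-validated on the next sync.

package rejectsketch

// boundedRejectCache forgets its oldest entries once full, so flooding it
// with ~200k stale channel IDs evicts everything a 50k-entry cache learned.
type boundedRejectCache struct {
	capacity int
	entries  map[uint64]struct{}
	order    []uint64 // insertion order, used for eviction
}

func newBoundedRejectCache(capacity int) *boundedRejectCache {
	return &boundedRejectCache{
		capacity: capacity,
		entries:  make(map[uint64]struct{}, capacity),
	}
}

func (c *boundedRejectCache) add(chanID uint64) {
	if _, ok := c.entries[chanID]; ok {
		return
	}
	if len(c.order) >= c.capacity {
		// "Rotate away" the oldest rejected channel.
		oldest := c.order[0]
		c.order = c.order[1:]
		delete(c.entries, oldest)
	}
	c.entries[chanID] = struct{}{}
	c.order = append(c.order, chanID)
}

func (c *boundedRejectCache) contains(chanID uint64) bool {
	_, ok := c.entries[chanID]
	return ok
}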

So what I suggest implementing:

  1. Fix the timestamp bug.
  2. Make sure we don't use the timestamps in the reply_channel_range msg to resurrect zombie channels, because this can easily be circumvented.
  3. Make sure we also reject updates which are too far in the future, maybe also 2 weeks?
  4. Introduce a rate limiter when fetching the data from the peer.
  5. Introduce a ban list, which the user can extend manually (maybe via an RPC call) when a peer's gossip sync gets out of control, and in addition maybe ban peers via some heuristic. For example, remember the number of new channels we got from a peer when syncing for the first time; if this number is still the same when fetching historical data again, we can be sure something is off and we ban the peer (see the sketch below).
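
A rough sketch of the heuristic from point 5, assuming we track per-peer counts of "new" channels reported during historical syncs (all names here are hypothetical, not lnd APIs):

package gossipban

import "sync"

// Tracker remembers, per peer, how many "new" channels the peer reported
// on its first historical sync.
type Tracker struct {
	mu            sync.Mutex
	firstNewChans map[string]int
}

func NewTracker() *Tracker {
	return &Tracker{firstNewChans: make(map[string]int)}
}

// RecordHistoricalSync returns true if the peer looks suspicious: it keeps
// advertising the same large set of "new" channels on every historical
// sync, which suggests its graph never converges with ours. The caller
// could then ban or disconnect the peer.
func (t *Tracker) RecordHistoricalSync(peerPubKey string, newChans int) bool {
	const suspiciousThreshold = 10_000 // example value, not a tuned default

	t.mu.Lock()
	defer t.mu.Unlock()

	first, seen := t.firstNewChans[peerPubKey]
	if !seen {
		t.firstNewChans[peerPubKey] = newChans
		return false
	}
	return newChans >= suspiciousThreshold && newChans == first
}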

Looking forward to your evaluation.

Special thanks to @Crypt-iQ, who pointed me in the right direction when analysing this problem.

Crypt-iQ commented 1 month ago

So one problem is that we would reply in the reply_channel_range with a future timestamp for channels we would not have real ChanUpdate for so that would trigger the querying node to resurrect all its Zombie Channels and query them during the sync.

Great catch! I agree that reply_channel_range replies with bogus timestamps are part of the problem; that explains why I was seeing future timestamps. I think something else is happening that is causing these nodes to broadcast announcements for outdated channels. LND has the following behavior related to spent outputs and zombies:

Because LND deletes the edges and doesn't totally resurrect them by updating edgeIndexBucket, it seems that full nodes can't be serving these outdated announcements. Also, given the behavior above, non-neutrino LND nodes will do the following when receiving a reply_channel_range with bogus timestamp entries:

This points to the spam-serving nodes being neutrino nodes as they won't validate the ChannelAnnouncement due to AssumeChannelValid being true. If the spammy nodes are running neutrino, I'm not sure how they were first "infected" with old channels. I'll try fingerprinting the spammy nodes and see if they are running neutrino.

ziggie1984 commented 1 month ago

This points to the spam-serving nodes being neutrino nodes as they won't validate the ChannelAnnouncement due to AssumeChannelValid being true. If the spammy nodes are running neutrino, I'm not sure how they were first "infected" with old channels. I'll try fingerprinting the spammy nodes and see if they are running neutrino.

Very good analysis. I think running in neutrino mode for a short period of time and then switching back could also infect us with all those channels which are already closed. I think we would never get rid of them without dropping the chan-graph.

resurrects zombie channels upon receiving a ChannelUpdate with more recent timestamps via MarkEdgeLive ...

I think as long as we introduce the banman we can ignore the timestamp check in the reply_channel_range msg, as you mentioned. Otherwise I think it might serve at least as some kind of protection: for example, if we had not marked edges live again based on the timestamp, all of the channels that can't be validated would have ended up in the zombieIndex and hence never been requested again. So I think that without the ban protection, letting a peer resurrect all our zombie channels and send us all the ChanAnnouncements over and over again might be a vulnerability (in case the rejectCache remains the same size as it is today).

Crypt-iQ commented 1 month ago

It seems that these nodes are not neutrino nodes, so I'm not sure how they are storing old announcements...

ziggie1984 commented 1 month ago

So one other problem we have to think about is that those nodes are not pruning those old announcements. I think the culprit is that they also have no ChanUpdate for those announcements, hence no entry in the edgeUpdateIndexBucket, which means they will not be picked up by the pruneZombieChans job. I think we are missing an entry in this bucket when receiving a chan announcement in general here:

https://github.com/lightningnetwork/lnd/blob/0aced5ce2dfb618a0ee6dcec931212308085d8f5/channeldb/graph.go#L4404-L4416

We probably need to insert a dummy timestamp into the edgeUpdateIndexBucket (maybe the current timestamp) so that the pruning job also removes channels which don't receive any ChanUpdate within 2 weeks.
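
A rough sketch of that idea, assuming a bbolt-style index keyed by an 8-byte big-endian timestamp followed by the 8-byte channel ID (the bucket name and key layout here are illustrative, not necessarily lnd's exact schema):

package graphsketch

import (
	"encoding/binary"
	"time"

	bolt "go.etcd.io/bbolt"
)

var edgeUpdateIndexBucket = []byte("edge-update-index")

// markAnnouncementSeen writes a placeholder entry for a freshly received
// channel announcement using the current time, so that a channel which
// never receives a channel_update still shows up in the update index and
// can be garbage-collected by the zombie-pruning job after two weeks.
func markAnnouncementSeen(tx *bolt.Tx, chanID uint64) error {
	bucket, err := tx.CreateBucketIfNotExists(edgeUpdateIndexBucket)
	if err != nil {
		return err
	}

	var key [16]byte
	binary.BigEndian.PutUint64(key[:8], uint64(time.Now().Unix()))
	binary.BigEndian.PutUint64(key[8:], chanID)

	// The value is unused; the key alone encodes what we need.
	return bucket.Put(key[:], nil)
}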

HamishMacEwan commented 1 month ago

In case more examples are useful:

admin@myNode:~ $ sudo journalctl --no-page -u lnd.service | grep "starting query for"
Aug 24 18:48:31 myNode lnd[10853]: 2024-08-24 18:48:31.474 [INF] DISC: GossipSyncer(026ec3e3438308519a75ca4496822a6c1e229174fbcaadeeb174704c377112c331): starting query for 423 new chans
Aug 24 20:48:32 myNode lnd[10853]: 2024-08-24 20:48:32.949 [INF] DISC: GossipSyncer(03e691f81f08c56fa876cc4ef5c9e8b727bd682cf35605be25d48607a802526053): starting query for 1 new chans
Aug 24 23:48:37 myNode lnd[10853]: 2024-08-24 23:48:37.677 [INF] DISC: GossipSyncer(02f95c6f3498caf4f6811c5edc4ac9806ba345534a18391e94ae9f7af41104df18): starting query for 5 new chans
Aug 25 01:48:29 myNode lnd[10853]: 2024-08-25 01:48:29.672 [INF] DISC: GossipSyncer(03864ef025fde8fb587d989186ce6a4a186895ee44a926bfc370e2c366597a3f8f): starting query for 2 new chans
Aug 25 02:48:29 myNode lnd[10853]: 2024-08-25 02:48:29.779 [INF] DISC: GossipSyncer(03864ef025fde8fb587d989186ce6a4a186895ee44a926bfc370e2c366597a3f8f): starting query for 1 new chans
Aug 25 04:48:50 myNode lnd[10853]: 2024-08-25 04:48:50.125 [INF] DISC: GossipSyncer(0236e577964da670c730e7f362fa0c561b75dc9af02c0d822179f155e1455b20bd): starting query for 118268 new chans
Aug 25 05:48:33 myNode lnd[10853]: 2024-08-25 05:48:33.962 [INF] DISC: GossipSyncer(034ea80f8b148c750463546bd999bf7321a0e6dfc60aaf84bd0400a2e8d376c0d5): starting query for 53 new chans
Aug 25 06:48:31 myNode lnd[10853]: 2024-08-25 06:48:31.282 [INF] DISC: GossipSyncer(03864ef025fde8fb587d989186ce6a4a186895ee44a926bfc370e2c366597a3f8f): starting query for 2 new chans
admin@myNode:~ $

118268 new chans 🙄

The requesting node is a Voltage.cloud lite node of mine that has only the one channel and is running 0.18.2-beta

saubyk commented 1 month ago

Closed with https://github.com/lightningnetwork/lnd/pull/9009 and https://github.com/lightningnetwork/lnd/pull/9011