Open guggero opened 3 months ago
I've created this naive fix which solves the issue in the integration test: https://github.com/lightningnetwork/lnd/pull/8871 But it sounds like it's mostly fighting a symptom instead of fixing the root cause. But I'm not super familiar with the timings and dependency management of gossip stuff, so looking for other opinions.
cc @ellemouton, @Crypt-iQ
what are these services?
2024-06-26 14:22:35.524 [TRC] TCHN: Request to handle: *lnwire.ChannelUpdate
2024-06-26 14:22:35.524 [TRC] TCHN: Request to handle: 258
2024-06-26 14:22:35.524 [DBG] TCHN: Failed to handle: *lnwire.ChannelUpdate
2024-06-26 14:22:35.524 [DBG] PFSM: MsgRouter: unable to route msg protofsm.PeerMsg
Did some analysis on the logs, here's what happened.
Suppose Alice opens a channel with Bob, the normal flow, from Alice's side,
channel_ready
, handle it in handleChannelReadyReceived
, that calls addToRouterGraph
.channel_announcement
to herself -> creates the node and channel data in db.channel_update
to herself -> creates the edge in router
and,channel_update
to Bob.node_announcement
to Bob.For Bob he would do the same. However, there's no guarantee the following won't happen,
channel_ready
, finishes the sending of channel_announcement
and channel_update
to himself, and sends channel_update
and node_announcement
to Alice.channel_update
while still in the process of handling channel_ready
, in other words, there's no node info in her db.channel_update
to be processed later, as fixed in #6054.which is the case shown in the logs,
===> Alice received Bob's ChannelUpdate
2024-06-26 14:22:35.524 [DBG] PEER: Peer(02333e1d916b7d38b1a920eea1662c8e0723eb5f2604bd51d31fab9a97dda2775a): Received ChannelUpdate(chain_hash=0f9188f13cb7b2c71f2a335e3a4fc328bf5beb436012afca590b1a11466e2206, short_chan_id=509073883725824, mflags=00000001, cflags=00000000, update_time=2024-06-26 14:22:35 +0200 CEST) from 02333e1d916b7d38b1a920eea1662c8e0723eb5f2604bd51d31fab9a97dda2775a@127.0.0.1:11702
===> Alice received Bob's NodeAnnouncement
2024-06-26 14:22:35.525 [DBG] PEER: Peer(02333e1d916b7d38b1a920eea1662c8e0723eb5f2604bd51d31fab9a97dda2775a): Received NodeAnnouncement(node=02333e1d916b7d38b1a920eea1662c8e0723eb5f2604bd51d31fab9a97dda2775a, update_time=2024-06-26 14:22:35 +0200 CEST) from 02333e1d916b7d38b1a920eea1662c8e0723eb5f2604bd51d31fab9a97dda2775a@127.0.0.1:11702
===> Alice received her own ChannelAnnouncement
2024-06-26 14:22:35.529 [DBG] DISC: Processing ChannelAnnouncement: peer=<nil>, short_chan_id=509073883725824
===> Alice received her own ChannelUpdate
2024-06-26 14:22:35.531 [DBG] DISC: Processing ChannelUpdate: peer=<nil>, short_chan_id=509073883725824,
Alice should have done the same for node_announcement
, which I though it was fixed in #7264 but apparently not.
So to properly fix it, we can,
node_announcement
after we've received a channel_update
, which is fixed from Bob's side. But this doesn't completely fix it as we don't know what other implementations would do.node_announcement
like how we handle early channel_update
.
Background
While running integration tests in the
0-19-staging
branch oflitd
, I noticed that sometimes we don't have a node announcement for a direct channel peer.I tracked it down to these log lines:
This is after we've opened an unannounced channel to that node. And it doesn't happen all the time, so it must be a timing issue.
It's pretty weird that we apparently haven't seen this issue in vanilla
lnd
yet, or at least it doesn't seem like this happens in the itests there. Or maybe this is an existing flake that we just haven't noticed yet?Here's the full log (
litd
in integrated mode with custom channel logic): 3-test_custom_channels_force_close-Charlie-03df097b.log