lightningnetwork / lnd

Lightning Network Daemon ⚡️
MIT License

LND peer fails to send revocation ack, channels stuck offline #6277

Open viaj3ro opened 2 years ago

viaj3ro commented 2 years ago

I'm running eclair 0.6.2

I have several channels with the bluewallet node: 682798x821x1, 698337x1000x1, 715719x2221x1

All channels are offline and upon reconnecting, I receive this:

2022-01-30 23:11:05,863 WARN  f.a.e.channel.Channel n:037cc5f9f1da20ac0d60e83989729a204a33cc2d8e80438969fadf35c1c5f1233b c:af5de2f0ee565aa13c84a42bfbbc26513933e27fc7b1fbde0314ae62f107d4eb - waited for too long for a revocation to remoteCommitNumber=12581, disconnecting
2022-01-30 23:11:05,863 INFO  fr.acinq.eclair.io.Peer CON n:037cc5f9f1da20ac0d60e83989729a204a33cc2d8e80438969fadf35c1c5f1233b - disconnecting
2022-01-30 23:11:05,863 INFO  f.a.e.io.PeerConnection CON n:037cc5f9f1da20ac0d60e83989729a204a33cc2d8e80438969fadf35c1c5f1233b - stopping with reason=UserRequest
2022-01-30 23:11:05,863 INFO  fr.acinq.eclair.io.Peer CON n:037cc5f9f1da20ac0d60e83989729a204a33cc2d8e80438969fadf35c1c5f1233b - connection lost
2022-01-30 23:11:05,863 INFO  f.a.e.c.TransportHandler CON n:037cc5f9f1da20ac0d60e83989729a204a33cc2d8e80438969fadf35c1c5f1233b - stopping (unackedReceived= unackedSent=None)
2022-01-30 23:11:05,863 INFO  f.a.e.i.ReconnectionTask CON n:037cc5f9f1da20ac0d60e83989729a204a33cc2d8e80438969fadf35c1c5f1233b - peer got disconnected shortly after connection was established, next reconnection in 3600 seconds

The issue seems to be this line: waited for too long for a revocation to remoteCommitNumber=12581, disconnecting

I've raised the issue in the bluewallet repo, but according to them it has nothing to do with their node specifically and is an LND issue instead.

Crypt-iQ commented 2 years ago

After lnd tries to sign a new commitment and can't because one is already pending, it waits 1 minute before stopping further updates. So from lnd's PoV, eclair isn't sending a revoke_and_ack in a timely manner. Maybe it can be fixed by disconnecting and reconnecting.

viaj3ro commented 2 years ago

> Maybe it can be fixed by disconnecting and reconnecting

Tried that, no luck. Reconnecting works, but only for a short time. Channel closure also fails.

Crypt-iQ commented 2 years ago

Ah, my bad, it seems like lnd is not sending back a revoke_and_ack here. That's pretty odd. I think we'd need more logs, probably from the lnd side, to diagnose this.

viaj3ro commented 2 years ago

I asked them to provide logs. I hope they respond.

Crypt-iQ commented 2 years ago

cc @t-bast what deadline for receiving revoke_and_ack does eclair use?

t-bast commented 2 years ago

This is configurable by node operators, the default is 20 seconds: https://github.com/ACINQ/eclair/blob/be78e0ca578726015bd32e45bd0310415061843c/eclair-core/src/main/resources/reference.conf#L113

If eclair is expecting you to send a revoke_and_ack, it will disconnect after this timeout if it doesn't receive anything, hoping that you'll send it again after channel_reestablish.
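For readers following along, the timeout behaviour described above can be sketched roughly as follows. This is only a Python illustration, not eclair's actual (Scala) code: `wait_for_revocation`, `slow_peer`, and `simulate` are hypothetical names, and the delays are scaled down to fractions of a second so the example runs quickly.

```python
import asyncio


async def slow_peer(delay: float) -> str:
    """Hypothetical stand-in for a remote peer that needs `delay` seconds to reply."""
    await asyncio.sleep(delay)
    return "revoke_and_ack"


async def wait_for_revocation(peer_reply, timeout: float) -> str:
    """Wait up to `timeout` seconds for the peer's reply.

    Returns "revoked" if the reply arrives in time, otherwise "disconnect",
    mirroring the idea of dropping the connection and hoping the peer
    resends after channel_reestablish.
    """
    try:
        await asyncio.wait_for(peer_reply, timeout)
        return "revoked"
    except asyncio.TimeoutError:
        return "disconnect"


def simulate(peer_delay: float, timeout: float) -> str:
    """Run one exchange with the given peer delay and timeout (both in seconds)."""
    return asyncio.run(wait_for_revocation(slow_peer(peer_delay), timeout))


if __name__ == "__main__":
    print(simulate(peer_delay=0.05, timeout=0.2))  # peer answers within the timeout
    print(simulate(peer_delay=0.3, timeout=0.1))   # peer is slower than the timeout
```

In this sketch a peer that answers within the timeout keeps the connection, while a slower peer triggers a disconnect. Raising the timeout only helps if the peer does eventually reply.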

Crypt-iQ commented 2 years ago

Thanks @t-bast.

@viaj3ro After eclair reconnects to lnd, do the logs show the issue persisting for the channel?

viaj3ro commented 2 years ago

Yes, the issue still persists for the two remaining channels. One has been force-closed in the meantime (apparently due to a timed-out HTLC).

Crypt-iQ commented 2 years ago

I'm not sure we can do anything without knowing why the lnd node is slow to respond, meaning we'd need logs. @BlueWallet would you be willing to provide logs to somebody on our team?

DerEwige commented 1 year ago

Had the same problem today with this node: https://amboss.space/node/035b710d2c8d25993ddcbe031c0fee5434d5e6fb7a6ed5d438404daa99199537b8

Increased my timeout to 40s and finally managed to connect and close the channel. Looks like the node runs on a very low-power machine and just cannot keep up with the load.