lightningnetwork / lnd

Lightning Network Daemon ⚡️
MIT License

[bug]: Consider failing HTLC backwards before upstream claims on-chain #7683

Open BhaagBoseDK opened 1 year ago

BhaagBoseDK commented 1 year ago

Background

Consider an HTLC chain

Peer A -> Peer B -> Offline Peer

Assume Peer B force closes on the offline peer because the HTLC is missing in the remote commitment upon expiry of the HTLC.

The force close transaction is stuck in the mempool for 144 blocks (the CLTV delta of Peer B).

Now, after 144 blocks, peer A will also force close on peer B, just because peer B has not failed the HTLC backward.

This causes a cascade of force closes in the current mempool conditions (especially with peers with shorter CLTV deltas).
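
For a rough sense of the timing, here is a minimal Go sketch of why the second force close fires exactly one CLTV delta after the first. The heights are made up for illustration; only the 144-block delta comes from this report.

```go
package main

import "fmt"

func main() {
	// Illustrative heights only; 144 is Peer B's CLTV delta from the report.
	const (
		outgoingExpiry = 800_000                     // expiry of the HTLC B offered downstream
		cltvDeltaB     = 144                         // Peer B's CLTV delta
		incomingExpiry = outgoingExpiry + cltvDeltaB // expiry of the HTLC A offered to B
	)

	fmt.Println("B force closes on the offline peer at height", outgoingExpiry)
	// If B's force close is still unconfirmed and the HTLC has not been
	// failed back by this height, A force closes on B: the cascade.
	fmt.Println("A force closes on B at height", incomingExpiry)
}
```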

There is a similar case with LDK -> https://github.com/lightningdevkit/rust-lightning/issues/2275

Logs: Peer B force closes on an offline peer after HTLC expiry.

/home/umbrel/lndlog/lnd.log.754.gz:2023-05-08 06:18:56.124 [INF] CNCT: ChannelArbitrator(0b228050fd8eeecf22073086a8885faf0c4d2bc02ad9480f23767368da411905:0): immediately failing htlc=3837313232636338363662613938653434353830343430613764383636646666 from remote commitment
/home/umbrel/lndlog/lnd.log.754.gz:2023-05-08 06:18:56.204 [INF] CNCT: ChannelArbitrator(0b228050fd8eeecf22073086a8885faf0c4d2bc02ad9480f23767368da411905:0): force closing chan

The force close transaction is still in the mempool. 144 blocks later, peer A also force closed in a cascade:

/home/umbrel/lndlog/lnd.log.754.gz:2023-05-09 07:17:35.021 [INF] CNCT: Unilateral close of ChannelPoint(4d8ef40c52865f007816a151510343bb89d8d36e2dd7e34edc4748a03027e087:0) detected
/home/umbrel/lndlog/lnd.log.754.gz:2023-05-09 07:17:35.048 [WRN] HSWC: ChannelLink(4d8ef40c52865f007816a151510343bb89d8d36e2dd7e34edc4748a03027e087:0): remote peer has closed on-chain
/home/umbrel/lndlog/lnd.log.754.gz:2023-05-09 07:17:35.050 [INF] HSWC: ChannelLink(4d8ef40c52865f007816a151510343bb89d8d36e2dd7e34edc4748a03027e087:0): exited
/home/umbrel/lndlog/lnd.log.754.gz:2023-05-09 07:17:35.050 [INF] CNCT: ChannelArbitrator(4d8ef40c52865f007816a151510343bb89d8d36e2dd7e34edc4748a03027e087:0): remote party has closed channel out on-chain

The second force close would have been prevented if the HTLC had been failed backward by peer B after the force close with the offline peer.

Your environment

Linux umbrel 5.10.17-v8+ #1421 SMP PREEMPT Thu May 27 14:01:37 BST 2021 aarch64 GNU/Linux

Steps to reproduce

See background.

Expected behaviour

When peer B force closes on the offline/forward peer, it should immediately fail the HTLC backward to prevent peer A from force closing.

Actual behaviour

Cascade of force closes down the chain.

ellemouton commented 1 year ago

The thing is that Peer B wants to make sure that they can claim the timeout path before they fail back the HTLC. Otherwise there is a chance that the offline peer comes back online just in time and then claims the success path. This would mean a loss of funds for peer B if they have already failed the HTLC back to peer A.

TheBlueMatt commented 1 year ago

Sure but if you're already out of time on the backwards path you run that risk anyway? We're thinking about this on the LDK end and I'm a bit torn but it does seem like the "common case" here is the cascading failure, not the attack, though it's possible that changes with package relay.

BhaagBoseDK commented 1 year ago

the offline peer should only be able to claim until the HTLC expiry. It should have no bearing on when the FC transaction is confirmed on-chain.

ellemouton commented 1 year ago

> Sure but if you're already out of time on the backwards path you run that risk anyway?

Ah, that is a good point.

> the offline peer should only be able to claim until the HTLC expiry.

Unfortunately that is not possible to enforce with Bitcoin Script. After the HTLC expiry, the output becomes a free-for-all if the preimage is known.
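
To make the point concrete, here is a small illustrative model (not BOLT #3 script evaluation; the names are made up) of which spend paths of an offered HTLC output are available at a given height. The success path has no CLTV at all, so knowledge of the preimage keeps it spendable even after the timeout path opens up.

```go
package main

import "fmt"

// spendPaths models the two claims on an offered HTLC output. Illustrative
// only: the success path needs just the preimage, the timeout path needs the
// CLTV to have passed, and nothing ever turns the success path off.
func spendPaths(height, htlcExpiry uint32, preimageKnown bool) []string {
	var paths []string
	if preimageKnown {
		paths = append(paths, "success (preimage, no height restriction)")
	}
	if height >= htlcExpiry {
		paths = append(paths, "timeout (CLTV expired)")
	}
	return paths
}

func main() {
	// After expiry, both paths are live if the remote learned the preimage,
	// so the two sides simply race on-chain.
	fmt.Println(spendPaths(800_150, 800_000, true))
}
```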

BhaagBoseDK commented 1 year ago

Another point is this specific case of "missing HTLC in remote commitment". In the FC from peer B -> offline peer, the HTLC was not in the remote commitment (therefore there is no possibility for the offline peer to come back and claim it later). The FC transaction 8f58a419830c62f9e708b6c47b5541c044a19a1cdc64c4eb0c903311d6282fdd does not even contain the HTLC. In this case the HTLC could safely be failed back to peer A.

ziggie1984 commented 1 year ago

> Another point is this specific case of "missing HTLC in remote commitment". In the FC from peer B -> offline peer, the HTLC was not in the remote commitment (therefore there is no possibility for the offline peer to come back and claim it later). The FC transaction 8f58a419830c62f9e708b6c47b5541c044a19a1cdc64c4eb0c903311d6282fdd does not even contain the HTLC. In this case the HTLC could safely be failed back to peer A.

Not really, because you have to make sure that your commitment (without the HTLC) is confirmed first. Your peer may hold a valid commitment transaction with the HTLC included (at least you have to assume it does; you cannot be sure it never received it just because it was offline), which means this HTLC could very well end up confirmed if your peer has the preimage and decides to go on-chain.

ziggie1984 commented 1 year ago

Ok, I was having the same case, with an incoming and outgoing HTLC being stuck because the outgoing HTLC went on-chain (and did not confirm before the incoming HTLC would run into its timeout). But luckily my incoming HTLC was failed back because of a positive side effect of the interceptor. Basically the interceptor will fail all incoming HTLCs which are close to expiry [13 blocks away] (https://github.com/lightningnetwork/lnd/blob/master/htlcswitch/interceptable_switch.go#L470). That's exactly what happened in my case: it canceled the HTLC exactly 13 blocks before timeout.

I think the important code part is here:

https://github.com/lightningnetwork/lnd/blob/master/htlcswitch/interceptable_switch.go#L293

Here we cancel incoming HTLCs even though their outgoing counterpart may not be resolved yet; at least, we do not check whether there is still an outgoing HTLC on the downstream channel.
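
As a rough Go sketch, the check boils down to something like this (hypothetical names; see the linked file for the actual code):

```go
package sketch

// defaultRejectDelta mirrors the ~13-block window described above
// (hypothetical name, not the actual lnd identifier).
const defaultRejectDelta = 13

// shouldRejectIncoming fails an intercepted incoming HTLC back once it gets
// too close to its expiry. Note: no check of the downstream (outgoing) HTLC.
func shouldRejectIncoming(currentHeight, incomingExpiry uint32) bool {
	return currentHeight+defaultRejectDelta >= incomingExpiry
}
```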

This failing of an incoming HTLC while the outgoing one is still stuck is pretty new (9 months) and really hard to test in regtest mode (it requires filling the mempool with unconfirmed transactions). Could you check whether my analysis is correct, @joostjager? :)

What I am saying is basically that when you have an interceptor running, it will fail back incoming HTLCs even though their outgoing counterpart is not resolved yet. I think that's good, because otherwise your peer will force close on you anyway and you will lose the second channel.

yyforyongyu commented 1 year ago

Given the setup, Peer Alice -> Peer Bob -> Offline Peer Charlie, if Charlie is offline during the whole time, then yeah it's safe to cancel the HTLC, but you can't be sure. If Charlie comes online after the FC, there are two scenarios,

  1. the HTLC hasn't timed out yet, Charlie can claim the HTLC via the preimage path, Bob will extract the preimage from mempool and settle the incoming HTLC with Alice.
  2. the HTLC has timed out, and the sweeping tx is not yet confirmed, Charlie can still claim it by racing against Bob. In this case, Bob will extract the preimage and settle the HTLC with Alice.

This means Bob would not lose the HTLC if Charlie decides to come online and claim it for w/e reason. However, if Bob cancels the HTLC with Alice after the FC, he is at risk of losing it if Charlie decides to cheat.

So IMO canceling early is not a good choice. Instead, assuming this is an anchor channel, the most feasible way is to fee bump the force close tx.
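
For a rough sense of what that bump costs, here is a small worked example of the CPFP arithmetic (all numbers are assumptions for illustration, not taken from this issue; lnd exposes anchor CPFP via `lncli wallet bumpclosefee` on recent versions):

```go
package main

import "fmt"

// Rough CPFP arithmetic for bumping a stuck force close via its anchor output.
func main() {
	const (
		parentVSize   = 700.0 // vbytes of the commitment tx (assumed)
		parentFeeRate = 10.0  // sat/vB it was broadcast at (assumed)
		childVSize    = 150.0 // vbytes of the anchor-spending child (assumed)
		targetRate    = 60.0  // sat/vB we want the package to pay (assumed)
	)

	parentFee := parentVSize * parentFeeRate
	packageFee := (parentVSize + childVSize) * targetRate
	childFee := packageFee - parentFee

	fmt.Printf("child must pay ~%.0f sats (%.1f sat/vB on its own)\n",
		childFee, childFee/childVSize)
	// => child must pay ~44000 sats (293.3 sat/vB on its own)
}
```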

Crypt-iQ commented 1 year ago

Think we can close this

BhaagBoseDK commented 1 year ago

> Ok, I was having the same case, with an incoming and outgoing HTLC being stuck because the outgoing HTLC went on-chain (and did not confirm before the incoming HTLC would run into its timeout). But luckily my incoming HTLC was failed back because of a positive side effect of the interceptor. Basically the interceptor will fail all incoming HTLCs which are close to expiry [13 blocks away] (https://github.com/lightningnetwork/lnd/blob/master/htlcswitch/interceptable_switch.go#L470). That's exactly what happened in my case: it canceled the HTLC exactly 13 blocks before timeout.
>
> I think the important code part is here:
>
> https://github.com/lightningnetwork/lnd/blob/master/htlcswitch/interceptable_switch.go#L293
>
> Here we cancel incoming HTLCs even though their outgoing counterpart may not be resolved yet; at least, we do not check whether there is still an outgoing HTLC on the downstream channel.
>
> This failing of an incoming HTLC while the outgoing one is still stuck is pretty new (9 months) and really hard to test in regtest mode (it requires filling the mempool with unconfirmed transactions). Could you check whether my analysis is correct, @joostjager? :)
>
> What I am saying is basically that when you have an interceptor running, it will fail back incoming HTLCs even though their outgoing counterpart is not resolved yet. I think that's good, because otherwise your peer will force close on you anyway and you will lose the second channel.

If this is possible in interceptor why not in standard lnd?

ziggie1984 commented 1 year ago

> This means Bob would not lose the HTLC if Charlie decides to come online and claim it for w/e reason. However, if Bob cancels the HTLC with Alice after the FC, he is at risk of losing it if Charlie decides to cheat.
>
> So IMO canceling early is not a good choice. Instead, assuming this is an anchor channel, the most feasible way is to fee bump the force close tx.

Not sure if you read my comment, but having an active interceptor will cancel it back even though the downstream HTLC is not resolved. I think it's unintended behaviour (see my comment above). Should I investigate it further, @yyforyongyu?

Crypt-iQ commented 1 year ago

I think there are some false assumptions going on here. LND will cancel back dust HTLCs (i.e. not on the commitment tx) here: https://github.com/lightningnetwork/lnd/blob/fd9adaf6ceb3649c07bbb4982bd60dd632e8cda0/contractcourt/channel_arbitrator.go#L1671-L1677 which then get failed back to the incoming channel here: https://github.com/lightningnetwork/lnd/blob/fd9adaf6ceb3649c07bbb4982bd60dd632e8cda0/contractcourt/channel_arbitrator.go#L2145-L2153

So either peer A force closed for another reason or there is a separate bug
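
The gist of those two snippets, sketched with hypothetical types (the real logic lives in channel_arbitrator.go, linked above): a dust HTLC has no output on the commitment transaction, so there is nothing to resolve on-chain and it can be failed back to the incoming channel right away.

```go
package sketch

// htlcInfo is a hypothetical stand-in for the arbitrator's view of an HTLC.
type htlcInfo struct {
	// OutputIndex is negative in this sketch when the HTLC was trimmed as
	// dust and therefore has no output on the commitment transaction.
	OutputIndex int32
}

// canFailBackImmediately reports whether the HTLC can be failed back to the
// incoming channel as soon as the force close is decided: with no on-chain
// output there is nothing the remote party could ever claim.
func canFailBackImmediately(h htlcInfo) bool {
	return h.OutputIndex < 0
}
```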

Crypt-iQ commented 1 year ago

Reopening for discussion

BhaagBoseDK commented 1 year ago

> This FC is still not confirmed in the mempool, which is why peer B has not removed/failed the HTLC with peer A.

This is not true, the log line "immediately failing..." means that the HTLC was dust and failed backwards.

> The HTLC is missing in remote commitment because the peer is offline and therefore has not acked the HTLC.

There would be two commitments, the remote pending commitment and the remote commitment. It would be in the remote pending commitment.

yyforyongyu commented 1 year ago

@ziggie1984 yes please!

Crypt-iQ commented 1 year ago

Accidentally edited instead of commenting, but here's my comment:

> This FC is still not confirmed in the mempool, which is why peer B has not removed/failed the HTLC with peer A.

This is not true, the log line "immediately failing..." means that the HTLC was dust and failed backwards.

> The HTLC is missing in remote commitment because the peer is offline and therefore has not acked the HTLC.

There would be two commitments, the remote pending commitment and the remote commitment. It would be in the remote pending commitment.

BhaagBoseDK commented 1 year ago

Well, in that edit you seem to have removed relevant information.

- The HTLC in question was 20015. Is that dust?
- The HTLC was in the remote pending commitment. So upon expiry, peer B force closed with the offline peer. See txn 8f58a419830c62f9e708b6c47b5541c044a19a1cdc64c4eb0c903311d6282fdd. You can see the HTLC is not present in this force close (because it was not acked by the offline peer).
- This txn was not confirmed for 144 blocks due to the congested mempool. So after 144 blocks (the CLTV delta of peer B), peer A force closed on peer B. See txn 8dcdcb446b3cbfc38e6164e03592c4593654d29426e27c036d4948f7403d509a. The HTLC is present in this transaction, indicating it was not failed back.

Crypt-iQ commented 1 year ago

Relevant log line is here: https://github.com/lightningnetwork/lnd/blob/fd9adaf6ceb3649c07bbb4982bd60dd632e8cda0/contractcourt/channel_arbitrator.go#L1868 meaning that the HTLC is failed back, but there is perhaps a bug in the code somewhere which we can't diagnose without logs

ziggie1984 commented 1 year ago

I analysed this situation further and can conclude that LND will not cancel back the HTLC (if it's not dust) and will hold onto it until the peer FCs the outgoing HTLC (without a registered interceptor).

With or without a registered interceptor, LND will fail the incoming HTLC back without verifying that the outgoing HTLC is still active iff the incoming HTLC runs into the RejectDelta of 13 blocks AND the ChannelLink is reconnected.

Scenario: Alice => Bob => Carol

Bob has an increased RejectDelta: https://github.com/lightningnetwork/lnd/blob/master/htlcswitch/interceptable_switch.go#L562

I replaced it with (40+78).

Now Carol creates a hold invoice and Bob registers an interceptor. I then mine 3 blocks so that I come within Bob's RejectDelta, and I only need to disconnect/reconnect and the incoming HTLC fails even though the outgoing one is still not resolved.

Log of Bob (as expected):

2023-05-15 10:53:58.060 [DBG] HSWC: Interception rejected because htlc expires too soon: circuit=(Chan ID=204:1:0, HTLC ID=3), height=216, incoming_timeout=333
2023-05-15 10:53:58.060 [DBG] HSWC: ChannelLink(3bb535672973053a3184cf77ced48583204c4252521221518c05e619fbcccd19:0): queueing removal of FAIL closed circuit: (Chan ID=204:1:0, HTLC ID=3)->(Chan ID=199:2:0, HTLC ID=0)

Now I cancel back the holdinvoice on Carol's node:

Now the logs show as expected on Bob's node:

2023-05-15 11:03:09.644 [ERR] HSWC: unable to find target channel for HTLC fail: channel ID = 199:2:0, HTLC ID = 1
2023-05-15 11:03:09.644 [ERR] HSWC: Unhandled error while reforwarding htlc settle/fail over htlcswitch: unable to find target channel for HTLC fail: channel ID = 199:2:0, HTLC ID = 1
2023-05-15 11:03:10.114 [DBG] HSWC: Sent 0 satoshis and received 0 satoshis in the last 10 seconds (0.100000 tx/sec)

Before fixing this issue, I would like to propose a config setting where the node runner can decide whether they are willing to bear the risk and cancel back incoming HTLCs when the outgoing HTLC is still not resolved (it may not be worth sweeping because chain fees are too high). Otherwise, I find this "bug" kind of handy for now, to cancel back if I want to in case my outgoing HTLCs are not resolved in time.

To fix this issue we definitely need to check if there is still an outgoing HTLC at play before canceling back.
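
For concreteness, such a knob could look roughly like this (purely hypothetical, this option does not exist in lnd today):

```go
package configsketch

// HtlcSwitchConfig is a hypothetical shape for the setting proposed above;
// this flag does NOT exist in lnd, it only illustrates the proposal.
type HtlcSwitchConfig struct {
	// FailBackUnresolved, if true, fails an incoming HTLC back to the
	// upstream peer once it nears expiry, even when the outgoing HTLC on
	// the downstream channel is still unresolved on-chain. This avoids a
	// cascading force close at the cost of the HTLC amount if the
	// downstream peer later claims the preimage path.
	FailBackUnresolved bool
}
```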

Crypt-iQ commented 1 year ago

@BhaagBoseDK can you share logs so we can diagnose the bug when we get a chance

BhaagBoseDK commented 1 year ago

My log snippets are in the description, unless you want a specific HTLC or time window? Please note I do not have debug logs.

Crypt-iQ commented 1 year ago

@BhaagBoseDK can you save your logs for the specific htlc and for this time so somebody can look at it when this gets prioritized?

SunnySarahNode commented 11 months ago

I managed to catch a very good case.

Me: LND 0.17.3, bitcoind 25 (fully indexed, mempool = 3 GB), Ubuntu, clearnet only
My peer: unknown, Tor only

I have a HTLC in our channel:

{
    "incoming": false,
    "amount": "54572",
    "hash_lock": "c99e...83f6",
    "expiration_height": 821335,
    "htlc_index": "87642",
    "forwarding_channel": "896115171070509057",
    "forwarding_htlc_index": "78605"
}

The known part of the route is:

(1) (someone)
(2 - me) 03423790614f023e3c0cdaa654a3578e919947e4c3a14bf5044e7c787ebd11af1a
(3 - my peer) 021720a04a2094ccff4c56bd6ab20f7e36e0af17cb0d3b90ea00ce0f07bd51cf8c
(4) 0284e3ca3753632c51a7d9a156370161ce2a19af41dbf4966eecf74bf3f7ba0a79
(5) (someone)

The channel between (3) and (4) was FCed: tx ce960ba459e62fbbe9178130de89fb595afa8ffb390b954d3e3f3aaf4e0f3f56

The relevant HTLC - in the channel between (3) and (4) - went on-chain:

(screenshot of the on-chain HTLC output omitted)

In our channel - between (2) and (3) - this HTLC is still alive and its status does not change. There are no records in the log (HSWC is in DEBUG mode) mentioning this HTLC. Other contracts appear and get resolved as usual in this channel; both nodes are OK and online, our channel is active and enabled on both sides, and my HSWC works as usual.

Obviously the channel closing transaction will not be mined before the HTLC expires, so our channel - between (2) and (3) - is doomed to be FCed.

Reconnecting or restarting the nodes (both mine and my peer's) doesn't help.

Question 0 is: do I understand the situation correctly?

...and if yes...

Question 1 is: how was this guy (4) able to FC the channel between (3) and (4) with a fee of 9 sat/vB while the normal fee at that moment was more than 100? Can I do the same with my channels? ;)

Question 2 is: I definitely don't want to pay a 100000+ sats fee for this ability of that guy, who is not even my peer. Can we somehow avoid such situations?

ziggie1984 commented 11 months ago

> Question 2 is: I definitely don't want to pay a 100000+ sats fee for this ability of that guy, who is not even my peer. Can we somehow avoid such situations?

Make sure you reconnect to peer 3 when the HTLC approaches the block deadline + 13 blocks; only then will your peer fail the HTLC back, and no FC will happen on your channel.
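
Using the expiration_height from the HTLC JSON above and the 13-block RejectDelta discussed earlier, the reconnect window works out roughly like this (illustrative arithmetic; the exact comparison in lnd may differ by a block):

```go
package main

import "fmt"

func main() {
	const (
		expirationHeight = 821335 // from the HTLC shown above
		rejectDelta      = 13     // fail-back window discussed earlier
	)
	// Reconnecting at or after this height lets peer (3) fail the HTLC
	// back instead of letting the channel go to chain.
	fmt.Println("reconnect at height >=", expirationHeight-rejectDelta) // 821322
}
```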

> Question 1 is: how was this guy (4) able to FC the channel between (3) and (4) with a fee of 9 sat/vB while the normal fee at that moment was more than 100? Can I do the same with my channels? ;)

The max anchor commit fee defaults to 10 sat/vB, but I am wondering why the channel is not CPFPed. Maybe it's already been purged out of the mempool by the respective nodes; then he will not be able to bump it.

bitromortac commented 11 months ago

> The max anchor commit fee defaults to 10 sat/vB, but I am wondering why the channel is not CPFPed. Maybe it's already been purged out of the mempool by the respective nodes; then he will not be able to bump it.

I have seen many of these cases where the commitment fee rate is at around 10sat/vbyte (see https://github.com/lightningnetwork/lnd/discussions/8271), although it should be higher (https://github.com/lightningnetwork/lnd/issues/8240#issuecomment-1854546090).

SunnySarahNode commented 11 months ago

@ziggie1984, Thank you for your answer.

> Make sure you reconnect to peer 3 when the HTLC approaches the block deadline + 13 blocks; only then will your peer fail the HTLC back, and no FC will happen on your channel.

Practice shows that in such cases reconnection does not help, but restarting the node shortly before the expiration of the HTLC helps. Obviously there is some difference between a simple reconnect and what happens after restart. I'll try to collect some logs and come back when I find something interesting.

ziggie1984 commented 11 months ago

> I have seen many of these cases where the commitment fee rate is at around 10sat/vbyte (see https://github.com/lightningnetwork/lnd/discussions/8271), although it should be higher (https://github.com/lightningnetwork/lnd/issues/8240#issuecomment-1854546090).

Good input, I looked at it as well. #8271 is definitely not the right behavior during channel opening, but the fee negotiation for normal UpdateFee msgs should cap at the min relay fee. Though there might always be a problem between the two peers: the initiator might have an increased mempool while the node force-closing the channel might not, so we might end up in this situation where the non-initiator cannot bump the fee of the commitment. Not sure if there is really a fix for this for now, because not accepting fee updates might cause problems. 🤔

ziggie1984 commented 11 months ago

> Practice shows that in such cases reconnection does not help, but restarting the node shortly before the expiration of the HTLC helps. Obviously there is some difference between a simple reconnect and what happens after restart. I'll try to collect some logs and come back when I find something interesting.

That would be great. Are you verifying that you disconnect the peer and then connect again? The link needs to be torn down for this to work.

SunnySarahNode commented 11 months ago

> That would be great. Are you verifying that you disconnect the peer and then connect again? The link needs to be torn down for this to work.

Of course. Disconnect and connect the peer again.