lightningnetwork / lnd

Lightning Network Daemon ⚡️
MIT License
7.67k stars 2.07k forks source link

Improve channel force close logging #5324

Open C-Otto opened 3 years ago

C-Otto commented 3 years ago

Background

My node force-closed a channel and the logs don't help me understand why this happened. I'd like this issue to be resolved by updating the log statements to contain the necessary information. In addition to this, I'd appreciate an explanation.

Your environment

Expected behaviour

The logs tell me why the channel was force-closed.

Actual behaviour

Note that the force-close transaction has two outputs, one already swept by my peer, the other timelocked for me. This makes me believe that my peer was online (but not connected to my node?).

  1. HTLCs, failing with "unable to complete dance error: remote unresponsive"
  2. disabling channel
  3. Updating link policy?!
  4. GossipSyncer is active with peer?!
  5. Force-close

(BST)

2021-05-20 03:16:15.988 [DBG] HSWC: ChannelLink(__SHORT_CHAN_ID_UNRELATED_NODE1__): forwarding 1 packets to switch
2021-05-20 03:16:16.076 [DBG] HSWC: ChannelLink(__SHORT_CHAN_ID__): queueing keystone of ADD open circuit: (Chan ID=__SHORT_CHAN_ID_UNRELATED_NODE1__, HTLC ID=1835)->(Chan ID=__SHORT_CHAN_ID__, HTLC ID=36)
2021-05-20 03:16:16.437 [DBG] HSWC: ChannelLink(__SHORT_CHAN_ID__): removing Add packet (Chan ID=__SHORT_CHAN_ID_UNRELATED_NODE1__, HTLC ID=1835) from mailbox
--
2021-05-20 03:33:01.864 [DBG] HSWC: ChannelLink(__SHORT_CHAN_ID_UNRELATED_NODE2__): forwarding 1 packets to switch
2021-05-20 03:33:01.952 [DBG] HSWC: ChannelLink(__SHORT_CHAN_ID__): queueing keystone of ADD open circuit: (Chan ID=__SHORT_CHAN_ID_UNRELATED_NODE2__, HTLC ID=542)->(Chan ID=__SHORT_CHAN_ID__, HTLC ID=37)
2021-05-20 03:34:02.080 [ERR] HSWC: ChannelLink(__SHORT_CHAN_ID__): failing link: unable to complete dance with error: remote unresponsive
2021-05-20 03:34:02.081 [INF] HSWC: ChannelLink(__SHORT_CHAN_ID__): exited
2021-05-20 03:34:02.082 [INF] HSWC: Removing channel link with ChannelID(__CHANNEL_ID__)
2021-05-20 03:34:02.084 [INF] HSWC: ChannelLink(__SHORT_CHAN_ID__): stopping
2021-05-20 03:34:02.086 [DBG] HSWC: ChannelLink(__SHORT_CHAN_ID_UNRELATED_NODE2__): queueing removal of FAIL closed circuit: (Chan ID=__SHORT_CHAN_ID_UNRELATED_NODE2__, HTLC ID=542)->(Chan ID=0:0:0, HTLC ID=0)
2021-05-20 03:34:02.320 [DBG] HSWC: ChannelLink(__SHORT_CHAN_ID_UNRELATED_NODE2__): removing Fail/Settle packet (Chan ID=__SHORT_CHAN_ID_UNRELATED_NODE2__, HTLC ID=542) from mailbox
--
2021-05-20 03:54:48.643 [INF] NANN: Announcing channel(__CHAN_POINT__) disabled [detected]
---
2021-05-20 15:04:00.704 [DBG] HSWC: Unable to find ChannelPoint(__CHAN_POINT__) to update link policy
---
2021-05-20 15:46:09.083 [INF] DISC: GossipSyncer(__REMOTE_PUBKEY__): applying gossipFilter(start=2021-05-20 15:46:09.083263967 +0100 BST m=+1810838.985025483, end=2157-06-26 21:14:24.083263967 +0000 GMT)
---
2021-05-20 15:52:06.400 [DBG] HSWC: Unable to find ChannelPoint(__CHAN_POINT__) to update link policy
---
2021-05-22 14:03:37.887 [INF] CRTR: Pruning channel graph using block 00000000000000000001cec503af2292875dc44437f4875f877cb498e7e4d675 (height=684537)
2021-05-22 14:03:38.091 [INF] CRTR: Block 00000000000000000001cec503af2292875dc44437f4875f877cb498e7e4d675 (height=684537) closed 1 channels
2021-05-22 14:03:38.971 [INF] NTFN: New block: height=684537, sha=00000000000000000001cec503af2292875dc44437f4875f877cb498e7e4d675
2021-05-22 14:03:38.972 [INF] UTXN: Attempting to graduate height=684537: num_kids=0, num_babies=0
2021-05-22 14:03:38.973 [ERR] LTND: Unable to lookup witness: no witnesses
2021-05-22 14:03:39.060 [INF] CNCT: ChannelArbitrator(__CHAN_POINT__): force closing chan
2021-05-22 14:03:39.172 [INF] HSWC: Removing channel link with ChannelID(__CHANNEL_ID__)
2021-05-22 14:03:39.308 [INF] CNCT: Broadcasting force close transaction [...]
2021-05-22 14:03:39.315 [INF] LNWL: Inserting unconfirmed transaction __CLOSE_TX_HASH__
2021-05-22 14:03:39.598 [INF] CNCT: ChannelArbitrator(__CHAN_POINT__): trigger chainTrigger moving from state StateCommitmentBroadcasted to StateCommitmentBroadcasted

Related questions:

C-Otto commented 3 years ago

I'm in contact with the peer node admin and we're comparing logs. It seems our nodes where connected the entire time, well before the "dance" issue, up to the force-close.

C-Otto commented 3 years ago

From my peer's logs:

(UTC-5)

2021-05-19 21:39:12.988 [ERR] HSWC: ChannelLink(__SHORT_CHAN_ID__): failing link: unable to complete dance with error: remote unresponsive
2021-05-19 21:39:12.988 [INF] HSWC: ChannelLink(__SHORT_CHAN_ID__): exited
2021-05-19 21:39:12.988 [INF] HSWC: Removing channel link with ChannelID(__CHAN_ID__)
2021-05-19 21:39:12.988 [INF] HSWC: ChannelLink(__SHORT_CHAN_ID__): stopping
Roasbeef commented 3 years ago

Is this a duplicate of that other instance? Easier to keep track of things if everything is reported to a single issue.

2021-05-19 21:39:12.988 [ERR] HSWC: ChannelLink(SHORT_CHAN_ID): failing link: unable to complete dance with error: remote unresponsive

This means the remote peer didn't respond w/ a commitment signature when it needed to do so. We added this to detect stale connections, so something was up w/ their node, exactly what I'm not sure but it was effectively "stuck" for 1 minute (the current default timeout). If the node was an ACINQ node, then they've recently fixed a force close divergence related to a divergence in the way nodes sort HTLCs.

C-Otto commented 3 years ago

You mean #5313? Those are/were different channels and issues.

The remote node is lnd 0.13.0 rc something, YTMND, @Max in slack

Crypt-iQ commented 3 years ago

Your version (v0.12.1) and your peer's version (v0.13.0) would force close on each other in certain cases. Two v0.13.0 nodes should not, if they do please file an issue

C-Otto commented 3 years ago

Alright. Could you share some details on this? I guess this shouldn't happen, no matter the versions.

C-Otto commented 3 years ago

Just to clarify: I raised this issue so that the log output can be improved. As a user I want (need?) to know the reason for a force-close. In this specific case I still have no clue.

Crypt-iQ commented 3 years ago

Alright. Could you share some details on this? I guess this shouldn't happen, no matter the versions.

We had some faulty logic in our channel state machine that would cause a force close when going down and up, that's about all I can say without getting too deep into the inner workings of the state machine. See here: https://github.com/lightningnetwork/lnd/pull/5231

C-Otto commented 3 years ago

Ok. Assuming something like this might happen again, would it be possible to add some information to the logs? Something like "X happened, should not happen, force-closing channel just to be on the safe side" is better than nothing...

Crypt-iQ commented 3 years ago

There isn't a way to know if it's an erroneous force close or not. It could either be a bug in our implementation or a buggy channel peer. So it's not really possible to add useful information to the logs.

C-Otto commented 3 years ago

I'm not limiting this to erroneous force closes.

C-Otto commented 3 years ago

To clarify: aside from user triggered force-closes (lncli closechannel --force and the like), the reason for the force-close should be communicated to the user so that the underlying issue can be resolved. I don't know the specific situations that might lead to a force-close, but I believe that adding a bit of context would be possible in virtual all of them. Was it triggered by the remote? Did the remote node send a message? Was it a disagreement about something specific? Was it a protocl breach, or something where the spec requires the channel to be force-closed?

In the case that lead me to open this issue, the bug fixed in #5231 seemed to be the reason for the force-close. Having some additional information somewhere in the log ("unexpected remote commit chain" or something like that?) might have helped pin this down, and/or identify/fix the underlying issue earlier.

Crypt-iQ commented 3 years ago

If cause of force close was #5231 it would have had a log message in the link. Did the channel have active htlcs on it? Usually the cause of force close is pretty clear from the logs so not sure whats going on

C-Otto commented 3 years ago

I posted the logs I found above. Maybe 5231 is not the root cause, maybe it is. No matter what, a simple log message "force closing channel X because Y" is all I want, where "Y" should be as specific as possible.

Crypt-iQ commented 3 years ago

thats usually the case which is why I want to pinpoint why you don't have a log message

cryptosharks131 commented 3 years ago

I have also experienced this same situation recently with very little information given for context of the force closure by my node.

[ERR] LTND: Unable to lookup witness: no witnesses
[INF] CNCT: ChannelArbitrator(xxxxxxxxxxxx): force closing chan
[INF] HSWC: Removing channel link with ChannelID(xxxxxxxxxxxx)
[INF] CNCT: Broadcasting force close transaction...........

I have been on v0.13.1 (v0.12.1 previously) for a week or so already but am unsure about my peer's version on the other side.

augustresende commented 2 years ago

Same issue.

C-Otto commented 2 years ago

My node force-closed a channel to a peer that, according to LN explorer websites, also had other channels force-closed. I think this is some kind of recovery attempt from my peer. However, the logs don't indicate why the force-close was issued.

C-Otto commented 2 years ago

Today my node force-closed a channel. The logs don't give ANY information about the reasons.

BhaagBoseDK commented 2 years ago

It will be really great to have additional logging for the reasons when a force close is initiated.

babikpatient6 commented 2 years ago

Sorry to bug. What is considered normal frequency of spurious force closures? Now they make ~67% of all closures on my node. Just over last week there was 8 of them. Both local/remote randomly. I am noob, but tried to check logs for clues and didnt see the cause. Extremely demoralising & makes operating node at net loss.

C-Otto commented 2 years ago

The changes I made for this ticket should help you understand the reasons for the closes. Until then I'm afraid you're a bit out of luck. Usually force-closes indicate that one of the nodes is offline for too long, or that there are connection issues. I suggest you dig through the logs and join the Slack chat. This issue isn't the place to discuss the details.

BhaagBoseDK commented 2 years ago

Today I had a force closure with Diamond Hand. In my lnd.log I could not find any useful information

lnd.log.267-2022-04-25 02:39:46.691 [INF] UTXN: Attempting to graduate height=733427: num_kids=0, num_babies=0 lnd.log.267-2022-04-25 02:39:46.693 [ERR] LTND: Unable to lookup witness: no witnesses lnd.log.267-2022-04-25 02:39:46.719 [INF] HSWC: Garbage collected 29 shared secret hashes at height=733427 lnd.log.267:2022-04-25 02:39:46.727 [INF] CNCT: ChannelArbitrator(786945c839f4818d9fff0e0ff189953368c15bd81ce158b732652afdaeef343d:1): force closing chan lnd.log.267-2022-04-25 02:39:46.748 [INF] HSWC: Removing channel link with ChannelID(3d34efaefd2a6532b758e11cd85bc168339589f10f0eff9f8d81f439c8456979) lnd.log.267:2022-04-25 02:39:46.748 [INF] HSWC: ChannelLink(786945c839f4818d9fff0e0ff189953368c15bd81ce158b732652afdaeef343d:1): stopping lnd.log.267:2022-04-25 02:39:46.749 [INF] HSWC: ChannelLink(786945c839f4818d9fff0e0ff189953368c15bd81ce158b732652afdaeef343d:1): exited

The force close transaction had one outgoing HTLC. This would have timed out in ~50 blocks so not sure why force close was required.

lnd.log.267:2022-04-25 02:39:46.813 [INF] CNCT: Broadcasting force close transaction ac4e47b80906f69b1cb8d637128401fc3b4bae757d82f94d7b2c153f5beb1f4e, ChannelPoint(786945c839f4818d9fff0e0ff189953368c15bd81ce158b732652afdaeef343d:1): (wire.MsgTx)(0x400cbb7bc0)({ lnd.log.267- Version: (int32) 2, lnd.log.267- TxIn: ([]wire.TxIn) (len=1 cap=1) { lnd.log.267- (wire.TxIn)(0x4016c831a0)({ lnd.log.267: PreviousOutPoint: (wire.OutPoint) 786945c839f4818d9fff0e0ff189953368c15bd81ce158b732652afdaeef343d:1, lnd.log.267- SignatureScript: ([]uint8) , lnd.log.267- Witness: (wire.TxWitness) (len=4 cap=4) { lnd.log.267- ([]uint8) , lnd.log.267- ([]uint8) (len=71 cap=144) { lnd.log.267- 00000000 30 44 02 20 17 0a 19 4c 96 04 31 d0 c4 b6 62 e7 |0D. ...L..1...b.| lnd.log.267- 00000010 51 db dc 79 43 0c 9c 83 83 b4 44 5f f0 4c d0 da |Q..yC.....D_.L..| lnd.log.267- 00000020 b9 08 cc 70 02 20 6e ee 9b 90 7e d0 ce e0 a6 43 |...p. n...~....C| lnd.log.267- 00000030 cd 18 8e 10 3e 72 84 df e4 63 35 51 6e c3 a3 2a |....>r...c5Qn..| lnd.log.267- 00000040 4b 9f e7 60 8c b2 01 |K..`...| lnd.log.267- }, lnd.log.267- ([]uint8) (len=72 cap=144) { lnd.log.267- 00000000 30 45 02 21 00 b0 34 dd 7b ef 08 27 7b 9b 53 70 |0E.!..4.{..'{.Sp| lnd.log.267- 00000010 20 43 9c a2 14 2e e1 9e ed 17 c4 52 c5 de 2c e0 | C.........R..,.| lnd.log.267- 00000020 6c 85 fe 9a 46 02 20 6d ac 86 1d 88 86 6d 2f f9 |l...F. m.....m/.| lnd.log.267- 00000030 31 43 a4 4f 1b f3 7c 52 fc 35 17 74 fb 98 69 33 |1C.O..|R.5.t..i3| lnd.log.267- 00000040 c4 9c 9a 8b 72 92 15 01 |....r...| lnd.log.267- }, lnd.log.267- ([]uint8) (len=71 cap=500) { lnd.log.267- 00000000 52 21 03 c2 03 fc 9a 24 6d 43 0c c7 8f 91 ec f5 |R!.....$mC......| lnd.log.267- 00000010 83 7b cd 36 17 83 c9 ef fe d0 2e 2f 35 ab af c6 |.{.6......./5...| lnd.log.267- 00000020 ac 88 fc 21 03 e1 53 ae 01 53 a7 62 68 d8 a0 4e |...!..S..S.bh..N| lnd.log.267- 00000030 0d 6f 07 94 72 9d 26 53 33 75 c5 af 70 44 f4 e9 |.o..r.&S3u..pD..| lnd.log.267- 00000040 9a bc e3 d0 2b 52 ae |....+R.| lnd.log.267- } lnd.log.267- }, lnd.log.267- Sequence: (uint32) 2157310309 lnd.log.267- }) lnd.log.267- }, lnd.log.267- TxOut: ([]wire.TxOut) (len=5 cap=5) { lnd.log.267- (wire.TxOut)(0x400be6cf40)({ lnd.log.267- Value: (int64) 330, lnd.log.267- PkScript: ([]uint8) (len=34 cap=34) { lnd.log.267- 00000000 00 20 25 14 8e 4a b4 62 f3 42 d4 12 d1 84 2e 2a |. %..J.b.B.....| lnd.log.267- 00000010 66 1b 5a 58 c0 bc d4 9c c3 10 0d 85 ec a3 38 e1 |f.ZX..........8.| lnd.log.267- 00000020 5e 0a |^.| lnd.log.267- } lnd.log.267- }), lnd.log.267- (wire.TxOut)(0x400be6cfc0)({ lnd.log.267- Value: (int64) 330, lnd.log.267- PkScript: ([]uint8) (len=34 cap=34) { lnd.log.267- 00000000 00 20 5f 8c b3 da 52 cf f2 cb 3f c5 0c c5 6d 02 |. _...R...?...m.| lnd.log.267- 00000010 af f1 10 28 47 73 e4 21 f2 ed 0f 7a bb 4a bf c2 |...(Gs.!...z.J..| lnd.log.267- 00000020 5e 12 |^.| lnd.log.267- } lnd.log.267- }), lnd.log.267- (wire.TxOut)(0x400be6cfe0)({ lnd.log.267- Value: (int64) 8177, lnd.log.267- PkScript: ([]uint8) (len=34 cap=34) { lnd.log.267- 00000000 00 20 a9 15 ad 5d 99 e4 31 b5 69 0d 9c 1f ce 52 |. ...]..1.i....R| lnd.log.267- 00000010 01 6b 1b 13 07 ce 1d e1 f0 86 6f d0 cd bf c2 1e |.k........o.....| lnd.log.267- 00000020 8e 3d |.=| lnd.log.267- } lnd.log.267- }), lnd.log.267- (wire.TxOut)(0x400be6d000)({ lnd.log.267- Value: (int64) 715180, lnd.log.267- PkScript: ([]uint8) (len=34 cap=34) { lnd.log.267- 00000000 00 20 3c a4 49 4e d1 cd 41 d3 63 02 41 83 41 8b |. <.IN..A.c.A.A.| lnd.log.267- 00000010 a4 af c8 97 6b 23 96 c4 d1 64 94 17 0d b9 5d a1 |....k#...d....].| lnd.log.267- 00000020 05 b7 |..| lnd.log.267- } lnd.log.267- }), lnd.log.267- (*wire.TxOut)(0x400be6d020)({ lnd.log.267- Value: (int64) 4744777, lnd.log.267- PkScript: ([]uint8) (len=34 cap=34) { lnd.log.267- 00000000 00 20 40 6d a3 35 ab b4 05 e3 bd 7d cd 15 62 a6 |. @m.5.....}..b.| lnd.log.267- 00000010 43 64 7d ad 81 58 0f e3 c7 7e 39 89 28 1d 08 a6 |Cd}..X...~9.(...| lnd.log.267- 00000020 9e fa |..| lnd.log.267- } lnd.log.267- }) lnd.log.267- }, lnd.log.267- LockTime: (uint32) 553439807 lnd.log.267-})

Roasbeef commented 2 years ago

This PR fixes a known class of "unable to complete dance" errors: https://github.com/lightningnetwork/lnd/pull/6518