lightningnetwork / lnd

Lightning Network Daemon ⚡️
MIT License
7.71k stars 2.09k forks source link

[bug]: Channel force close due to short database outage and resulting channel failure #8531

Open MasterOfPearls opened 8 months ago

MasterOfPearls commented 8 months ago

Background

node masterofpearls.net experienced a short db outage of 13 seconds, likely due to the system rebooting postgresql (reason undefined), during which a total of 4 channels were processing update requests to their channel state. Two recovered and are alive and well. Two got force closed, one after 3 hours, the other one after close to a week.

Your environment

Steps to reproduce

A test environment would run channels and fail communication at every possible update step e.g. by stopping db communication and restore it after a specified time. Then monitor whether the communication can continue or the channel can be mended.

Expected behavior

Since the LND node did not shut down and continued normal operation, the expected behavior is that the communication between channel partner nodes should continue and restart from an earlier checkpoint to resolve any necessary updates. This did not happen for the mentioned two channels for an undefined reason. It would be desirable to have a more robust in-flight channel mending strategy when communication fails at arbitrary points.

Actual behavior

The communication between channel partner nodes got stuck in a bad state and resulted in the channels to get force closed.

Details:

affected node: 029efe15ef5f0fcc2fdd6b910405e78056b28c9b64e1feff5f13b8dce307e67cad lnd-linux-amd64-v0.16.3-beta (at the time of bug) (using postgresql as db backend with sync backup)

affected channels: 809410x1386x3 770258x2098x2

log entries indicating db outage (total time offline: 13 seconds):

lnd.log.1400.gz:2024-02-27 06:51:58.440 [ERR] HSWC: ChannelLink(2f3c8d42b504fe1efb650f9a7a510479af27357a0ccbf7f5ced8c4ea5110b876:1): failing link: unable to accept revocation: failed to connect to host=localhost user=ln database=lnd_adam: server error (FATAL: the database system is shutting down (SQLSTATE 57P03)) with error: invalid revocation ... lnd.log.1400.gz:2024-02-27 06:52:11.173 [ERR] HSWC: ChannelLink(67c6e33712a1f0f179c49a5988d2366a9a3ad21ed167b46e39cf673bacf41ee9:0): unable to revoke commitment: failed to connect to host=localhost user=ln database=lnd_adam: server error (FATAL: the database system is starting up (SQLSTATE 57P03))

log entries indicating channel failure (809410x1386x3):

2024-02-27 06:51:58.814 [ERR] CRTR: Failed to update mission control store: failed to connect to host=localhost user=ln database=lnd_adam: dial error (dial tcp 127.0.0.1:5432: connect: connection refused) 2024-02-27 06:51:58.977 [ERR] HSWC: ChannelLink(a5bbab27e00ba37350b2d8b15aa1751e4bc57800ffb26b55d43db071ad3691ef:3): unable to revoke commitment: failed to connect to host=localhost user=ln database=lnd_adam: dial error (dial tcp 127.0.0.1:5432: connect: connection refused) 2024-02-27 06:51:59.196 [ERR] HSWC: Unable to ack batch of settle/fails: failed to connect to host=localhost user=ln database=lnd_adam: dial error (dial tcp 127.0.0.1:5432: connect: connection refused) ... 2024-02-27 06:52:17.625 [ERR] HSWC: ChannelLink(a5bbab27e00ba37350b2d8b15aa1751e4bc57800ffb26b55d43db071ad3691ef:3): failing link: ChannelPoint(a5bbab27e00ba37350b2d8b15aa1751e4bc57800ffb26b55d43db071ad3691ef:3): received error from peer: chan_id=ef9136ad71b03dd4556bb2ff0078c54b1e75a15ab1d8b25073a30be027abbba6, err=invalid commitment with error: remote error 2024-02-27 06:52:17.625 [ERR] HSWC: ChannelLink(a5bbab27e00ba37350b2d8b15aa1751e4bc57800ffb26b55d43db071ad3691ef:3): link failed, exiting htlcManager 2024-02-27 06:52:17.625 [INF] HSWC: ChannelLink(a5bbab27e00ba37350b2d8b15aa1751e4bc57800ffb26b55d43db071ad3691ef:3): exited 2024-02-27 06:52:17.625 [INF] HSWC: ChannelLink(a5bbab27e00ba37350b2d8b15aa1751e4bc57800ffb26b55d43db071ad3691ef:3): stopping ... 2024-02-27 07:03:23.201 [INF] NANN: Announcing channel(a5bbab27e00ba37350b2d8b15aa1751e4bc57800ffb26b55d43db071ad3691ef:3) disabled [detected] ... 2024-02-27 09:59:58.113 [ERR] HSWC: ChannelLink(a5bbab27e00ba37350b2d8b15aa1751e4bc57800ffb26b55d43db071ad3691ef:3): failing link: unable to complete dance with error: remote unresponsive 2024-02-27 09:59:58.113 [INF] HSWC: ChannelLink(a5bbab27e00ba37350b2d8b15aa1751e4bc57800ffb26b55d43db071ad3691ef:3): exited ... 2024-03-04 04:18:26.337 [INF] CNCT: ChannelArbitrator(a5bbab27e00ba37350b2d8b15aa1751e4bc57800ffb26b55d43db071ad3691ef:3): remote party has closed channel out on-chain

log entries indicating channel failure (770258x2098x2):

2024-02-27 06:51:59.478 [ERR] HSWC: ChannelLink(3174b1e2eb07919f3d4208696de4a9f920dd60c131f0c8a234e685f4dab753cf:2): failing link: unable to accept revocation: failed to connect to host=localhost user=ln database=lnd_adam: dial error (dial tcp 127.0.0.1:5432: connect: connection refused) with error: invalid revocation 2024-02-27 06:51:59.478 [ERR] HSWC: ChannelLink(3174b1e2eb07919f3d4208696de4a9f920dd60c131f0c8a234e685f4dab753cf:2): link failed, exiting htlcManager 2024-02-27 06:51:59.478 [INF] HSWC: ChannelLink(3174b1e2eb07919f3d4208696de4a9f920dd60c131f0c8a234e685f4dab753cf:2): stopping 2024-02-27 06:51:59.478 [INF] HSWC: ChannelLink(3174b1e2eb07919f3d4208696de4a9f920dd60c131f0c8a234e685f4dab753cf:2): exited ... 2024-02-27 07:02:16.779 [INF] NANN: Announcing channel(3174b1e2eb07919f3d4208696de4a9f920dd60c131f0c8a234e685f4dab753cf:2) disabled [detected] ... 2024-02-27 09:43:29.281 [INF] HSWC: ChannelLink(3174b1e2eb07919f3d4208696de4a9f920dd60c131f0c8a234e685f4dab753cf:2): attempting to re-synchronize 2024-02-27 09:43:35.235 [WRN] HSWC: ChannelLink(3174b1e2eb07919f3d4208696de4a9f920dd60c131f0c8a234e685f4dab753cf:2): error when syncing channel states: first message sent to sync should be ChannelReestablish, instead received: lnwire.Error 2024-02-27 09:43:35.235 [ERR] HSWC: ChannelLink(3174b1e2eb07919f3d4208696de4a9f920dd60c131f0c8a234e685f4dab753cf:2): failing link: unable to synchronize channel states: first message sent to sync should be ChannelReestablish, instead received: lnwire.Error with error: unable to resume channel, recovery required 2024-02-27 09:43:35.236 [INF] HSWC: ChannelLink(3174b1e2eb07919f3d4208696de4a9f920dd60c131f0c8a234e685f4dab753cf:2): stopping ... 2024-02-27 09:54:11.869 [INF] CNCT: Unilateral close of ChannelPoint(3174b1e2eb07919f3d4208696de4a9f920dd60c131f0c8a234e685f4dab753cf:2) detected 2024-02-27 09:54:11.870 [INF] CNCT: ChannelArbitrator(3174b1e2eb07919f3d4208696de4a9f920dd60c131f0c8a234e685f4dab753cf:2): remote party has closed channel out on-chain 2024-02-27 09:57:00.553 [INF] CNCT: ChannelArbitrator(3174b1e2eb07919f3d4208696de4a9f920dd60c131f0c8a234e685f4dab753cf:2): immediately failing htlc=7584940d66fa538346248132a4167c446bd53dece72ffcd7eac1a4a5f3cc8499 from remote commitment

C-Otto commented 8 months ago

I have more logs for one of those channels (809410x1386x3, a5bbab27e00ba37350b2d8b15aa1751e4bc57800ffb26b55d43db071ad3691ef:3). Note that the close transaction only confirmed much later because I tweaked my code in order to save some sats. This is unrelated to this issue, though.

My logs are 6 hours ahead, i.e. 12:52 in my logs corresponds to 06:52 in @MasterOfPearls's logs.

The interesting bit:

2024-02-27 12:52:17.254 [ERR] HSWC: ChannelLink(a5bbab27e00ba37350b2d8b15aa1751e4bc57800ffb26b55d43db071ad3691ef:3): failing link: ChannelPoint(a5bbab27e00ba37350b2d8b15aa1751e4bc57800ffb26b55d43db071ad3691ef:3): unable to accept new commitment: number of htlc sig mismatch. Expected 9 sigs, got 10 with error: invalid commitment

Everything (in that timeframe):

2024-02-27 12:51:30.129 [DBG] HSWC: ChannelLink(a5bbab27e00ba37350b2d8b15aa1751e4bc57800ffb26b55d43db071ad3691ef:3): queueing keystone of ADD open circuit: (Chan ID=0:0:0, HTLC ID=13878439)->(Chan ID=809410:1386:3, HTLC ID=190612)
2024-02-27 12:51:31.067 [DBG] HSWC: ChannelLink(a5bbab27e00ba37350b2d8b15aa1751e4bc57800ffb26b55d43db071ad3691ef:3): queueing keystone of ADD open circuit: (Chan ID=0:0:0, HTLC ID=13878440)->(Chan ID=809410:1386:3, HTLC ID=190613)
2024-02-27 12:51:33.030 [DBG] HSWC: ChannelLink(a5bbab27e00ba37350b2d8b15aa1751e4bc57800ffb26b55d43db071ad3691ef:3): settle-fail-filter &{1 [0]}
2024-02-27 12:51:33.030 [DBG] HSWC: ChannelLink(a5bbab27e00ba37350b2d8b15aa1751e4bc57800ffb26b55d43db071ad3691ef:3): Failed to send 1000004500 mSAT
2024-02-27 12:51:33.078 [DBG] HSWC: ChannelLink(a5bbab27e00ba37350b2d8b15aa1751e4bc57800ffb26b55d43db071ad3691ef:3): removing Add packet (Chan ID=0:0:0, HTLC ID=13878438) from mailbox
2024-02-27 12:51:33.078 [DBG] HSWC: ChannelLink(a5bbab27e00ba37350b2d8b15aa1751e4bc57800ffb26b55d43db071ad3691ef:3): removing Add packet (Chan ID=0:0:0, HTLC ID=13878439) from mailbox
2024-02-27 12:51:33.078 [DBG] HSWC: ChannelLink(a5bbab27e00ba37350b2d8b15aa1751e4bc57800ffb26b55d43db071ad3691ef:3): removing Add packet (Chan ID=0:0:0, HTLC ID=13878440) from mailbox
2024-02-27 12:51:33.352 [DBG] HSWC: ChannelLink(a5bbab27e00ba37350b2d8b15aa1751e4bc57800ffb26b55d43db071ad3691ef:3): queueing keystone of ADD open circuit: (Chan ID=0:0:0, HTLC ID=13878441)->(Chan ID=809410:1386:3, HTLC ID=190614)
2024-02-27 12:51:34.028 [DBG] HSWC: ChannelLink(a5bbab27e00ba37350b2d8b15aa1751e4bc57800ffb26b55d43db071ad3691ef:3): queueing keystone of ADD open circuit: (Chan ID=0:0:0, HTLC ID=13878442)->(Chan ID=809410:1386:3, HTLC ID=190615)
2024-02-27 12:51:34.447 [DBG] HSWC: ChannelLink(a5bbab27e00ba37350b2d8b15aa1751e4bc57800ffb26b55d43db071ad3691ef:3): settle-fail-filter &{1 [0]}
2024-02-27 12:51:34.447 [DBG] HSWC: ChannelLink(a5bbab27e00ba37350b2d8b15aa1751e4bc57800ffb26b55d43db071ad3691ef:3): Failed to send 50005650 mSAT
2024-02-27 12:51:34.512 [DBG] HSWC: ChannelLink(a5bbab27e00ba37350b2d8b15aa1751e4bc57800ffb26b55d43db071ad3691ef:3): removing Add packet (Chan ID=0:0:0, HTLC ID=13878441) from mailbox
2024-02-27 12:51:34.513 [DBG] HSWC: ChannelLink(a5bbab27e00ba37350b2d8b15aa1751e4bc57800ffb26b55d43db071ad3691ef:3): removing Add packet (Chan ID=0:0:0, HTLC ID=13878442) from mailbox
2024-02-27 12:51:35.249 [DBG] HSWC: ChannelLink(a5bbab27e00ba37350b2d8b15aa1751e4bc57800ffb26b55d43db071ad3691ef:3): queueing keystone of ADD open circuit: (Chan ID=0:0:0, HTLC ID=13878443)->(Chan ID=809410:1386:3, HTLC ID=190616)
2024-02-27 12:51:51.698 [DBG] HSWC: ChannelLink(a5bbab27e00ba37350b2d8b15aa1751e4bc57800ffb26b55d43db071ad3691ef:3): removing Add packet (Chan ID=0:0:0, HTLC ID=13878443) from mailbox
2024-02-27 12:51:57.345 [DBG] HSWC: ChannelLink(a5bbab27e00ba37350b2d8b15aa1751e4bc57800ffb26b55d43db071ad3691ef:3): settle-fail-filter &{5 [0]}
2024-02-27 12:51:57.345 [DBG] HSWC: ChannelLink(a5bbab27e00ba37350b2d8b15aa1751e4bc57800ffb26b55d43db071ad3691ef:3): Failed to send 350114602 mSAT
2024-02-27 12:51:57.345 [DBG] HSWC: ChannelLink(a5bbab27e00ba37350b2d8b15aa1751e4bc57800ffb26b55d43db071ad3691ef:3): Failed to send 600016600 mSAT
2024-02-27 12:51:57.345 [DBG] HSWC: ChannelLink(a5bbab27e00ba37350b2d8b15aa1751e4bc57800ffb26b55d43db071ad3691ef:3): Failed to send 350010100 mSAT
2024-02-27 12:51:57.345 [DBG] HSWC: ChannelLink(a5bbab27e00ba37350b2d8b15aa1751e4bc57800ffb26b55d43db071ad3691ef:3): Failed to send 500153500 mSAT
2024-02-27 12:51:57.345 [DBG] HSWC: ChannelLink(a5bbab27e00ba37350b2d8b15aa1751e4bc57800ffb26b55d43db071ad3691ef:3): Failed to send 50013650 mSAT
2024-02-27 12:51:58.328 [DBG] HSWC: ChannelLink(a5bbab27e00ba37350b2d8b15aa1751e4bc57800ffb26b55d43db071ad3691ef:3): settle-fail-filter &{1 [0]}
2024-02-27 12:51:58.328 [DBG] HSWC: ChannelLink(a5bbab27e00ba37350b2d8b15aa1751e4bc57800ffb26b55d43db071ad3691ef:3): Failed to send 1000022000 mSAT
2024-02-27 12:51:58.741 [DBG] HSWC: ChannelLink(a5bbab27e00ba37350b2d8b15aa1751e4bc57800ffb26b55d43db071ad3691ef:3): queueing keystone of ADD open circuit: (Chan ID=0:0:0, HTLC ID=13878445)->(Chan ID=809410:1386:3, HTLC ID=190617)
2024-02-27 12:51:58.813 [DBG] HSWC: ChannelLink(a5bbab27e00ba37350b2d8b15aa1751e4bc57800ffb26b55d43db071ad3691ef:3): removing Add packet (Chan ID=0:0:0, HTLC ID=13878445) from mailbox
2024-02-27 12:52:17.254 [ERR] HSWC: ChannelLink(a5bbab27e00ba37350b2d8b15aa1751e4bc57800ffb26b55d43db071ad3691ef:3): failing link: ChannelPoint(a5bbab27e00ba37350b2d8b15aa1751e4bc57800ffb26b55d43db071ad3691ef:3): unable to accept new commitment: number of htlc sig mismatch. Expected 9 sigs, got 10 with error: invalid commitment
2024-02-27 12:52:17.254 [ERR] HSWC: ChannelLink(a5bbab27e00ba37350b2d8b15aa1751e4bc57800ffb26b55d43db071ad3691ef:3): link failed, exiting htlcManager
2024-02-27 12:52:17.254 [INF] HSWC: ChannelLink(a5bbab27e00ba37350b2d8b15aa1751e4bc57800ffb26b55d43db071ad3691ef:3): exited
2024-02-27 12:52:17.255 [INF] HSWC: ChannelLink(a5bbab27e00ba37350b2d8b15aa1751e4bc57800ffb26b55d43db071ad3691ef:3): stopping
2024-02-27 12:52:17.256 [WRN] PEER: Peer(029efe15ef5f0fcc2fdd6b910405e78056b28c9b64e1feff5f13b8dce307e67cad): Force closing link(809410:1386:3)
C-Otto commented 8 months ago

I got this from the other node, running CLN:

2024-02-27T11:50:35.238Z UNUSUAL 029efe15ef5f0fcc2fdd6b910405e78056b28c9b64e1feff5f13b8dce307e67cad-chan#22823: Peer permanent failure in CHANNELD_NORMAL: channeld: received ERROR channel cf53b7daf485e634a2c8f031c160dd20f9a9e46d6908423d9f9107ebe2b17433: invalid revocation