lightningnetwork / lnd

Lightning Network Daemon ⚡️
MIT License
7.69k stars 2.08k forks source link

[bug]: open channel fails - tx not broadcast, remote node with stuck pending channel #8327

Closed mrfelton closed 10 months ago

mrfelton commented 10 months ago

Hi, we have a reoccurring scenario in which channel open attempts fail - no channel opening transaction is broadcast, and the channel does not exist in the node's channel database after the opening attempt, but the remote node does still have a record of the channel in its db and has it stuck in a pending open state.

I'd like to understand how/why this happens, and what could be done to prevent it.

Here are some logs related to one of these events (we have had at least three of them in the last couple of weeks).

2023-12-22 03:10:56.304 [INF] FNDG: Initiating fundingRequest(local_amt=0.56746331 BTC (subtract_fees=false), push_amt=0 mSAT, chain_hash=000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f, peer=034d7f4bbbd6c1c1d8fbe0a42dd1f59e10b66540c6872dfcaa095d8d5cffebcf46, min_confs=1)
2023-12-22 03:10:56.331 [INF] FNDG: Target commit tx sat/kw for pendingID(fd5f716bb39d8bc5e863926a6744e5a7a3f8d8a3ed7a7b840ca9cfa67e760497): 2500
2023-12-22 03:10:56.331 [INF] FNDG: Dust limit for pendingID(fd5f716bb39d8bc5e863926a6744e5a7a3f8d8a3ed7a7b840ca9cfa67e760497): 0.00000354 BTC
2023-12-22 03:10:56.331 [INF] FNDG: Starting funding workflow with 34.138.210.19:9735 for pending_id(fd5f716bb39d8bc5e863926a6744e5a7a3f8d8a3ed7a7b840ca9cfa67e760497), committype=anchors-zero-fee-second-level
2023-12-22 03:10:56.387 [INF] FNDG: Recv'd fundingResponse for pending_id(fd5f716bb39d8bc5e863926a6744e5a7a3f8d8a3ed7a7b840ca9cfa67e760497)
2023-12-22 03:10:56.387 [INF] FNDG: pendingChan(fd5f716bb39d8bc5e863926a6744e5a7a3f8d8a3ed7a7b840ca9cfa67e760497): remote party proposes num_confs=6, csv_delay=2016
2023-12-22 03:10:56.494 [INF] FNDG: Generated ChannelPoint(dbe8a0325a9ddb75489e446b01b9f96d019809171c2fda44374ad40b0da7cd2e:0) for pending_id(fd5f716bb39d8bc5e863926a6744e5a7a3f8d8a3ed7a7b840ca9cfa67e760497)
2023-12-22 03:10:56.809 [INF] CNCT: Creating new ChannelArbitrator for ChannelPoint(dbe8a0325a9ddb75489e446b01b9f96d019809171c2fda44374ad40b0da7cd2e:0)
2023-12-22 03:10:56.813 [INF] CNCT: ChannelArbitrator(dbe8a0325a9ddb75489e446b01b9f96d019809171c2fda44374ad40b0da7cd2e:0): starting state=StateDefault, trigger=chainTrigger, triggerHeight=822328
2023-12-22 03:10:56.814 [INF] NTFN: New spend subscription: spend_id=330, outpoint=dbe8a0325a9ddb75489e446b01b9f96d019809171c2fda44374ad40b0da7cd2e:0, script=0 9c1cc94d51724624ba3b0a7e89f3e1cfd1fc199972b4a46cc7356242b4a57df8, height_hint=822328
2023-12-22 03:10:56.814 [INF] NTFN: Dispatching historical spend rescan for outpoint=dbe8a0325a9ddb75489e446b01b9f96d019809171c2fda44374ad40b0da7cd2e:0, script=0 9c1cc94d51724624ba3b0a7e89f3e1cfd1fc199972b4a46cc7356242b4a57df8, start=822328, end=822328
2023-12-22 03:10:56.831 [INF] FNDG: Finalizing pending_id(fd5f716bb39d8bc5e863926a6744e5a7a3f8d8a3ed7a7b840ca9cfa67e760497) over ChannelPoint(dbe8a0325a9ddb75489e446b01b9f96d019809171c2fda44374ad40b0da7cd2e:0), waiting for channel open on-chain
2023-12-22 03:10:56.831 [INF] CNCT: Close observer for ChannelPoint(dbe8a0325a9ddb75489e446b01b9f96d019809171c2fda44374ad40b0da7cd2e:0) active
2023-12-22 03:10:56.833 [INF] NTFN: New confirmation subscription: conf_id=243, txid=dbe8a0325a9ddb75489e446b01b9f96d019809171c2fda44374ad40b0da7cd2e, num_confs=6 height_hint=822328
2023-12-22 03:10:56.833 [INF] FNDG: Waiting for funding tx (dbe8a0325a9ddb75489e446b01b9f96d019809171c2fda44374ad40b0da7cd2e) to reach 6 confirmations
2023-12-22 03:10:56.843 [INF] NTFN: Historical spend dispatch finished for request outpoint=dbe8a0325a9ddb75489e446b01b9f96d019809171c2fda44374ad40b0da7cd2e:0, script=0 9c1cc94d51724624ba3b0a7e89f3e1cfd1fc199972b4a46cc7356242b4a57df8 (start=822328 end=822328) with details: <nil>
2023-12-22 03:10:56.863 [INF] LNWL: Inserting unconfirmed transaction dbe8a0325a9ddb75489e446b01b9f96d019809171c2fda44374ad40b0da7cd2e

Replacing disk backup for ChannelPoint(.....:1) w/ newer version
.... [repeats for every channel]

2023-12-22 03:10:56.934 [INF] CHBU: Updating backup file at /lnd/.lnd/data/chain/bitcoin/mainnet/channel.backup
2023-12-22 03:10:56.938 [INF] CHBU: Swapping old multi backup file from /lnd/.lnd/data/chain/bitcoin/mainnet/temp-dont-use.backup to /lnd/.lnd/data/chain/bitcoin/mainnet/channel.backup
2023-12-22 03:10:56.942 [INF] LNWL: Removed invalid transaction: (*wire.MsgTx)(0xc022e77080)({
Version: (int32) 2,
TxIn: ([]*wire.TxIn) (len=1 cap=1) {
(*wire.TxIn)(0xc0171f12c0)({
PreviousOutPoint: (wire.OutPoint) dede9fae864e2b4c00a320c74c5afd5b65ccfb5e9e270feb77dd35e21bbadf91:0,
SignatureScript: ([]uint8) {
},
Witness: (wire.TxWitness) (len=1 cap=1) {
([]uint8) (len=64 cap=64) {
00000000 49 33 32 46 2e 0a 41 ef a3 fa c6 bb 0a 07 08 d6 |I32F..A.........|
00000010 f6 03 ea bf fe e5 8c 7e 9d ad 30 d8 da ae e8 94 |.......~..0.....|
00000020 90 13 f8 a4 6a ca e4 e1 49 f4 af e3 01 51 35 06 |....j...I....Q5.|
00000030 1e 63 64 54 2b 63 22 40 5a 60 6d 72 d2 d6 44 47 |.cdT+c"@Z`mr..DG|
}
},
Sequence: (uint32) 4294967295
})
},
TxOut: ([]*wire.TxOut) (len=2 cap=2) {
(*wire.TxOut)(0xc022e770c0)({
Value: (int64) 56746331,
PkScript: ([]uint8) (len=34 cap=34) {
00000000 00 20 9c 1c c9 4d 51 72 46 24 ba 3b 0a 7e 89 f3 |. ...MQrF$.;.~..|
00000010 e1 cf d1 fc 19 99 72 b4 a4 6c c7 35 62 42 b4 a5 |......r..l.5bB..|
00000020 7d f8 |}.|
}
}),
(*wire.TxOut)(0xc022e770e0)({
Value: (int64) 93553143,
PkScript: ([]uint8) (len=22 cap=22) {
00000000 00 14 86 dd 1c 3a b1 26 be 4c 70 a0 99 c1 fa 1e |.....:.&.Lp.....|
00000010 39 a6 c1 63 bd 60 |9..c.`|
}
})
},
LockTime: (uint32) 0
})
{}
2023-12-22 03:10:56.942 [ERR] RPCS: [/walletrpc.WalletKit/PublishTransaction]: transaction rejected: output already spent

....

2023-12-22 03:10:57.007 [INF] CNCT: Marking ChannelPoint(dbe8a0325a9ddb75489e446b01b9f96d019809171c2fda44374ad40b0da7cd2e:0) fully resolved
2023-12-22 03:10:57.011 [WRN] CHFT: could not remove channel: channel: dbe8a0325a9ddb75489e446b01b9f96d019809171c2fda44374ad40b0da7cd2e:0 not present
2024-01-01 03:26:54.844 [ERR] FNDG: Unable to advance pending state of ChannelPoint(dbe8a0325a9ddb75489e446b01b9f96d019809171c2fda44374ad40b0da7cd2e:0): error waiting for funding confirmation for ChannelPoint(dbe8a0325a9ddb75489e446b01b9f96d019809171c2fda44374ad40b0da7cd2e:0): funding manager shutting down
2024-01-01 03:26:54.844 [WRN] FNDG: fundingManager shutting down, stopping funding flow for ChannelPoint(dbe8a0325a9ddb75489e446b01b9f96d019809171c2fda44374ad40b0da7cd2e:0)

After this, there seems to be no record of the channel or it's funding transaction on the node or in the mempool.

However, the channel shows up on the remote node as pending open:

{
    "channel":  {
        "remote_node_pub":  "027cd974e47086291bb8a5b0160a889c738f2712a703b8ea939985fd16f3aae67e",
        "channel_point":  "dbe8a0325a9ddb75489e446b01b9f96d019809171c2fda44374ad40b0da7cd2e:0",
        "capacity":  "56746331",
        "local_balance":  "0",
        "remote_balance":  "56742861",
        "local_chan_reserve_sat":  "567463",
        "remote_chan_reserve_sat":  "567463",
        "initiator":  "INITIATOR_REMOTE",
        "commitment_type":  "ANCHORS",
        "num_forwarding_packages":  "0",
        "chan_status_flags":  "",
        "private":  false,
        "memo":  ""
    },
    "commit_fee":  "2810",
    "commit_weight":  "772",
    "fee_per_kw":  "2500",
    "funding_expiry_blocks":  529
}

After restarting the node some time later, some log messages about this channel do show up during the restart, although there is no change with the channel state (or lack of it).

2024-01-01 03:26:54.844 [ERR] FNDG: Unable to advance pending state of ChannelPoint(dbe8a0325a9ddb75489e446b01b9f96d019809171c2fda44374ad40b0da7cd2e:0): error waiting for funding confirmation for ChannelPoint(dbe8a0325a9ddb75489e446b01b9f96d019809171c2fda44374ad40b0da7cd2e:0): funding manager shutting down
2024-01-01 03:26:54.844 [WRN] FNDG: fundingManager shutting down, stopping funding flow for ChannelPoint(dbe8a0325a9ddb75489e446b01b9f96d019809171c2fda44374ad40b0da7cd2e:0)

I believe the remote node will eventually forget about this channel after some time (~2 weeks from the opening attempt), however it's problematic for us as it leaves the remote node in a state where it thinks there is a pending channel, which counts towards maxpendingchannels and can result in the remote node rejecting additional channel opens from the node.

It seems that the channel opening transaction will never get broadcast - so is this channel safe to abandon on the remote side in order to clear it from their pending channels list?

Why could this have happened in the first place? Thanks!

Is the transaction rejected: output already spent log entry related? I can't tell because I can't finds anything in that log message, or in the Removed invalid transaction: (*wire.MsgTx)(0xc022e77080) log message that proceeds it that correlates with the channel open request, other than the fact that it occurred right during the middle of the channel open attempt at the same timestamp.

If it is related to the doublespend, how can we prevent channel opening attempts from double spends that cause failures or detect this and self heal more gracefully?

yyforyongyu commented 10 months ago

I find this log abnormal,

2023-12-22 03:10:57.007 [INF] CNCT: Marking ChannelPoint(dbe8a0325a9ddb75489e446b01b9f96d019809171c2fda44374ad40b0da7cd2e:0) fully resolved

As this is a channel opening and nothing should be resolved by contract court. Could you do a search over dbe8a0325a9ddb75489e446b01b9f96d019809171c2fda44374ad40b0da7cd2e and post all the relevant logs?

mrfelton commented 10 months ago

Could you do a search over dbe8a0325a9ddb75489e446b01b9f96d019809171c2fda44374ad40b0da7cd2e and post all the relevant logs?

that is what the logs above are. Those show all references to that channel point and the pending channel id

yyforyongyu commented 10 months ago

The above log means this line was executed, https://github.com/lightningnetwork/lnd/blob/ab8fb2c9c336232f86fa6bdbcac09027ec46c39d/contractcourt/chain_arbitrator.go#L442

which is a logging generated from the method ResolveContract, but we know it's not from contractcourt here, because there's no relevant logging:

https://github.com/lightningnetwork/lnd/blob/ab8fb2c9c336232f86fa6bdbcac09027ec46c39d/contractcourt/channel_arbitrator.go#L1277-L1280

there's another occurrence in contractcourt, which is when the channel gets resolved, but since it's not force closed, so that's also not the case.

Then the only place is in rpcserver, inside abandonChan, https://github.com/lightningnetwork/lnd/blob/ab8fb2c9c336232f86fa6bdbcac09027ec46c39d/rpcserver.go#L2793

and this can only happen via RPC BatchOpenChannel or AbandonChannel. So what commands did you use to open the channel and, do you have any occurrence of these RPC calls in your logs?

mrfelton commented 10 months ago

~The channel was opened with the OpenChannel RPC. Neither BatchOpenChannel or AbandonChannel was called.~

EDIT: The channel was opened using a PSBT flow. And looking at our code it is possible that abandon channel could have been called as we do call this in the event that we detected that a channel opening attempt failed. Will need to do some more digging in our application logs to determine the exact sequence of events.

mrfelton commented 10 months ago

A little more digging in our application logs shows that indeed it does look related to the double spend. It appears that the channel opening attempt failed due the double spend - I see transaction rejected: output already spent error from when the PSBT to open the channel was published.

Then, as part of an automated cleanup it seems we have called abandon channel. We do this because we have found that if we do not abandon the failed channel opening attempt, it will remain stuck in a pending state forever.

ziggie1984 commented 10 months ago

Do you use the psbt flow in combination with the lnd psbt fund cmd, otherwise pretty strange why it used an already spent output.

mrfelton commented 10 months ago

Yes, It's done using the WalletKit FundPsbt RPC. I suspect we may have had a race condition with another transaction attempting to use the same inputs.

ziggie1984 commented 10 months ago

Yes either that, or try not to use unconfirmed outputs for now because I think it has to do with the sweeping of the anchor output which is RBFed by a third party

mrfelton commented 10 months ago

Ok, interesting. We do currently set SpendUnconfirmed when funding the Psbt.

mrfelton commented 10 months ago

After updating to not spend unconfirmed outputs we no longer see this problem.

I think it has to do with the sweeping of the anchor output which is RBFed by a third party

Is there an existing issue that tracks the root problem here?

ziggie1984 commented 10 months ago

https://github.com/lightningnetwork/lnd/pull/7939

guggero commented 10 months ago

Closing in favor of https://github.com/lightningnetwork/lnd/issues/7602 (identified as the root problem).