Perhaps this is actually the issue: https://github.com/lightningnetwork/lnd/blob/e78b0bb8532dd181cfa112790113335d65937e37/lnwallet/channel.go#L4701-L4706
It's gated on the lack of an aux signer there, instead of on both the aux blob and the aux signer.
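As a hedged sketch of that suggested gate (all names here are illustrative, not the actual `channel.go` code), the idea would be to require both conditions before creating aux sig jobs:

```go
package main

import "fmt"

// Illustrative stand-ins; the real code uses lnd's option types and the
// channel state, not these.
type commitCtx struct {
	hasAuxSigner bool
	auxBlob      []byte // nil means no aux blob for this channel
}

// needAuxSigs sketches the suggested gate: require BOTH an aux signer and
// an aux blob before creating aux sig jobs, instead of the signer alone.
func needAuxSigs(c commitCtx) bool {
	return c.hasAuxSigner && c.auxBlob != nil
}

func main() {
	// Aux signer present but no blob: with the signer-only gate this
	// would still schedule jobs; with the suggested gate it does not.
	fmt.Println(needAuxSigs(commitCtx{hasAuxSigner: true}))                      // false
	fmt.Println(needAuxSigs(commitCtx{hasAuxSigner: true, auxBlob: []byte{1}})) // true
}
```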
Another hunch here is that we may be missing proper cancel/quit channels in this context, which would then prevent a proper shutdown of the channel.
Another relevant part I saw in the goroutine stack dump when trimming it a bit:
```
goroutine 1555 [semacquire, 1055 minutes]:
sync.runtime_Semacquire(0xc0053ad098?)
	runtime/sema.go:62 +0x25
sync.(*WaitGroup).Wait(0xc00161c3c0?)
	sync/waitgroup.go:116 +0x48
github.com/lightningnetwork/lnd/contractcourt.(*ChannelArbitrator).Stop(0x20324a0?)
	github.com/lightningnetwork/lnd@v0.18.0-beta.rc4.0.20240730143253-1b353b0bfd58/contractcourt/channel_arbitrator.go:837 +0x195
github.com/lightningnetwork/lnd/contractcourt.(*ChainArbitrator).ResolveContract(0xc000370d88, {{0xea, 0x94, 0x46, 0x3f, 0x1d, 0xd2, 0x74, 0xbb, 0x99, ...}, ...})
	github.com/lightningnetwork/lnd@v0.18.0-beta.rc4.0.20240730143253-1b353b0bfd58/contractcourt/chain_arbitrator.go:527 +0x21c
github.com/lightningnetwork/lnd/contractcourt.newActiveChannelArbitrator.func5()
	github.com/lightningnetwork/lnd@v0.18.0-beta.rc4.0.20240730143253-1b353b0bfd58/contractcourt/chain_arbitrator.go:463 +0x78
github.com/lightningnetwork/lnd/contractcourt.(*ChannelArbitrator).stateStep(0xc002a97888, 0xd1ac8, 0x4, 0x0?)
	github.com/lightningnetwork/lnd@v0.18.0-beta.rc4.0.20240730143253-1b353b0bfd58/contractcourt/channel_arbitrator.go:1290 +0x1129
github.com/lightningnetwork/lnd/contractcourt.(*ChannelArbitrator).advanceState(0xc002a97888, 0xd1ac8, 0x4, 0x0)
	github.com/lightningnetwork/lnd@v0.18.0-beta.rc4.0.20240730143253-1b353b0bfd58/contractcourt/channel_arbitrator.go:1615 +0x165
github.com/lightningnetwork/lnd/contractcourt.(*ChannelArbitrator).channelAttendant(0xc002a97888, 0xd1a96)
	github.com/lightningnetwork/lnd@v0.18.0-beta.rc4.0.20240730143253-1b353b0bfd58/contractcourt/channel_arbitrator.go:2855 +0x5ac
created by github.com/lightningnetwork/lnd/contractcourt.(*ChannelArbitrator).Start in goroutine 507
	github.com/lightningnetwork/lnd@v0.18.0-beta.rc4.0.20240730143253-1b353b0bfd58/contractcourt/channel_arbitrator.go:570 +0x66e
```
Full goroutine trace is here: https://0bin.net/paste/H8EBWoac#1FnM7HsS3kPWLeRlPisbgVPeU1fux+-HeTCA9eytFgs
Re the cancel chan, we pass one in here: https://github.com/lightningnetwork/lnd/blob/e78b0bb8532dd181cfa112790113335d65937e37/lnwallet/aux_signer.go#L74, but then don't select on it for the send in `tapd`, nor the recv in `lnd`.
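A sketch of the guarded pattern that seems to be missing (hypothetical helper names; the real send lives in `tapd`'s `aux_leaf_signer.go` and the recv in `lnd`): both sides select on the job's cancel channel and the subsystem's quit channel instead of doing a bare send/recv:

```go
package main

import (
	"errors"
	"fmt"
)

// resp is a stand-in for the aux sig job response type.
type resp struct{ err error }

// sendResp delivers r without risking a permanent block: if the job is
// cancelled or the signer is shutting down, we give up instead of hanging.
func sendResp(out chan<- resp, r resp, cancel, quit <-chan struct{}) error {
	select {
	case out <- r:
		return nil
	case <-cancel:
		return errors.New("job cancelled")
	case <-quit:
		return errors.New("signer shutting down")
	}
}

// recvResp mirrors the same guard on the receiving (lnd) side.
func recvResp(in <-chan resp, cancel, quit <-chan struct{}) (resp, error) {
	select {
	case r := <-in:
		return r, nil
	case <-cancel:
		return resp{}, errors.New("job cancelled")
	case <-quit:
		return resp{}, errors.New("shutting down")
	}
}

func main() {
	out := make(chan resp, 1)
	cancel := make(chan struct{})
	quit := make(chan struct{})

	_ = sendResp(out, resp{}, cancel, quit)
	r, err := recvResp(out, cancel, quit)
	fmt.Println(r, err)
}
```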
The receive should have a sig job? Is the usage uniform (quit/timeout flow)?
@jharveyb will provide a context update
> Perhaps this is actually the issue: https://github.com/lightningnetwork/lnd/blob/e78b0bb8532dd181cfa112790113335d65937e37/lnwallet/channel.go#L4701-L4706
>
> It's gated on lack of aux signer there, instead of blob and aux signer.

I think if the `Blob` is `None`, the `tapd` behavior is safe; the sig job resp would have a `None` `SigBlob`, so later on, when the sigs are sent back to `tapd` to be packed into a tlv `Blob`, the signature would be skipped. Though that is also the point where we are blocking:
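To illustrate that skip-on-`None` behavior (a toy option shape; the real code uses lnd's `fn.Option` and tlv records, and these names are made up):

```go
package main

import "fmt"

// sigResp is a toy stand-in for a sig job response whose blob may be empty.
type sigResp struct {
	sigBlob []byte // nil plays the role of a None SigBlob
}

// packBlobs keeps only responses that actually carry a blob, mirroring how
// a None SigBlob would simply be skipped when packing the tlv blob.
func packBlobs(resps []sigResp) [][]byte {
	var packed [][]byte
	for _, r := range resps {
		if r.sigBlob == nil {
			continue // no aux sig for this HTLC; skip it
		}
		packed = append(packed, r.sigBlob)
	}
	return packed
}

func main() {
	resps := []sigResp{{sigBlob: nil}, {sigBlob: []byte{0x01}}}
	fmt.Println(len(packBlobs(resps))) // 1: the None blob was skipped
}
```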
> Re cancel chan, we pass one in here: https://github.com/lightningnetwork/lnd/blob/e78b0bb8532dd181cfa112790113335d65937e37/lnwallet/aux_signer.go#L74 but then don't select on it for the send in tapd, nor the recv in lnd

I agree that this seems like the most likely root cause. I've been comparing the behavior for aux sigs against what happens for HTLC sigs. AFAICT, the HTLC sig jobs are made in the same order as the aux sigs, and then submitted for processing here:
Those jobs are processed here:

Where we do skip a job if the cancel channel was closed, and skip all jobs if `quit` is received by that worker (not sure where that signal would come from).
Job submission also exits early on cancel or quit:
It seems like jobs would be cancelled by the submitter, in case of other errors:
And that cancel channel is shared amongst all jobs.
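The overall shape of that HTLC sig pipeline, paraphrased (this is a toy, not the actual `lnwallet` sig pool code): a batch-wide cancel channel that skips individual jobs, and a quit channel that stops the worker outright:

```go
package main

import "fmt"

type sigJob struct {
	id     int
	cancel <-chan struct{} // shared across the whole batch
	resp   chan<- int
}

// worker processes jobs one at a time: a closed cancel channel skips the
// current job, a closed quit channel stops the worker for good.
func worker(jobs <-chan sigJob, quit <-chan struct{}) {
	for {
		select {
		case job := <-jobs:
			select {
			case <-job.cancel:
				continue // batch was cancelled; skip this job
			case <-quit:
				return
			default:
			}
			job.resp <- job.id // stand-in for actually signing
		case <-quit:
			return
		}
	}
}

func main() {
	jobs := make(chan sigJob)
	quit := make(chan struct{})
	cancel := make(chan struct{})
	resp := make(chan int, 1)

	go worker(jobs, quit)
	jobs <- sigJob{id: 7, cancel: cancel, resp: resp}
	fmt.Println(<-resp)
	close(quit)
}
```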
For the aux sigs, the flow is different in a few ways. Firstly, the job submission func returns immediately:

Within the job processing goroutine, we only check for `Quit` once, instead of at every point where we try to send on a channel:

And we never check the job's cancel channel. So updating the `tapd` side to catch those signals seems like a good first fix.
I think the proper handling of `Quit` should be unit testable from the `tapd` repo.
Another bit I'm confused about is the sorting of sig jobs. AFAICT they are generated in order of (incoming HTLCs, outbound HTLCs), submitted to other goroutines, sorted in-place by output index, and then waited on.
I think the sorting after submission could cause extra problems with signal handling. Example:

- The first job submitted in `auxSigBatch` will be the last job in that slice after BIP 69 sorting.
- `tapd` receives a `Quit` signal while processing that first job; a cancel signal is sent, along with a sig job resp carrying an error.
- `lnd` is now waiting on a response for a job that will never get processed and hangs. The error sent by `tapd` will also never be received.

I think adding the cancel and quit handling on both sides would address this, but I'm confused as to why we don't sort the jobs right after generation so that the wait order matches the processing order.
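A sketch of that reordering (illustrative job type; the real batch is `lnwallet.AuxSigJob` sorted by output index): sort right after generation, before submission, so the order we wait in matches the order the signer processes in:

```go
package main

import (
	"fmt"
	"sort"
)

type auxSigJob struct {
	outputIndex int
	resp        chan struct{}
}

func main() {
	// Jobs generated in (incoming, outgoing) HTLC order.
	jobs := []auxSigJob{
		{outputIndex: 2, resp: make(chan struct{}, 1)},
		{outputIndex: 0, resp: make(chan struct{}, 1)},
		{outputIndex: 1, resp: make(chan struct{}, 1)},
	}

	// Sort BEFORE submitting, so the order we later wait in is the same
	// order the signer processes in.
	sort.Slice(jobs, func(i, j int) bool {
		return jobs[i].outputIndex < jobs[j].outputIndex
	})

	submit(jobs) // hand the already-sorted batch to the signer

	// Waiting in slice order now matches processing order: if the first
	// job errors and the rest are cancelled, we hear about it first.
	for _, j := range jobs {
		<-j.resp
	}
	fmt.Println("all responses received in processing order")
}

// submit stands in for sending the batch to the aux signer, which works
// through it in slice order.
func submit(jobs []auxSigJob) {
	go func() {
		for _, j := range jobs {
			j.resp <- struct{}{}
		}
	}()
}
```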
This is great debugging! Should we resolve the job sorting question next?
Ok, following on from review in #1118: the job cancel channel can end up closed from both `tapd` and `lnd`, so this requires changes in both systems to make this safe. My current change is a working fix, but I think there are some better options:
1. Use a `context.Context` with a `CancelFunc`. Under the hood, this is a channel guarded by a mutex, which ensures the cancel func has a side effect exactly once. A bit heavy-handed here, as we don't need any context inheritance.
2. Use a `sync.Once` to close the channel. This will save us from a possible double close, without adding the overhead of the atomic `Load()` that's used in the `context` package.
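A minimal sketch of that second option (hypothetical wrapper type; the real change would land on `lnwallet.AuxSigJob`/`lnwallet.SignJob`):

```go
package main

import (
	"fmt"
	"sync"
)

// cancelChan pairs a channel with a sync.Once so Cancel is safe to call
// any number of times, from any goroutine.
type cancelChan struct {
	c    chan struct{}
	once sync.Once
}

func newCancelChan() *cancelChan {
	return &cancelChan{c: make(chan struct{})}
}

// Cancel closes the channel exactly once; later calls are no-ops.
func (cc *cancelChan) Cancel() {
	cc.once.Do(func() { close(cc.c) })
}

// Done exposes the channel for use in select statements.
func (cc *cancelChan) Done() <-chan struct{} {
	return cc.c
}

func main() {
	cc := newCancelChan()
	cc.Cancel()
	cc.Cancel() // would panic with a bare close(cc.c); harmless here

	select {
	case <-cc.Done():
		fmt.Println("cancelled")
	default:
		fmt.Println("still live")
	}
}
```

Either side can then call `Cancel` without coordinating who closes first.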
To pursue that second option, we'll need a follow-up PR in `lnd` to change the types `lnwallet.AuxSigJob` and `lnwallet.SignJob`. That can be followed by a PR in `tapd` to use the new job type.
I'll add these changes to https://github.com/lightningnetwork/lnd/pull/9074 and then #1118.
@jharveyb looking at this more closely, I don't think we need either of those solutions, as the validator is never expected to close that channel. See https://github.com/lightningnetwork/lnd/pull/9074#discussion_r1754256807.
Writing up the offline discussion: synced with Oli, and I agree that we can solve this on the `tapd` side by:

- Skipping all pending jobs if any job errs.
- Selecting on the `cancel` channel.

Then for `lnd`: will update the existing PRs in-place.
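A sketch of that `tapd`-side skip behavior (illustrative shape, not the actual `aux_leaf_signer.go` code): once one job in the batch errs, fire the shared cancel channel and skip everything still pending:

```go
package main

import (
	"errors"
	"fmt"
)

type job struct {
	id   int
	resp chan error
}

// processBatch signs jobs in order. On the first error it fires the shared
// cancel channel and skips every job still pending instead of hanging on
// their sends.
func processBatch(jobs []job, cancel chan struct{}, sign func(job) error) {
	for _, j := range jobs {
		select {
		case <-cancel:
			continue // batch already cancelled; skip this job
		default:
		}

		if err := sign(j); err != nil {
			close(cancel) // tell everyone else to stand down
			j.resp <- err
			continue
		}
		j.resp <- nil
	}
}

func main() {
	cancel := make(chan struct{})
	jobs := []job{
		{id: 0, resp: make(chan error, 1)},
		{id: 1, resp: make(chan error, 1)},
	}

	// Fail the first job; the second is then skipped.
	processBatch(jobs, cancel, func(j job) error {
		if j.id == 0 {
			return errors.New("sign failed")
		}
		return nil
	})

	fmt.Println(<-jobs[0].resp) // sign failed
	select {
	case err := <-jobs[1].resp:
		fmt.Println("job 1 resp:", err)
	default:
		fmt.Println("job 1 skipped, no response sent")
	}
}
```

Note the skipped job never receives a response, which is exactly why `lnd` also needs to select on the cancel channel while waiting.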
Given the above, do we have an answer to this question: https://github.com/lightninglabs/taproot-assets/pull/1118#pullrequestreview-2291168789?
Also re context cancel, looks to be thread safe: https://cs.opensource.google/go/go/+/refs/tags/go1.23.1:src/context/context.go;l=665-677
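For reference, the cancel func returned by `context.WithCancel` is documented to be safe for concurrent use, and calling it more than once is a no-op after the first call:

```go
package main

import (
	"context"
	"fmt"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	cancel()
	cancel() // safe: cancellation only takes effect once

	<-ctx.Done()
	fmt.Println(ctx.Err()) // context canceled
}
```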
> Given the above, do we have an answer to this question: #1118 (review)?
From the stack trace, is it certain that `tapd` was hanging on that send, or just that that's where that goroutine was during shutdown?
Background
A user has noticed that occasionally the daemon may freeze up when signing second-level HTLCs.
We were able to obtain a trace confirming that the daemon was indeed deadlocked, with `lnd` waiting for a channel response from `tapd`, while `tapd` is attempting to send a response to `lnd`.

Here's a diagram from Claude based on the trace above:
From the above, we can see that we have a circular waiting dependency.
As is, the channel created for the transfer of sigs is always buffered: https://github.com/lightningnetwork/lnd/blob/e78b0bb8532dd181cfa112790113335d65937e37/lnwallet/aux_signer.go#L58-L76
We're then blocking here in `tapd`: https://github.com/lightninglabs/taproot-assets/blob/0551a3f6c147c085b1e520ae192806f15873e24f/tapchannel/aux_leaf_signer.go#L265-L273

One thing to note is that in `tapd`, this is the send when there's no aux blob for a channel.

Expected behavior
Buffered channel send never blocks.
Actual behavior
Buffered channel send blocks. Potentially there's some underlying mutation here.
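For context on why the expected invariant can fail (a toy demo, unrelated to the actual lnd/tapd code): a buffered send still blocks once the buffer is full, and a send on a nil channel blocks forever, which is why an unexpected second send, or a mutation of the channel variable itself, is the thing to look for:

```go
package main

import (
	"fmt"
	"time"
)

// trySend attempts a send with a short timeout so we can observe whether
// the send would have blocked.
func trySend(ch chan int, v int) bool {
	select {
	case ch <- v:
		return true
	case <-time.After(10 * time.Millisecond):
		return false // would have blocked
	}
}

func main() {
	ch := make(chan int, 1)

	fmt.Println(trySend(ch, 1)) // true: buffer has room
	fmt.Println(trySend(ch, 2)) // false: buffer full, send blocks

	var nilCh chan int
	fmt.Println(trySend(nilCh, 3)) // false: a send on a nil channel blocks forever
}
```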