Perhaps this is actually the issue: https://github.com/lightningnetwork/lnd/blob/e78b0bb8532dd181cfa112790113335d65937e37/lnwallet/channel.go#L4701-L4706
It's gated on the lack of an aux signer there, instead of on both the aux blob and the aux signer.
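As a hedged sketch of that suggested gate (all names here are illustrative, not the actual `channel.go` code), the idea would be to require both conditions before creating aux sig jobs:

```go
package main

import "fmt"

// Illustrative stand-ins; the real code uses lnd's option types and the
// channel state, not these.
type commitCtx struct {
	hasAuxSigner bool
	auxBlob      []byte // nil means no aux blob for this channel
}

// needAuxSigs sketches the suggested gate: require BOTH an aux signer and
// an aux blob before creating aux sig jobs, instead of the signer alone.
func needAuxSigs(c commitCtx) bool {
	return c.hasAuxSigner && c.auxBlob != nil
}

func main() {
	// Aux signer present but no blob: with the signer-only gate this
	// would still schedule jobs; with the suggested gate it does not.
	fmt.Println(needAuxSigs(commitCtx{hasAuxSigner: true}))                      // false
	fmt.Println(needAuxSigs(commitCtx{hasAuxSigner: true, auxBlob: []byte{1}})) // true
}
```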
Another hunch here is that we may be missing proper cancel/quit channels in this context, which would then prevent a proper shutdown of the channel.
Another relevant part I saw in the goroutine stack dump when trimming it a bit:
```
goroutine 1555 [semacquire, 1055 minutes]:
sync.runtime_Semacquire(0xc0053ad098?)
	runtime/sema.go:62 +0x25
sync.(*WaitGroup).Wait(0xc00161c3c0?)
	sync/waitgroup.go:116 +0x48
github.com/lightningnetwork/lnd/contractcourt.(*ChannelArbitrator).Stop(0x20324a0?)
	github.com/lightningnetwork/lnd@v0.18.0-beta.rc4.0.20240730143253-1b353b0bfd58/contractcourt/channel_arbitrator.go:837 +0x195
github.com/lightningnetwork/lnd/contractcourt.(*ChainArbitrator).ResolveContract(0xc000370d88, {{0xea, 0x94, 0x46, 0x3f, 0x1d, 0xd2, 0x74, 0xbb, 0x99, ...}, ...})
	github.com/lightningnetwork/lnd@v0.18.0-beta.rc4.0.20240730143253-1b353b0bfd58/contractcourt/chain_arbitrator.go:527 +0x21c
github.com/lightningnetwork/lnd/contractcourt.newActiveChannelArbitrator.func5()
	github.com/lightningnetwork/lnd@v0.18.0-beta.rc4.0.20240730143253-1b353b0bfd58/contractcourt/chain_arbitrator.go:463 +0x78
github.com/lightningnetwork/lnd/contractcourt.(*ChannelArbitrator).stateStep(0xc002a97888, 0xd1ac8, 0x4, 0x0?)
	github.com/lightningnetwork/lnd@v0.18.0-beta.rc4.0.20240730143253-1b353b0bfd58/contractcourt/channel_arbitrator.go:1290 +0x1129
github.com/lightningnetwork/lnd/contractcourt.(*ChannelArbitrator).advanceState(0xc002a97888, 0xd1ac8, 0x4, 0x0)
	github.com/lightningnetwork/lnd@v0.18.0-beta.rc4.0.20240730143253-1b353b0bfd58/contractcourt/channel_arbitrator.go:1615 +0x165
github.com/lightningnetwork/lnd/contractcourt.(*ChannelArbitrator).channelAttendant(0xc002a97888, 0xd1a96)
	github.com/lightningnetwork/lnd@v0.18.0-beta.rc4.0.20240730143253-1b353b0bfd58/contractcourt/channel_arbitrator.go:2855 +0x5ac
created by github.com/lightningnetwork/lnd/contractcourt.(*ChannelArbitrator).Start in goroutine 507
	github.com/lightningnetwork/lnd@v0.18.0-beta.rc4.0.20240730143253-1b353b0bfd58/contractcourt/channel_arbitrator.go:570 +0x66e
```
Full goroutine trace is here: https://0bin.net/paste/H8EBWoac#1FnM7HsS3kPWLeRlPisbgVPeU1fux+-HeTCA9eytFgs
Re the cancel chan, we pass one in here: https://github.com/lightningnetwork/lnd/blob/e78b0bb8532dd181cfa112790113335d65937e37/lnwallet/aux_signer.go#L74, but then don't select on it for the send in `tapd`, nor the recv in `lnd`.
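A sketch of the guarded pattern that seems to be missing (hypothetical helper names; the real send lives in `tapd`'s `aux_leaf_signer.go` and the recv in `lnd`): both sides select on the job's cancel channel and the subsystem's quit channel instead of doing a bare send/recv:

```go
package main

import (
	"errors"
	"fmt"
)

// resp is a stand-in for the aux sig job response type.
type resp struct{ err error }

// sendResp delivers r without risking a permanent block: if the job is
// cancelled or the signer is shutting down, we give up instead of hanging.
func sendResp(out chan<- resp, r resp, cancel, quit <-chan struct{}) error {
	select {
	case out <- r:
		return nil
	case <-cancel:
		return errors.New("job cancelled")
	case <-quit:
		return errors.New("signer shutting down")
	}
}

// recvResp mirrors the same guard on the receiving (lnd) side.
func recvResp(in <-chan resp, cancel, quit <-chan struct{}) (resp, error) {
	select {
	case r := <-in:
		return r, nil
	case <-cancel:
		return resp{}, errors.New("job cancelled")
	case <-quit:
		return resp{}, errors.New("shutting down")
	}
}

func main() {
	out := make(chan resp, 1)
	cancel := make(chan struct{})
	quit := make(chan struct{})

	_ = sendResp(out, resp{}, cancel, quit)
	r, err := recvResp(out, cancel, quit)
	fmt.Println(r, err)
}
```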
The receive should have a sig job? Is the usage uniform (quit/timeout flow)?
@jharveyb will provide a context update
> Perhaps this is actually the issue: https://github.com/lightningnetwork/lnd/blob/e78b0bb8532dd181cfa112790113335d65937e37/lnwallet/channel.go#L4701-L4706
>
> It's gated on lack of aux signer there, instead of blob and aux signer.

I think if the `Blob` is `None`, the `tapd` behavior is safe; the sig job resp would have a `None` `SigBlob`, so later on, when the sigs are sent back to `tapd` to be packed into a tlv `Blob`, the signature would be skipped. Though that is also the point where we are blocking:
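To illustrate that skip-on-`None` behavior (a toy option shape; the real code uses lnd's `fn.Option` and tlv records, and these names are made up):

```go
package main

import "fmt"

// sigResp is a toy stand-in for a sig job response whose blob may be empty.
type sigResp struct {
	sigBlob []byte // nil plays the role of a None SigBlob
}

// packBlobs keeps only responses that actually carry a blob, mirroring how
// a None SigBlob would simply be skipped when packing the tlv blob.
func packBlobs(resps []sigResp) [][]byte {
	var packed [][]byte
	for _, r := range resps {
		if r.sigBlob == nil {
			continue // no aux sig for this HTLC; skip it
		}
		packed = append(packed, r.sigBlob)
	}
	return packed
}

func main() {
	resps := []sigResp{{sigBlob: nil}, {sigBlob: []byte{0x01}}}
	fmt.Println(len(packBlobs(resps))) // 1: the None blob was skipped
}
```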
> Re cancel chan, we pass one in here: https://github.com/lightningnetwork/lnd/blob/e78b0bb8532dd181cfa112790113335d65937e37/lnwallet/aux_signer.go#L74 but then don't select on it for the send in tapd, nor the recv in lnd

I agree that this seems like the most likely root cause. I've been comparing the behavior for aux sigs against what happens for HTLC sigs. AFAICT, the HTLC sig jobs are made in the same order as the aux sigs, and then submitted for processing here:
Those jobs are processed here:

Where we do skip a job if the cancel channel was closed, and skip all jobs if `quit` is received by that worker (not sure where that signal would come from).
Job submission also exits early on cancel or quit:
It seems like jobs would be cancelled by the submitter, in case of other errors:
And that cancel channel is shared amongst all jobs.
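The overall shape of that HTLC sig pipeline, paraphrased (this is a toy, not the actual `lnwallet` sig pool code): a batch-wide cancel channel that skips individual jobs, and a quit channel that stops the worker outright:

```go
package main

import "fmt"

type sigJob struct {
	id     int
	cancel <-chan struct{} // shared across the whole batch
	resp   chan<- int
}

// worker processes jobs one at a time: a closed cancel channel skips the
// current job, a closed quit channel stops the worker for good.
func worker(jobs <-chan sigJob, quit <-chan struct{}) {
	for {
		select {
		case job := <-jobs:
			select {
			case <-job.cancel:
				continue // batch was cancelled; skip this job
			case <-quit:
				return
			default:
			}
			job.resp <- job.id // stand-in for actually signing
		case <-quit:
			return
		}
	}
}

func main() {
	jobs := make(chan sigJob)
	quit := make(chan struct{})
	cancel := make(chan struct{})
	resp := make(chan int, 1)

	go worker(jobs, quit)
	jobs <- sigJob{id: 7, cancel: cancel, resp: resp}
	fmt.Println(<-resp)
	close(quit)
}
```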
For the aux sigs, the flow is different in a few ways. Firstly, the job submission func returns immediately:

Within the job processing goroutine, we only check for `Quit` once, instead of at every point where we try to send on a channel:

And we never check the job's cancel channel. So updating the `tapd` side to catch those signals seems like a good first fix.
I think the proper handling of `Quit` should be unit testable from the `tapd` repo.
Another bit I'm confused about is the sorting of sig jobs. AFAICT they are generated in order of (incoming HTLCs, outbound HTLCs), submitted to other goroutines, sorted in-place by output index, and then waited on.
I think the sorting after submission could cause extra problems with signal handling. Example:

- The first job submitted in `auxSigBatch` will be the last job in that slice after BIP 69 sorting.
- `tapd` receives a `Quit` signal while processing that first job; a cancel signal is sent, along with a sig job resp carrying an error.
- `lnd` is now waiting on a response for a job that will never get processed and hangs. The error sent by `tapd` will also never be received.

I think adding the cancel and quit handling on both sides would address this, but I'm confused as to why we don't sort the jobs right after generation so that the wait order matches the processing order.
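A sketch of that reordering (illustrative job type; the real batch is `lnwallet.AuxSigJob` sorted by output index): sort right after generation, before submission, so the order we wait in matches the order the signer processes in:

```go
package main

import (
	"fmt"
	"sort"
)

type auxSigJob struct {
	outputIndex int
	resp        chan struct{}
}

func main() {
	// Jobs generated in (incoming, outgoing) HTLC order.
	jobs := []auxSigJob{
		{outputIndex: 2, resp: make(chan struct{}, 1)},
		{outputIndex: 0, resp: make(chan struct{}, 1)},
		{outputIndex: 1, resp: make(chan struct{}, 1)},
	}

	// Sort BEFORE submitting, so the order we later wait in is the same
	// order the signer processes in.
	sort.Slice(jobs, func(i, j int) bool {
		return jobs[i].outputIndex < jobs[j].outputIndex
	})

	submit(jobs) // hand the already-sorted batch to the signer

	// Waiting in slice order now matches processing order: if the first
	// job errors and the rest are cancelled, we hear about it first.
	for _, j := range jobs {
		<-j.resp
	}
	fmt.Println("all responses received in processing order")
}

// submit stands in for sending the batch to the aux signer, which works
// through it in slice order.
func submit(jobs []auxSigJob) {
	go func() {
		for _, j := range jobs {
			j.resp <- struct{}{}
		}
	}()
}
```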
This is great debugging! Should we resolve the job sorting question next?
Ok, following on from review in #1118: the job cancel channel can end up closed from both `tapd` and `lnd`, so this requires changes in both systems to make this safe. My current change is a working fix, but I think there are some better options:
1. Use a `context.Context` with a `CancelFunc`. Under the hood, this is a channel guarded by a mutex, which ensures the cancel func has a side effect exactly once. A bit heavy-handed here, as we don't need any context inheritance.
2. Use a `sync.Once` to close the channel. This will save us from a possible double close, without adding the overhead of the atomic `Load()` that's used in the `context` package.
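A minimal sketch of that second option (hypothetical wrapper type; the real change would land on `lnwallet.AuxSigJob`/`lnwallet.SignJob`):

```go
package main

import (
	"fmt"
	"sync"
)

// cancelChan pairs a channel with a sync.Once so Cancel is safe to call
// any number of times, from any goroutine.
type cancelChan struct {
	c    chan struct{}
	once sync.Once
}

func newCancelChan() *cancelChan {
	return &cancelChan{c: make(chan struct{})}
}

// Cancel closes the channel exactly once; later calls are no-ops.
func (cc *cancelChan) Cancel() {
	cc.once.Do(func() { close(cc.c) })
}

// Done exposes the channel for use in select statements.
func (cc *cancelChan) Done() <-chan struct{} {
	return cc.c
}

func main() {
	cc := newCancelChan()
	cc.Cancel()
	cc.Cancel() // would panic with a bare close(cc.c); harmless here

	select {
	case <-cc.Done():
		fmt.Println("cancelled")
	default:
		fmt.Println("still live")
	}
}
```

Either side can then call `Cancel` without coordinating who closes first.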
To pursue that second option, we'll need a follow-up PR in `lnd` to change the types `lnwallet.AuxSigJob` and `lnwallet.SignJob`. That can be followed by a PR in `tapd` to use the new job type.
I'll add these changes to https://github.com/lightningnetwork/lnd/pull/9074 and then #1118.
@jharveyb looking at this more closely, I don't think we need either of those solutions, as the validator is never expected to close that channel. See https://github.com/lightningnetwork/lnd/pull/9074#discussion_r1754256807.
Writing up the offline discussion: synced with Oli, and I agree that we can solve this on the `tapd` side by:

- Skipping all pending jobs if any job errs.
- Selecting on the `cancel` channel.

Then for `lnd`: will update the existing PRs in-place.
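A sketch of that `tapd`-side skip behavior (illustrative shape, not the actual `aux_leaf_signer.go` code): once one job in the batch errs, fire the shared cancel channel and skip everything still pending:

```go
package main

import (
	"errors"
	"fmt"
)

type job struct {
	id   int
	resp chan error
}

// processBatch signs jobs in order. On the first error it fires the shared
// cancel channel and skips every job still pending instead of hanging on
// their sends.
func processBatch(jobs []job, cancel chan struct{}, sign func(job) error) {
	for _, j := range jobs {
		select {
		case <-cancel:
			continue // batch already cancelled; skip this job
		default:
		}

		if err := sign(j); err != nil {
			close(cancel) // tell everyone else to stand down
			j.resp <- err
			continue
		}
		j.resp <- nil
	}
}

func main() {
	cancel := make(chan struct{})
	jobs := []job{
		{id: 0, resp: make(chan error, 1)},
		{id: 1, resp: make(chan error, 1)},
	}

	// Fail the first job; the second is then skipped.
	processBatch(jobs, cancel, func(j job) error {
		if j.id == 0 {
			return errors.New("sign failed")
		}
		return nil
	})

	fmt.Println(<-jobs[0].resp) // sign failed
	select {
	case err := <-jobs[1].resp:
		fmt.Println("job 1 resp:", err)
	default:
		fmt.Println("job 1 skipped, no response sent")
	}
}
```

Note the skipped job never receives a response, which is exactly why `lnd` also needs to select on the cancel channel while waiting.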
Given the above, do we have an answer to this question: https://github.com/lightninglabs/taproot-assets/pull/1118#pullrequestreview-2291168789?
Also re context cancel, looks to be thread safe: https://cs.opensource.google/go/go/+/refs/tags/go1.23.1:src/context/context.go;l=665-677
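For reference, the cancel func returned by `context.WithCancel` is documented to be safe for concurrent use, and calling it more than once is a no-op after the first call:

```go
package main

import (
	"context"
	"fmt"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	cancel()
	cancel() // safe: cancellation only takes effect once

	<-ctx.Done()
	fmt.Println(ctx.Err()) // context canceled
}
```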
> Given the above, do we have an answer to this question: #1118 (review)?
From the stack trace, is it certain that `tapd` was hanging on that send, or just that that's where that goroutine was during shutdown?
Background
A user has noticed that occasionally the daemon may freeze up when signing second-level HTLCs.
We were able to obtain a trace confirming that the daemon was indeed deadlocked, with `lnd` waiting for a channel response from `tapd`, while `tapd` is attempting to send a response to `lnd`.

Here's a diagram from Claude based on the trace above:
From the above, we can see that we have a circular waiting dependency.
As is, the channel created for the transfer of sigs is always buffered: https://github.com/lightningnetwork/lnd/blob/e78b0bb8532dd181cfa112790113335d65937e37/lnwallet/aux_signer.go#L58-L76
We're then blocking here in `tapd`: https://github.com/lightninglabs/taproot-assets/blob/0551a3f6c147c085b1e520ae192806f15873e24f/tapchannel/aux_leaf_signer.go#L265-L273

One thing to note is that in `tapd`, this is the send when there's no aux blob for a channel.

Expected behavior
Buffered channel send never blocks.
Actual behavior
Buffered channel send blocks. Potentially there's some underlying mutation here.
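For context on why the expected invariant can fail (a toy demo, unrelated to the actual lnd/tapd code): a buffered send still blocks once the buffer is full, and a send on a nil channel blocks forever, which is why an unexpected second send, or a mutation of the channel variable itself, is the thing to look for:

```go
package main

import (
	"fmt"
	"time"
)

// trySend attempts a send with a short timeout so we can observe whether
// the send would have blocked.
func trySend(ch chan int, v int) bool {
	select {
	case ch <- v:
		return true
	case <-time.After(10 * time.Millisecond):
		return false // would have blocked
	}
}

func main() {
	ch := make(chan int, 1)

	fmt.Println(trySend(ch, 1)) // true: buffer has room
	fmt.Println(trySend(ch, 2)) // false: buffer full, send blocks

	var nilCh chan int
	fmt.Println(trySend(nilCh, 3)) // false: a send on a nil channel blocks forever
}
```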