lightningnetwork / lnd

Lightning Network Daemon ⚡️

[bug]: Potential Deadlock in HodlInvoice logic #8803

Open ziggie1984 opened 1 month ago

ziggie1984 commented 1 month ago

Version: LND 17.3

Received a goroutine dump where the invoice system was in a deadlock. I could not reproduce it locally and could not come up with a reasonable explanation for why the hodlqueue.ChanIn() is not available.

Let's first look at the goroutine dump:

goroutine 42960673 [select, 246 minutes]: 1 times: [[42960673]
github.com/lightningnetwork/lnd/invoices.(*InvoiceRegistry).notifyHodlSubscribers(0xc000364240, {0x19b18e0, 0xc073120620})
        github.com/lightningnetwork/lnd/invoices/invoiceregistry.go:1714 +0x271
github.com/lightningnetwork/lnd/invoices.(*InvoiceRegistry).cancelInvoiceImpl(0xc000364240, {0x71, 0xc8, 0x51, 0xbd, 0x11, 0x3a, 0x5e, 0x63, 0xbc, ...}, ...)
        github.com/lightningnetwork/lnd/invoices/invoiceregistry.go:1371 +0x5ac
github.com/lightningnetwork/lnd/invoices.(*InvoiceRegistry).CancelInvoice(...)
        github.com/lightningnetwork/lnd/invoices/invoiceregistry.go:1293
github.com/lightningnetwork/lnd/lnrpc/invoicesrpc.(*Server).CancelInvoice(0xc0073d8370, {0x280?, 0xc01daf56f8?}, 0x1189079?)
        github.com/lightningnetwork/lnd/lnrpc/invoicesrpc/invoices_server.go:309 +0x96
github.com/lightningnetwork/lnd/lnrpc/invoicesrpc._Invoices_CancelInvoice_Handler.func1({0x19c15f8?, 0xc0576bc420?}, {0x154f260?, 0xc02c290c40?})
        github.com/lightningnetwork/lnd/lnrpc/invoicesrpc/invoices_grpc.pb.go:207 +0xcb
...

This is the call that initiates the deadlock: notifyHodlSubscribers cannot write into the hodlqueue channel of the specific channel link, so it holds the hodlSubscriptionsMux forever.

Moreover, cancelInvoiceImpl locks the InvoiceRegistry mutex as well, so both mutexes are held at the same time.

https://github.com/lightningnetwork/lnd/blob/13aa7f99248c7ee63989d3b62e0cbfe86d7b0964/invoices/invoiceregistry.go#L1713-L1724

So somehow the hodlqueue channel is never read, which can only happen if we somehow stopped the concurrent queue; otherwise the incoming channel is always read in a loop:

https://github.com/lightningnetwork/lnd/blob/613bfc07fb664086e78e8c58c9e1c8785e4656d9/queue/queue.go#L62-L97
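To make the failure mode concrete, here is a minimal, self-contained sketch of the pattern (illustrative names only, not the lnd types): a notifier that performs a blocking send into a subscriber channel while holding a mutex hangs forever once the subscriber's reader is gone, and every later caller that needs the same mutex queues up behind it.

package main

import (
    "fmt"
    "sync"
    "time"
)

// registry mimics the shape of the problem: notify holds mu while
// performing a blocking send into each subscriber channel.
type registry struct {
    mu   sync.Mutex
    subs []chan string
}

func (r *registry) notify(event string) {
    r.mu.Lock()
    defer r.mu.Unlock()
    for _, ch := range r.subs {
        ch <- event // blocks forever if the subscriber stopped reading
    }
}

func main() {
    r := &registry{}

    sub := make(chan string) // unbuffered, like an unread queue input
    r.mu.Lock()
    r.subs = append(r.subs, sub)
    r.mu.Unlock()

    // The "link" that should drain sub has already shut down, so no
    // reader exists. notify now blocks inside the critical section.
    go r.notify("cancel-invoice")

    time.Sleep(100 * time.Millisecond)

    // Any other operation that needs the mutex is stuck behind it.
    done := make(chan struct{})
    go func() {
        r.mu.Lock() // never acquired
        r.mu.Unlock()
        close(done)
    }()

    select {
    case <-done:
        fmt.Println("no deadlock")
    case <-time.After(time.Second):
        fmt.Println("deadlocked: mutex held by blocked notifier")
    }
}

Running this prints "deadlocked: mutex held by blocked notifier", which is exactly the shape of the goroutine dump above: a send parked inside the hodlSubscriptionsMux critical section.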

So I am waiting for additional goroutine dumps from the node runner to verify this behaviour and investigate further.

ziggie1984 commented 1 month ago

Goroutine Dump: lnd_issue_8803_goroutine_dump_1.txt

Roasbeef commented 1 month ago

Is this present in 0.18?

ziggie1984 commented 1 month ago

This node runner still runs 17.3 and will update soon to 18.0, so we don't know yet, but he will provide the necessary information.

ziggie1984 commented 1 month ago

Analysing further, I think we have a race condition which we need to guard against:

In the peerTerminationWatcher we remove the link, hence stopping the channel link (RemoveLink), without making sure that it is not in the process of removing an HTLC (a hodl HTLC, for example):

https://github.com/lightningnetwork/lnd/blob/a832371d61fc566484e4f4ffe64dd37d54aee597/server.go#L4081-L4083

Before that, we should probably make sure that the htlcManager is shut down correctly.

https://github.com/lightningnetwork/lnd/commit/ec55831229a2f1699422e97a9f43e0975bdd49d9#diff-9926ec0b86d09ad93deb44807ccee9a52ed42535007f96ae497478c07b76843aL1449-L1466

The above code changed with Release 18, so this bug might not be part of the code base anymore.

ziggie1984 commented 1 month ago

Though I don't see that we make sure our htlcManager is not in the process of handling a msg (and hence accepting a new HTLC) when shutting down the link, which will switch off the ConcurrentQueue.

I think we need to make sure the htlcManager for a channel link is in a clean state before stopping the channel link.

Maybe we need to signal the quit channel before stopping the hodlqueue (and wait until the htlcManager goroutine is stopped) so we don't run into this race condition.

https://github.com/lightningnetwork/lnd/blob/a832371d61fc566484e4f4ffe64dd37d54aee597/htlcswitch/link.go#L1443-L1445

https://github.com/lightningnetwork/lnd/blob/a832371d61fc566484e4f4ffe64dd37d54aee597/htlcswitch/link.go#L572-L574

Otherwise there might be a race condition where we try to process an HTLC, the peer suddenly disconnects, and we stop the hodlqueue, never being able to read from hodlqueue.ChanIn().
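As a rough sketch of that ordering, assuming the Start/Stop/ChanOut API of lnd's queue.ConcurrentQueue and using made-up link/htlcManager stand-ins (the real channelLink is far more involved): signal quit first, wait for the consumer goroutine to return, and only then stop the queue. Note this alone does not protect external writers such as the invoice registry; they also need to observe the quit signal, which is the subject of the discussion below.

package linksketch

import (
    "sync"

    "github.com/lightningnetwork/lnd/queue"
)

// link is a stripped-down stand-in for channelLink; quit, wg and
// hodlQueue mirror the real fields only in spirit.
type link struct {
    quit      chan struct{}
    wg        sync.WaitGroup
    hodlQueue *queue.ConcurrentQueue
}

func newLink() *link {
    return &link{
        quit:      make(chan struct{}),
        hodlQueue: queue.NewConcurrentQueue(10),
    }
}

func (l *link) Start() {
    l.hodlQueue.Start()
    l.wg.Add(1)
    go l.htlcManager()
}

// htlcManager drains the hodl queue until quit is signalled.
func (l *link) htlcManager() {
    defer l.wg.Done()
    for {
        select {
        case hodlItem := <-l.hodlQueue.ChanOut():
            _ = hodlItem // process the hodl resolution here
        case <-l.quit:
            return
        }
    }
}

// Stop enforces the ordering suggested above: signal quit first, wait
// for htlcManager to actually return, and only then stop the queue.
func (l *link) Stop() {
    close(l.quit)
    l.wg.Wait()
    l.hodlQueue.Stop()
}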

what do you think @ProofOfKeags ?

We might potentially reintroduce the l.shutdownRequest channel to trigger a clean shutdown of the htlcManager?

ProofOfKeags commented 1 month ago

Oof. Yeah if there is any necessary cleanup that the channelLink has to perform then we should be doing that in the post loop section of the htlcManager thread. I would need to take a closer look at how the hodlQueue works. I'm somewhat concerned that we do direct channel exposure via NotifyExitHopHtlc.

From what I can tell though, I can see a bit of an architectural crack wherein we leak out the hodlQueue go-channel all the way into the invoice registry. The problem with that is that this is essentially a dangling reference that we cannot detect has been destroyed in that context.

I think the "correct" fix of this will have to involve either preventing that go-channel from leaking directly, or also attaching with it some mechanism for being able to tell when the resource dies (a copy of the link's quit channel).

ziggie1984 commented 4 weeks ago

Problem confirmed with LND 18.0, working on a fix. lnd_issue_8803_goroutine_dump_4_lnd_18.txt

ziggie1984 commented 4 weeks ago

I think the "correct" fix of this will have to involve either preventing that go-channel from leaking directly, or also attaching with it some mechanism for being able to tell when the resource dies (a copy of the link's quit channel).

Thank you for the ideas. I will try the first design, just adding the link's quit channel to the subscriber, but maybe we need to really isolate the concurrent queue with Queue and Dequeue methods which guard reads and writes against the above behaviour. Though I think it might be more work because we use the concurrent queue not just for hodl invoices; let's discuss based on the first draft.

ProofOfKeags commented 4 weeks ago

but maybe we need to really isolate the concurrent queue with Queue and Dequeue methods which guard reads and writes against the above behaviour. Though I think it might be more work because we use the concurrent queue not just for hodl invoices; let's discuss based on the first draft

Big Concept ACK here. I think the point here is that reading from a pipe whose writers have exited, or writing to a pipe whose readers have exited, should (ideally) not deadlock but cause some sort of other "exception" on the side that's still live. If you can solve this at the level of the concurrent queue itself, that would be a huge win.
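A rough sketch of what that could look like at the queue level, with a toy stop-aware queue rather than a patch to lnd's queue package: Enqueue and Dequeue select on the queue's own quit channel, so touching a stopped queue returns an error on the live side instead of deadlocking.

package safequeue

import (
    "errors"
    "sync"
)

// ErrQueueStopped is returned instead of blocking once Stop has run.
var ErrQueueStopped = errors.New("queue stopped")

// SafeQueue is a toy stop-aware queue: Enqueue and Dequeue select on
// the quit channel, so neither side can deadlock on a dead peer.
type SafeQueue struct {
    items chan interface{}
    quit  chan struct{}
    once  sync.Once
}

func New(size int) *SafeQueue {
    return &SafeQueue{
        items: make(chan interface{}, size),
        quit:  make(chan struct{}),
    }
}

func (q *SafeQueue) Enqueue(item interface{}) error {
    select {
    case q.items <- item:
        return nil
    case <-q.quit:
        return ErrQueueStopped
    }
}

func (q *SafeQueue) Dequeue() (interface{}, error) {
    select {
    case item := <-q.items:
        return item, nil
    case <-q.quit:
        return nil, ErrQueueStopped
    }
}

func (q *SafeQueue) Stop() {
    q.once.Do(func() { close(q.quit) })
}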

Roasbeef commented 4 weeks ago

Also see this issue, back when I fixed a concurrency issue here, dug deeper, and saw many more: https://github.com/lightningnetwork/lnd/issues/8023

ProofOfKeags commented 3 weeks ago

In the process of investigating some of the changes I've been wanting to make it occurs to me that the existence of the hodlQueue within the link itself is another symptom of a fundamental architectural crack I noticed a few months ago. I think patching over it to hunt down this deadlock is fine but I want to take a moment to further educate on the issue.

The following is my opinion but it will be presented as fact so as to make as direct of a case as possible.


The main problem that sits at the bottom of many of the issues we are encountering is an improper division of responsibility between the link and the switch.

Right now we do exit hop processing in the link itself as opposed to the switch. This is a mistake for a number of reasons. First and foremost it means we have to duplicate traffic controls between the links and the switch. Further it requires invoice registry references in every channel link which will cause synchronization issues to proliferate.

Instead, what we should be doing is: as soon as we have received a revocation from our counterparty, take the batch of remote updates, package them up, and hand them to the switch. From there the switch can make the determination as to what to do with it. If we are the recipient and it is not a hodl invoice, we would immediately drop an htlc packet back into the link and be on our merry way. If it is a hodl invoice, the switch can await the confirming signal from the invoice registry before doing so.
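Purely to make the proposed flow concrete, a heavily simplified sketch with entirely hypothetical types (this is not lnd's htlcswitch API): the link hands the revoked batch to the switch, and only the switch decides whether to forward, settle immediately, or wait on the invoice registry for a hodl invoice.

package switchsketch

// htlcUpdate and resolution are placeholder types for illustration only.
type htlcUpdate struct {
    exitHop bool // we are the final recipient
    hodl    bool // settlement is deferred to the invoice registry
}

type resolution struct{ settle bool }

// registry stands in for the invoice registry: for a hodl invoice the
// decision arrives later on a channel instead of being made inline.
type registry struct{}

func (registry) waitForResolution(htlcUpdate) <-chan resolution {
    ch := make(chan resolution, 1)
    ch <- resolution{settle: true} // placeholder decision
    return ch
}

// processRemoteBatch is what the switch would do with the batch of
// remote updates handed over by the link after a revocation.
func processRemoteBatch(reg registry, updates []htlcUpdate,
    sendToLink func(resolution), forward func(htlcUpdate)) {

    for _, upd := range updates {
        switch {
        case !upd.exitHop:
            forward(upd)

        case !upd.hodl:
            // Normal invoice: settle right away.
            sendToLink(resolution{settle: true})

        default:
            // Hodl invoice: the switch, not the link, waits for
            // the registry, so the link never blocks on it.
            go func(u htlcUpdate) {
                sendToLink(<-reg.waitForResolution(u))
            }(upd)
        }
    }
}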

There are many more problems that this improper division of responsibility creates, but I will save those for a different forum. Changing this architectural issue is a considerable effort and not one we should undertake to solve immediate production issues, so in my view we should fix it in whatever way can work.

I include this detail because I will be making an ongoing case that we need to change the responsibility division between the link and the switch and I keep finding odd side-effects of the current architecture and want to document them.

CC @Roasbeef @yyforyongyu @saubyk