lightningnetwork / lnd

Lightning Network Daemon ⚡️
MIT License
7.65k stars 2.07k forks source link

[bug]: Interop: INVALID_ONION_HMAC #8785

Closed JssDWt closed 3 months ago

JssDWt commented 4 months ago

Background

There is an interop issue with Core lightning that causes INVALID_ONION_HMAC errors. It consistently reproduces with the following setup:

LND sender -> LND routing node -> CLN routing node -> CLN recipient

It does NOT reproduce with

Your environment

Steps to reproduce

Create 4 lightning nodes with channels between them:

Send a payment from the sender to the recipient over the two routing nodes.

Expected behaviour

The payment should succeed

Actual behaviour

The payment fails with INVALID_ONION_HMAC from index 1.

testing setup

logs

lnd-lnd-sender.log lnd-lnd-router.log lightningd-cln-router.log lightningd-cln-recipient.log

CLN issue

https://github.com/ElementsProject/lightning/issues/7347

carlaKC commented 4 months ago

Are you able to hit this outside of your testing infrastructure? Ran a quick test this morning with LND -- LND -- CLN -(private)- CLN on the specified versions and I'm unable to reproduce this failure.

kingonly commented 4 months ago

@carlaKC this is an issue we've seen in production inconsistently with a few customers and we were to reproduce consistently in a testing environment.

darioAnongba commented 4 months ago

Hi all, we're the ones getting the bug in production in case I can be of any help. This issue is preventing affected nodes running on Greenlight infrastructure to receive payments from Alby or Blink (only ones tested so far).

carlaKC commented 4 months ago

@carlaKC this is an issue we've seen in production inconsistently with a few customers and we were to reproduce consistently in a testing environment.

Very helpful to be able to consistently reproduce it! I'm just wondering what the difference is between the test setup and the quick test I ran manually - that'll be very helpful to try pin-point where the issue is.

JssDWt commented 4 months ago

I'm just wondering what the difference is between the test setup and the quick test I ran manually

I agree! The logs from a test are supplied at the bottom of the issue description with trace level from lnd. Maybe differences can be seen in there. If there's any other information that can be helpful, or running tests with specific settings, let me know.

ellemouton commented 4 months ago

have reproduced 👍

rustyrussell commented 4 months ago

@ellemouton Need a build with more data? I can have it spit out all the gory details to stderr....

We will give this error if they try to send a legacy onion, but nobody does that any more. Surely?

rustyrussell commented 4 months ago

Indeed, LND log 2915:

   Payload: ([]uint8) (len=33 cap=64) {
    00000000  00 26 28 0f 74 eb c5 87  c8 00 00 00 00 00 00 27  |.&(.t..........'|
    00000010  10 00 00 00 88 00 00 00  00 00 00 00 00 00 00 00  |................|
    00000020  00                                                |.|

See that 00 at the front of the payload? That's legacy, which we removed support for in 23.05 :grimacing: (Wait, no, 22.11!)

Roasbeef commented 4 months ago

Weird, we'll only try to send a legacy onion if either the hop hints sent over don't have the proper feature bits set, or we never got a node announcement from the destination node (so we don't know their feature bit set).

Roasbeef commented 4 months ago

Ah, I think what's happening is that if CLN isn't setting the bit at all (we still set it, but have it marked as required), then we'd assume that the peer is super old and doesn't understand the new onion format.

JssDWt commented 4 months ago

@Roasbeef Can you elaborate which feature bits specifically?

rustyrussell commented 4 months ago

Apparently (according to @ellemouton) the node hasn't seen the CLN node announcement, and is assuming the worst. FWIW we stopped generating them in our 2022-04 release, six months before we stopped accepting them.

You should probably just rip out the code which generates the old-style legacy onions altogether. For us it was a nice simplification...

Roasbeef commented 4 months ago

@JssDWt the TLV onion feature bit.

What I think is happening here is CLN not setting the bit at all (or we don't have a node announcement, so we assume they don't support it). Newer lnd clients (0.18 and forward) will start to set the bit to required. Eventually we'll also just stop setting it all together, but we can't do that yet, as older clients (pre 0.18) will use the feature bit to decide if they can do things like MPP or not.

We can also just assume it's active (for 0.18), but as mentioned that won't help older clients (who still look at the bit).

ellemouton commented 4 months ago

CLN are setting the bit (it is set in the invoice) but the main issue is that the LND nodes in the setup are not seeing the CLN node announcement. They do get the channel announcement though & so know of the node but then assume a zero length feature bit vector. So that is the main issue.

The second issue (which I realised will only actually matter in the case where an invoice has chained hop hints) is that we (LND) assume that hop hints have zero length feature bit vectors (ie, we dont see them as having the TLV bit set and so we would use legacy encoding for these hops).