Don't shutdown lit on accounts service critical error, and register to status server

ViktorTigerstrom commented 9 months ago

This PR is based on #541

In this PR we ensure that if the accounts service critically errors, we won't shutdown LiT, but instead log the error and reject any new requests to the accounts service.

We also add the accounts service to the status manager, and use the status from the status manager to determine if we should allow gRPC requests for the accounts service through or not.

Finally, we also add a config option to allow users to disable the accounts service, which ensures that the accounts service isn't started if the config option is set to disable.

TODO:

[x] add tests

ViktorTigerstrom commented 9 months ago

Thanks for the review @ellemouton! Posting a general comment here to respond what much of the feedback is related to :).

The main issues I see with just having one outer check that checks if a requests should be allowed through or not based on if the service is running or not, and then not checking in any of the internal functions in the server itself, is that:

The internal functions in the accounts system is not only called through gRPC requests, but we are also subscribed to the lndclient.LightningClient which will execute invoiceUpdate when an accounts invoice is added or paid, and lndclient.RouterClient which will execute paymentUpdate when a tracked payment is updated. It's important that these functions won't get executed if the service has already been disabled, especially for invoiceUpdate as we'd then risk that the currentAddIndex & currentSettleIndex goes out of sync leading to missed added or settled invoices.
The service can shutdown during the time a request that has been allowed through is actually executed. I.e. if an account makes a request to send a payment, the service can be Running when the actual request is done, but when the payment has been sent and we handle the SendPaymentV2 response in TrackPayment, the service might have been shutdown through for example an invoice update that came in while the users request was processed.

This is also why it's very important that we often disable the service under the same lock in the function execution that actually caused the error. Imagine of the following scenario: We get 2 invoice settle updates rapidly that settles two different invoices in a very short timespan. Now if the first invoice update acquires the the lock, and the second update attempts to grab the lock in another thread. It's in that scenario it's super important that the we disable the service under the same lock as the update is attempted for the first invoice, if the update fails. If we didn't and dropped the lock, and reacquired it to disable the service, we'd risk that the second invoice update grabs the lock before we've managed to disable the service. If the second invoice update then succeeds, we'd end-up with the currentSettleIndex out of sync, which would mean that we'd never credit the account for the first invoice update. So if we remove the requireRunning function, we need to add a IsRunning function version that doesn't also acquire the lock, i.e. a UnsafeIsRunning is running function.

Now it's most clear why we'd need to check if the service is already running before we execute the invoiceUpdate function, and we may not need to do that for any other function. But I didn't really know where to draw the line of when we'd require that the service is running if we executed the function, so I settled for functions that update the balance of an account. This is because it feels strange that in a scenario that the service was disabled while send payment request by an account was executed, but before the SendPaymentV2 response from LND was received, that we'd still successfully the handle the payment update. Meaning we'd successfully execute TrackPayment which adds the subscription to the lndclient.RouterClient and which in turn would then execute paymentUpdate once the payment has been updated. It just feels strange that all of those call would successfully execute while the service was disabled, which is why I drew the line of requiring that the service is running for any function that updates the user balance.

ellemouton commented 9 months ago

ok cool! thanks for the explanation @ViktorTigerstrom - that makes sense. I didnt think about point 1.

ViktorTigerstrom commented 9 months ago

Thanks a lot for the reviews @ellemouton & @guggero! Working on addressing them.

Though I'd like some more feedback on how we should surface the errors to the status server, so would just like to get consensus on that. Thanks for the suggestion on using a callback instead @guggero, will definitely change to that! Though would like to know if we should introduce the sub system server like @ellemouton suggested, to make it more generalizable for future sub systems we might add, or if we should leave that for now and do that later if needed? I think we will still need the internal RequireRunning checks either way though.

ellemouton commented 9 months ago

would like to know if we should introduce the sub system server like @ellemouton suggested, to make it more generalizable for future sub systems we might add, or if we should leave that for now and do that later if needed?

I'd say that if we are on a bit of a time limit with this rn, then we can worry about making things more generalisable in a follow up 👍 Since it sounds like that would take some design work

guggero commented 9 months ago

would like to know if we should introduce the sub system server like @ellemouton suggested, to make it more generalizable for future sub systems we might add, or if we should leave that for now and do that later if needed?

I'd say that if we are on a bit of a time limit with this rn, then we can worry about making things more generalisable in a follow up 👍 Since it sounds like that would take some design work

Was about to suggest the same! Even though turning everything into subservers sounds great, might be overkill for this PR. So let's see how the diff looks like with just the callback for now.

ViktorTigerstrom commented 9 months ago

Thanks! Will leave it out for now then, and then we can address it in a follow-up later :)

ViktorTigerstrom commented 9 months ago

Addressed the latest feedback, rebased on main now that the status server PR has been merged, and added unit tests!

ViktorTigerstrom commented 9 months ago

Hmm sorry, looking into to addressing the unit-race errors.

In terms of itests, I'm looking into adding an itest that triggers the mainErrCallback to be called, to shutdown the accounts service while not disabled. Though unfortunately I think it's going to be a bit hard as the litd process is started from binary and not through the struct. So in case I can't find any, I'll just add a disabled through configuration itest.

check commits is expected to fail until the commits are squashed.

ViktorTigerstrom commented 9 months ago

Fixed CI errors! check commits is expected to fail until the PR has been squashed.

ViktorTigerstrom commented 9 months ago

Updated the itests to test that we can disable the accounts system through configuration. Also updated the commit that adds that config option a little, as I noticed some bugs in it when making the itest.

ViktorTigerstrom commented 9 months ago

Thanks for the review at @guggero! Addressed the feedback and squashed the commits :)!

Snap though, I notice that the check commits check still fails, so will address that now.

ViktorTigerstrom commented 9 months ago

Fixed check commits CI failure :)

ViktorTigerstrom commented 9 months ago

Thanks @guggero 🎉!! Addressed you're latest feedback with the last push :)

ViktorTigerstrom commented 9 months ago

Added a small fix that ensures that the node runner can't stop the accounts service while a request by an account user is being processed. This is especially important to ensure that we don't stop the service exactly after a user has made an rpc call to send a payment we can't know the payment hash for prior to the actual payment being sent (i.e. Keysend or SendToRoute). This is because if we stop the service after the send request has been sent to lnd, but before TrackPayment has been called, we won't be able to track the payment and debit the account.

ViktorTigerstrom commented 9 months ago

Thanks a lot for the the review @ellemouton 🎉 🚀!!

Addressed the latest feedback, and left some comments for some of the comments which weren't addressed :)

lightninglabs / lightning-terminal

Don't shutdown lit on accounts service critical error, and register to status server #642