lightningdevkit / rust-lightning

A highly modular Bitcoin Lightning library written in Rust. It's rust-lightning, not Rusty's Lightning!

Don't panic when fallen behind #2325

Open TonyGiorgio opened 1 year ago

TonyGiorgio commented 1 year ago

This panic has some suggestions in its message, but panicking makes it impossible to act on any of them. It should probably either handle what it recommends in the logs itself, or at least log and disconnect the peer so that a user can actually force close as it suggests.

https://github.com/lightningdevkit/rust-lightning/blob/4dce209e972a7aeb18d99122ca293e6a272844b1/lightning/src/ln/channel.rs#L4112-L4120
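
For reference, the linked code reads roughly like this - a paraphrase, not the verbatim lines; `our_commitment_number` stands in for the channel's local commitment counter:

```rust
// Paraphrased sketch of the linked channel_reestablish handling (not the
// verbatim lines): when the peer's reestablish message proves our local
// commitment state is stale, LDK logs recovery advice and then panics
// instead of surfacing an error the application could act on.
if msg.next_remote_commitment_number > our_commitment_number {
    log_error!(logger,
        "We have fallen behind - the counterparty has proven our state is \
         stale. Restart without connecting to peers and force-close (without \
         broadcasting) so the counterparty's latest state can confirm on-chain.");
    panic!("We have fallen behind");
}
```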

TonyGiorgio commented 1 year ago

I had the other peer close the channel, it's been over 100 blocks, and the funds are on chain. However, I'm still hitting this issue. It's quite clear that panicking is a terrible way to handle this, as it makes the entire node inoperable.

wpaulino commented 1 year ago

You're still hitting the panic because peer connections are allowed. As the log suggests, you should restart without establishing connections and force close, but of course this may not be practical for certain application deployments.

@TheBlueMatt any reason we don't just force close (without broadcasting) for the user here? If it's due to a bug, the node operator should realize their channel is gone and hopefully check their logs for an explanation.
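
A minimal sketch of the manual step the panic message currently demands, using the existing public API (`force_close_without_broadcasting_txn` is a real `ChannelManager` method in LDK of this era, ~0.0.115; the wrapper and the `MyChannelManager` alias are illustrative):

```rust
use bitcoin::secp256k1::PublicKey;
use lightning::util::errors::APIError;

// `MyChannelManager` is an app-specific alias for the generic ChannelManager.
fn abandon_stale_channel(
    channel_manager: &MyChannelManager,
    channel_id: [u8; 32],
    counterparty_node_id: PublicKey,
) -> Result<(), APIError> {
    // Closes the channel locally *without* broadcasting our stale commitment
    // transaction - broadcasting it would let the counterparty claim all the
    // channel funds via the revocation path.
    channel_manager.force_close_without_broadcasting_txn(&channel_id, &counterparty_node_id)
}
```

This must be run from a process that has not let the peer reconnect, which is exactly the operational difficulty discussed below.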

TonyGiorgio commented 1 year ago

this may not be practical for certain application deployments

I'm not sure it would be practical for any application deployment. When an LDK node launches, it should be expected to reconnect to the peers it has channels with by default. A user manually going into settings or making an API call for each channel partner is not something I think anyone would suggest.

So with auto-connection behavior being the norm, and quick auto-connections prioritized so as not to slow down the user (especially mobile app users), I don't really see an opportunity to take any action. I have even tried to go into settings and disconnect the peer as quickly as possible before it attempted to connect. Sadly, we all created software that is too quick.
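
An illustrative sketch of that startup pattern (not code from any particular wallet; `MyChannelManager` is an app-specific alias for LDK's generic `ChannelManager`, and `connect_peer` stands in for the transport layer, e.g. lightning-net-tokio's `connect_outbound`):

```rust
async fn reconnect_channel_peers(channel_manager: &MyChannelManager) {
    for chan in channel_manager.list_channels() {
        // Fires immediately on launch, so a channel_reestablish exchange
        // (and thus the panic above) can happen before any user could
        // realistically intervene.
        connect_peer(chan.counterparty.node_id).await;
    }
}
```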

I can't think of any way to provide a good user experience around quick reconnections while working around this bug. Unfortunately, without async traits on state changes, we're a little more prone to data inconsistencies than we would like to be. So being able to work around force closures like this is a necessity, which is hard to do when it crashes.

wpaulino commented 1 year ago

I agree, but I think, given the flexibility of LDK, the idea here would be to write a custom recovery tool that you can run separately from the usual node application.

TonyGiorgio commented 1 year ago

Needless to say that's pretty extreme to expect users to both understand what happened and work around it.

moneyball commented 1 year ago

Can there be a default implementation that handles this for most users, flexible enough that users who want to override it can substitute their own?

TheBlueMatt commented 1 year ago

Sorry for the delay responding here.

@TheBlueMatt any reason we don't just force close (without broadcasting) for the user here? If it's due to a bug, the node operator should realize their channel is gone and hopefully check their logs for an explanation.

Maybe? I think the theory with a panic here is that it gives developers a chance to fix the bug before they go to production, as it's really something we shouldn't ever hit, ever.

Sadly this isn't a really great solution - the application cannot automatically take any broadcasting action to get the funds back; it can only ask the peer, which may ignore the request.

At this point we're really doing "the SCB thing" (static channel backups), which in practice results in lots of users only getting some of their money back. I'd like to better understand the options we have to address it in other ways before falling back to the SCB approach.

Unfortunately, without async traits on state changes, we're a little more prone to data inconsistencies than we would like to be.

Not sure I understand this one - lack of async persistence shouldn't result in inconsistency, only more difficult application programming - can you describe in a bit more detail how this corruption happened and why it can't be fixed?

TonyGiorgio commented 1 year ago

Maybe? I think the theory with a panic here is that it gives developers a chance to fix the bug before they go to production, as it's really something we shouldn't ever hit, ever.

I'm sure even in rare disk-failure events, LND has probably had its fair share of out-of-sync state updates, especially on Raspberry Pis.

lack of async persistence shouldn't result in inconsistency, only more difficult application programming - can you describe in a bit more detail how this corruption happened and why it can't be fixed?

We can't call any blocking code, so ensuring persistence through sync calls is difficult when IndexedDB is async. We essentially have to spawn_local to send the write to the background while we immediately return, so state-persistence issues can occur.
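
For illustration, the fire-and-forget pattern being described looks roughly like this (a sketch assuming a wasm target with wasm-bindgen-futures and web-sys's console feature; `write_to_indexeddb` is a placeholder, not a real API):

```rust
use wasm_bindgen_futures::spawn_local;

fn persist_channel_state(key: String, value: Vec<u8>) {
    // LDK's persistence hooks are synchronous, but IndexedDB is async-only,
    // so the write has to be pushed to the background...
    spawn_local(async move {
        if write_to_indexeddb(&key, &value).await.is_err() {
            // ...and a failure lands here *after* LDK was already told the
            // write succeeded - which is exactly how state can fall behind.
            web_sys::console::error_1(&format!("persist failed: {key}").into());
        }
    });
    // Returning immediately: LDK proceeds as if the state were durable.
}

async fn write_to_indexeddb(_key: &str, _value: &[u8]) -> Result<(), ()> {
    Ok(()) // placeholder for the real async IndexedDB write
}
```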

I might be able to use crossbeam channels inside the sync function, so I'll try that. Otherwise we will have to write to localStorage, since that works synchronously, and then check that instead.

TonyGiorgio commented 1 year ago

I think we've been able to work around some of this by saving to localStorage with a sync call and always reading from that first. localStorage is not the recommended way to save data in browsers, but IndexedDB only offers an async API, so we can't always save there directly.
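
A minimal sketch of that dual-write workaround (assuming a wasm target with web-sys's Window and Storage features; `mirror_to_indexeddb` is a placeholder):

```rust
use wasm_bindgen_futures::spawn_local;

fn persist_durably(key: &str, value: &str) {
    // Synchronous localStorage write first, so the state is durable before
    // this function returns control to LDK.
    let storage = web_sys::window()
        .and_then(|w| w.local_storage().ok().flatten())
        .expect("localStorage unavailable");
    storage.set_item(key, value).expect("localStorage write failed");

    // Mirror to IndexedDB in the background; reads always check the
    // localStorage copy first, so a lost background write is caught.
    let (key, value) = (key.to_owned(), value.to_owned());
    spawn_local(async move {
        let _ = mirror_to_indexeddb(&key, &value).await;
    });
}

async fn mirror_to_indexeddb(_key: &str, _value: &str) -> Result<(), ()> {
    Ok(()) // placeholder for the real async IndexedDB write
}
```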

I still don't think panicking here is the best approach. It's crippling the ability of the wallet dev to provide any support at all to users this happens to. What is going to happen when state is saved remotely, the remote server misses an update, and the node pulls it down one state behind and panics? I assume VSS wants to be capable of saving remote state, and there are no 100% guarantees around that always working.

Users are expected to somehow take the data, go into some external application to "do something", and recover their funds? All because a panic happened instead of a disconnection? It doesn't make any sense to me why that's the recommended option. By panicking, it's impossible to even open the app to tell what is going on.

TheBlueMatt commented 1 year ago

Let's revisit this post-async/117. While I agree panicking is likely not the right answer, I'm not sure what is, and until then it's the conservative answer (and forces devs to fix the bug if there is one).

TheBlueMatt commented 1 year ago

Don't think we need to close this, just tag it 0.1 :)