lightningnetwork / lnd

Lightning Network Daemon ⚡️
MIT License
7.64k stars 2.07k forks source link

Potential deadlock between RequestRoute and LightningChannel commitment state machine #9133

Open bjohnson5 opened 6 days ago

bjohnson5 commented 6 days ago

Discussed in https://github.com/lightningnetwork/lnd/discussions/9060

Originally posted by **bjohnson5** September 3, 2024 I believe I have discovered a potential deadlock situation, but I am relatively new to LND and wanted to discuss it before opening an issue, to be sure that I am not missing something. This is when using the bbolt backend database. In `lnwallet/channel.go` the `LightningChannel` struct defines several methods that the comments explain as the "state machine which corresponds to the current commitment protocol wire spec". These methods are: `SignNextCommitment`, `ReceiveNewCommitment`, `RevokeCurrentCommitment`, and `ReceiveRevocation`. Each of these will first lock the `LightningChannel`: `lc.lock()` and then they will typically attempt to update the channel db. When updating the channel db, sometimes the database must be re-sized and re-mapped to memory using the `mmap` function in bbolt's `db.go` file. This function first attempts to lock the `mmaplock` mutex. This is all fine except that if one of the state machine functions is called while the node is trying to find a route a deadlock could occur. The `RequestRoute` function in `payment_session.go` will get a routing graph from the db and this will acquire the `mmaplock` on the db (for good reason, it needs to be sure the db is not re-mapped while it is using it to find a route). It will eventually call functions of the `LightningChannel` struct in order to find bandwidth, balances, etc... It is possible that these functions are locked by one of the state machine methods and that state machine method could be stuck waiting on the `mmaplock`. For example: **Thread1:** RequestRoute -> NewGraphSession() -> `mmaplock.lock()` ----------------------------------> p.pathFinder -> availableChanBandwidth -> *attempts to call LC functions, blocks waiting on `lc.lock()`* **Thread2:** ReceiveRevocation -> `lc.lock()` -----------------------------------------> AdvanceCommitChainTail -> *attempts to update db, blocks waiting on `mmaplock.lock()`* If anyone has experience in this area, please let me know if this is all correct and if I should open an issue. Thanks.
ziggie1984 commented 6 days ago

So analysed this issue, and I think there is absolutely no reason we keep an open transaction as part of the session implementation. Its only used there:

https://github.com/lightningnetwork/lnd/blob/84c91f701cab21c70c5f4292b82a8bcce9a79e63/channeldb/graphsession/graph_session.go#L86-L90

and I think there is no problem in reading the db once at this point fetching all the channels for the local node and closing the read transaction.

Should be fairly easy to fix.