lightningnetwork / lnd

Proposal: Reliable HTLC update forwarding #385

Closed · jimpo closed this 6 years ago

jimpo commented 6 years ago

Problem

When an intermediary node receives an HTLC update from a peer, it must first negotiate with the peer to irrevocably commit the HTLC, then forward the HTLC update to the next node in the circuit. This is currently implemented, but it may fail if lnd panics or spontaneously dies during the handoff from one link to the next (through the switch). In that case, it is possible for a packet that should be forwarded to get dropped even if the node comes back online in a timely manner.

Requirements

The primary goal is to ensure consistency, meaning:

  1. Avoid corruption and inconsistent states in on-disk data.
  2. Do not drop messages that are to be forwarded, even if lnd panics or crashes at an arbitrary point during execution.
  3. Ensure that the circuit map is persisted so that HTLC fail and settle updates are properly routed back through the circuit.
  4. Maintain current logical code separation between channel, link, and switch.

We would also like to minimize disk I/O, especially reads/writes in a main link or switch goroutine that block message processing.

Possible solution

The idea is to store more of the update logs on disk and use coordinated access by different subsystems to hand off ownership of different portions of the log. Each update log is represented in BoltDB as a bucket where each entry is stored as a separate serialized value keyed by a 64-bit integer index.
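
As a concrete illustration, here is a minimal sketch of such a log bucket using the bbolt fork of BoltDB; the appendEntry helper and bucket layout are hypothetical, not taken from lnd:

```go
package updatelog

import (
	"encoding/binary"

	bolt "go.etcd.io/bbolt"
)

// appendEntry writes a serialized update to the log bucket under the next
// sequential 64-bit index and returns the index used. BoltDB sorts keys
// byte-wise, so big-endian encoding keeps entries in index order.
func appendEntry(db *bolt.DB, logBucket []byte, entry []byte) (uint64, error) {
	var index uint64
	err := db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists(logBucket)
		if err != nil {
			return err
		}

		// NextSequence yields a monotonically increasing 64-bit
		// counter, which doubles as the log index.
		index, err = b.NextSequence()
		if err != nil {
			return err
		}

		var key [8]byte
		binary.BigEndian.PutUint64(key[:], index)
		return b.Put(key[:], entry)
	})
	return index, err
}
```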

Let's look first at the local update log. Each entry is an update message, and when it enters the log it is assigned a sequentially increasing 64-bit index. There are four important checkpoints in the local log: the last index, the processed index, the committed index, and the ACKed index, where last >= processed >= committed >= ACKed.
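
For concreteness, the four checkpoints could be modeled as below; the struct and field names are illustrative, not lnd's:

```go
// logIndexes tracks the four checkpoints into the local update log.
type logIndexes struct {
	last      uint64 // highest index written to the log
	processed uint64 // handed off from the link to the channel
	committed uint64 // covered by a signed commitment
	acked     uint64 // irrevocably committed via the remote revocation
}

// valid asserts the ordering invariant last >= processed >= committed >= acked.
func (l *logIndexes) valid() bool {
	return l.last >= l.processed &&
		l.processed >= l.committed &&
		l.committed >= l.acked
}
```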

The entries between processed and last are called the "inbox". These are owned and stored in memory by the link and have not yet been handed off to the channel. HTLC adds in the inbox do not have an ID assigned yet. Once the link hands an update off to the channel, the entry is removed from the link's in-memory inbox and the (conceptual) processed index is incremented. The update can be either accepted or rejected by the channel. If accepted, the entry is added to the channel's in-memory update log; otherwise it is sent to a separate goroutine that deletes it from disk concurrently (which we'll call the inbox garbage collector). Note that the inbox garbage collector can batch deletes into BoltDB batch transactions. If an HTLC add is accepted, it gets an ID assigned in memory, which is not written to disk until after it is ACKed. Unassigned IDs are represented on disk as -1.
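
A rough sketch of what the inbox garbage collector could look like, assuming rejected entries arrive as log indexes on a channel and the caller does wg.Add(1) before queueing each rejection; db.Batch coalesces deletes from concurrent callers into shared transactions:

```go
package updatelog

import (
	"encoding/binary"
	"log"
	"sync"

	bolt "go.etcd.io/bbolt"
)

// inboxGC drains the indexes of rejected entries and deletes them from the
// log bucket. Each delete runs through db.Batch, which coalesces calls from
// concurrent goroutines into a single transaction. wg lets the channel wait
// for all pending deletes before advancing the committed index.
func inboxGC(db *bolt.DB, logBucket []byte,
	rejected <-chan uint64, wg *sync.WaitGroup) {

	for index := range rejected {
		var key [8]byte
		binary.BigEndian.PutUint64(key[:], index)

		err := db.Batch(func(tx *bolt.Tx) error {
			b := tx.Bucket(logBucket)
			if b == nil {
				return nil
			}
			return b.Delete(key[:])
		})
		if err != nil {
			// A real implementation would surface this error.
			log.Printf("inbox gc: delete %d: %v", index, err)
		}
		wg.Done()
	}
}
```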

All entries between the ACKed pointer and the processed pointer are owned by the channel. When we want to sign a new commitment, the channel needs to advance the on-disk committed log index. First, it waits for the inbox garbage collection to complete using a wait group (there must be no rejected entries on disk with indexes below the committed index), then writes any dirty log entries (e.g. assigning IDs to added HTLCs), and finally updates the committed index.
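
Sketched in code, with the checkpoint stored as an ordinary key in the same bucket (the logBucket and committedIndexKey names are hypothetical):

```go
package updatelog

import (
	"encoding/binary"
	"errors"
	"sync"

	bolt "go.etcd.io/bbolt"
)

var committedIndexKey = []byte("committed-index")

// advanceCommitted waits for the inbox GC to finish deleting rejected
// entries, flushes dirty entries (e.g. adds that just had IDs assigned),
// and bumps the on-disk committed index, all before the new commitment is
// signed.
func advanceCommitted(db *bolt.DB, logBucket []byte, gcDone *sync.WaitGroup,
	dirty map[uint64][]byte, newCommitted uint64) error {

	// No rejected entries may remain on disk below the committed index.
	gcDone.Wait()

	return db.Update(func(tx *bolt.Tx) error {
		b := tx.Bucket(logBucket)
		if b == nil {
			return errors.New("update log bucket missing")
		}

		// Rewrite any dirty entries in place, keyed by log index.
		for index, entry := range dirty {
			var key [8]byte
			binary.BigEndian.PutUint64(key[:], index)
			if err := b.Put(key[:], entry); err != nil {
				return err
			}
		}

		// Persist the new committed checkpoint.
		var val [8]byte
		binary.BigEndian.PutUint64(val[:], newCommitted)
		return b.Put(committedIndexKey, val[:])
	})
}
```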

When we receive a revocation, the on-disk ACKed index is moved forward, any forwarded HTLCs are added to the switch's in-memory circuit map, settled HTLCs are removed from the circuit map, and HTLC info is written to a persistent revocation log so that the witness scripts of revoked transactions can be reconstructed. The garbage collector then compacts the on-disk ACKed entries: it can remove HTLC adds and removes where the settle is ACKed, and it deletes the routing onions from updates that have been ACKed. Since ACKed HTLC adds can be removed from the log, the compaction process also needs to write to the DB the index of the lowest HTLC in the log with an unassigned ID; any later HTLCs can be assumed to have incrementing IDs when read off disk on boot. This all happens concurrently in batchable transactions.
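
A sketch of that compaction pass, assuming the ACKed region is scanned with a cursor; the isSettledAdd predicate stands in for real lookup logic, and onion stripping is elided:

```go
package updatelog

import (
	"encoding/binary"

	bolt "go.etcd.io/bbolt"
)

// compactACKed prunes the ACKed region of the log. Adds whose settle has
// been ACKed are deleted outright; surviving entries would have their
// routing onions stripped (elided here). A real pass would also persist the
// lowest index still holding an HTLC add with an unassigned ID (-1), so
// that IDs can be recounted on boot.
func compactACKed(db *bolt.DB, logBucket []byte, acked uint64,
	isSettledAdd func(index uint64) bool) error {

	return db.Batch(func(tx *bolt.Tx) error {
		b := tx.Bucket(logBucket)
		if b == nil {
			return nil
		}

		c := b.Cursor()
		for k, _ := c.First(); k != nil; k, _ = c.Next() {
			if len(k) != 8 {
				continue // skip checkpoint keys
			}
			index := binary.BigEndian.Uint64(k)
			if index > acked {
				break // only ACKed entries are compacted
			}

			if isSettledAdd(index) {
				// Cursor.Delete removes the current entry
				// without restarting the iteration.
				if err := c.Delete(); err != nil {
					return err
				}
			}
		}
		return nil
	})
}
```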

Now we'll look at the remote log. This just has a committed index and a processed index, neither of which is persisted. Only committed entries are stored on disk.

When we send a revocation, we write all in-memory entries past the current committed index to disk. The major difference from the local log is the pruning. After sending the revocation, we signal the log's compaction goroutine, which in a single BoltDB transaction compacts the log and writes any forwarding updates to the on-disk inboxes of the appropriate links. This gives us reliable forwarding. Only after this compaction and forward write is complete do we signal the links to process the new updates in their inboxes. While this introduces write latency to the forwarding of packets, it happens without blocking the main loops of any links or the switch.
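
A sketch of that handoff, assuming forwards are grouped by outgoing link and a notify callback wakes each link; the inbox bucket naming is hypothetical:

```go
package updatelog

import (
	"encoding/binary"

	bolt "go.etcd.io/bbolt"
)

// forwardOnRevocation is a sketch of the remote-log compaction goroutine.
// In one transaction it prunes the remote log (elided) and appends every
// update to be forwarded into the inbox bucket of its outgoing link, so a
// crash after the commit can never lose a forward. Links are only signalled
// once the transaction has committed.
func forwardOnRevocation(db *bolt.DB,
	forwards map[string][][]byte, notify func(linkID string)) error {

	err := db.Update(func(tx *bolt.Tx) error {
		// Pruning of the remote log would happen here, in the same
		// transaction as the inbox writes.

		for linkID, updates := range forwards {
			inbox, err := tx.CreateBucketIfNotExists(
				[]byte("inbox-" + linkID))
			if err != nil {
				return err
			}
			for _, update := range updates {
				seq, err := inbox.NextSequence()
				if err != nil {
					return err
				}
				var key [8]byte
				binary.BigEndian.PutUint64(key[:], seq)
				if err := inbox.Put(key[:], update); err != nil {
					return err
				}
			}
		}
		return nil
	})
	if err != nil {
		return err
	}

	// Signal the links only after the forwards are durably on disk.
	for linkID := range forwards {
		notify(linkID)
	}
	return nil
}
```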

Restoring state on boot

On boot, we are able to reconstruct the following in-memory state just from the update logs:

  1. The link's in-memory inbox (the entries between the processed and last indexes).
  2. The channel's in-memory update log (the entries between the ACKed and processed indexes).
  3. The switch's in-memory circuit map (ACKed HTLC adds whose settle has not yet been ACKed).
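
One plausible reconstruction in code, assuming only the committed and ACKed checkpoints are persisted, so the conceptual processed index collapses to the committed index on restart (key names are illustrative):

```go
package updatelog

import (
	"encoding/binary"

	bolt "go.etcd.io/bbolt"
)

var (
	committedKey = []byte("committed-index")
	ackedKey     = []byte("acked-index")
)

// restoreOnBoot rebuilds the channel's update log and the link's inbox by
// scanning the entries above the persisted ACKed checkpoint. Entries up to
// the committed index belong to the channel; anything above it returns to
// the inbox to be reprocessed.
func restoreOnBoot(db *bolt.DB, logBucket []byte) (channelLog, inbox [][]byte, err error) {
	err = db.View(func(tx *bolt.Tx) error {
		b := tx.Bucket(logBucket)
		if b == nil {
			return nil
		}

		var acked, committed uint64
		if v := b.Get(ackedKey); v != nil {
			acked = binary.BigEndian.Uint64(v)
		}
		if v := b.Get(committedKey); v != nil {
			committed = binary.BigEndian.Uint64(v)
		}

		var start [8]byte
		binary.BigEndian.PutUint64(start[:], acked+1)

		c := b.Cursor()
		for k, v := c.Seek(start[:]); k != nil && len(k) == 8; k, v = c.Next() {
			entry := append([]byte(nil), v...) // copy out of the tx
			if binary.BigEndian.Uint64(k) <= committed {
				channelLog = append(channelLog, entry)
			} else {
				inbox = append(inbox, entry)
			}
		}
		return nil
	})
	return channelLog, inbox, err
}
```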

Package isolation

This solution minimizes writes to disk by having multiple subsystems share the update logs and perform handoffs by simply incrementing an on-disk pointer instead of doing a delete and a write. Note that after a message enters the inbox of the local log, the only blocking writes to log entries are the flagging of rejected updates. Any writes that can be performed concurrently are, and writes are idempotent because entries are keyed by a sequentially incrementing log index. Since write access is only allowed for one subsystem at a time, partitioned by the various indexes, this should be safe. However, it probably makes sense to move the update log into a public struct, or possibly its own package, that can be called into by both the link and the channel.
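
For instance, such a package could expose a narrow interface along these lines (method names are purely illustrative):

```go
package updatelog

// UpdateLog is a sketch of the surface a shared update-log package might
// expose to both the link and the channel. Each Advance method moves one
// on-disk checkpoint, handing ownership of a log region between subsystems.
type UpdateLog interface {
	// Append writes an entry into the inbox region, returning its index.
	Append(entry []byte) (uint64, error)

	// AdvanceProcessed hands entries up to index from the link to the
	// channel.
	AdvanceProcessed(index uint64) error

	// AdvanceCommitted flushes dirty entries and persists the committed
	// checkpoint prior to signing.
	AdvanceCommitted(index uint64) error

	// AdvanceACKed persists the ACKed checkpoint on revocation and
	// triggers compaction.
	AdvanceACKed(index uint64) error
}
```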

cfromknecht commented 6 years ago

@jimpo thanks for the detailed overview and thorough exploration of the design space! I'm going to take a stab at getting this issue :)