zkvm: constant-size timeouts for multi-hop payment channels

Credits

This is a ZkVM-specific implementation of the idea originally described in the Sprites paper by Andrew Miller, Iddo Bentov, Ranjit Kumaresan, Christopher Cordi and Patrick McCorry (2017): https://arxiv.org/pdf/1702.05812.pdf

Intro

Lightning Network uses HTLCs ("hash+timelock contracts") to ensure an atomic update of balances across multiple ledgers, where each ledger is private to a pair of peers. The purpose of an HTLC is to guarantee that "I send 1 coin only if I receive 1 coin". First, all nodes along a multi-hop route enter HTLC contracts one by one, starting with the sender. Then, starting with the recipient, they share the preimage back along the route. Once a node knows the preimage, it has assurance that both its incoming and outgoing HTLCs can be resolved, so it can independently upgrade both contracts to an unconditional state (with the HTLC condition effectively stripped off).

Problem

For each node along a multi-hop route, there must be a safe difference between the timeouts of its outgoing and incoming HTLCs: if the outgoing HTLC can be resolved with a preimage up to some time T, the node must have until T+delta to resolve the incoming HTLC. Delta is typically many hours (e.g. 12 hours), so that a closing transaction can be published before the money is reverted. The same applies to the failure case: the incoming HTLC must not fail until the outgoing HTLC fails.

        A  ---------> B ---------> C ---------> D

HTLC:        36 hr        24 hr        12 hr

This means that an N-hop route has a maximum timeout of N*delta on the sender's side, so the sender faces a multi-day lock-up of funds if the payment fails. This makes long routes too risky in terms of the time value of the capital wasted, which in turn increases the amount of capital each node must lock up to ensure good connectivity with short routes.
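The arithmetic behind the diagram can be sketched in a few lines of Python. The function name and the 12-hour default are only illustrative; they assume the same delta on every hop, as in the example above.

```python
def classic_htlc_timeouts(hops: int, delta_hours: int = 12) -> list[int]:
    """Timeout in hours for the HTLC on each hop, sender's hop first."""
    # The last hop gets one delta; every earlier hop needs one more delta
    # than the hop after it, so the sender's hop ends up with hops * delta.
    return [delta_hours * (hops - i) for i in range(hops)]


print(classic_htlc_timeouts(3))      # [36, 24, 12]
print(classic_htlc_timeouts(20)[0])  # 240
```

With 20 hops the sender's HTLC stays locked for up to 240 hours, i.e. 10 days.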

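For context, a realistic network with minimal capital overhead requires about log2(N) hops on average, so ≈20 hops for 1 million nodes. It's a lose-lose scenario: either routes are long and funds are locked for days, or routes are short and every node locks up more capital.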

Abstract definition of O(1) HTLC

How do we replace O(n) timeout for n hops with O(1)? We need to guarantee that the use of the unlock beacon ("preimage is revealed" in classic HTLCs) is split in two parts instead of being packaged as one:

reveal of the beacon before T1
use of the beacon before T2

This ensures that if the beacon is not revealed by T1, it cannot possibly be used by anyone at any point in the route. But if it is revealed by anyone, there's T2-T1 extra time for everyone to use it for resolution of their own contracts, simultaneously.
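A minimal sketch of this two-phase rule, as seen from any single hop. The names `Deadlines` and `resolve` are hypothetical, times are plain integers, and treating both deadlines as inclusive is an assumption made only for the illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Deadlines:
    t1: int  # latest time at which the beacon may be revealed
    t2: int  # latest time at which a revealed beacon may be used


def resolve(d: Deadlines, revealed_at: Optional[int], now: int) -> str:
    """Outcome of one HTLC at time `now`, given when (if ever) the beacon appeared."""
    # A beacon revealed after T1 does not count at all.
    beacon_valid = revealed_at is not None and revealed_at <= d.t1
    if beacon_valid and revealed_at <= now <= d.t2:
        return "claim"    # the beacon exists and there is still time to use it
    if now > d.t2:
        return "refund"   # rollback is only allowed after T2
    return "pending"      # no usable beacon yet, but too early to roll back


d = Deadlines(t1=12, t2=24)
print(resolve(d, revealed_at=11, now=20))    # claim
print(resolve(d, revealed_at=None, now=25))  # refund
```

Every hop on the route evaluates the same `Deadlines`, which is why the timeout no longer grows with the number of hops.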

There are two more requirements:

Anyone should be able to create a beacon once the preimage is known.
Anyone should be able to use anyone's beacon once it's revealed.

The first requirement provides security in the window after the preimage is cooperatively disclosed by the recipient and before nodes re-sign their payment channels with the HTLC condition removed. If you know you can create the beacon yourself at any time, it is safe to sign off on an unconditional outgoing payment before the incoming one is signed off.

The second requirement is key to the O(1) timeout: if someone reveals a beacon B' right before T1, it may be too late for anyone else to create their own B'', so they must be able to use B' as-is and complete the incoming HTLC before T2 (when the rollback would be allowed).

ZkVM implementation

To implement such an HTLC, we need to wrap it into an issuable asset (insert NFT joke here). The flavor ID is defined by the issuance program, which checks the tx.maxtime against the global timeout (T1) and checks a hash preimage. The payment channels replace the HTLC condition with a "proof of utxo existence" for the asset with that ID.

If nodes propagate preimages normally, then all nodes know that they can issue such a token in the same transaction used to close the channel (the Merkle path consists of simply the contract ID at that point). No one needs to actually issue a token if everyone cooperates to update balances with the HTLC conditions removed.
If some node successfully issues such a token right before T1 expires, other nodes can observe that and use the Utreexo Merkle path to it as a way to resolve their HTLCs before the timeout T2.
If by time T1 no one has issued such a token, then it is not issuable anymore and everyone safely cancels their HTLCs after T2.
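The three cases above can be illustrated with a toy model. This is not ZkVM code: `BeaconProgram`, `flavor_id`, `can_issue` and `htlc_resolves` are hypothetical names, SHA-256 stands in for whatever hash the real issuance program would use, and a plain Python set stands in for the UTXO set and its Utreexo proofs.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class BeaconProgram:
    payment_hash: bytes  # hash of the payment preimage (SHA-256 assumed here)
    t1: int              # global deadline for issuing the beacon token

    def flavor_id(self) -> bytes:
        # Toy stand-in for "flavor ID is defined by the issuance program":
        # the ID commits to both the payment hash and T1.
        data = b"beacon" + self.payment_hash + self.t1.to_bytes(8, "little")
        return hashlib.sha256(data).digest()

    def can_issue(self, preimage: bytes, tx_maxtime: int) -> bool:
        # Issuance requires a tx bounded by T1 and the correct preimage.
        on_time = tx_maxtime <= self.t1
        return on_time and hashlib.sha256(preimage).digest() == self.payment_hash


def htlc_resolves(program: BeaconProgram, utxo_flavors: set[bytes]) -> bool:
    # Stand-in for the "proof of utxo existence" checked by the channel.
    return program.flavor_id() in utxo_flavors


preimage = b"secret"
program = BeaconProgram(hashlib.sha256(preimage).digest(), t1=1000)
utxos: set[bytes] = set()
if program.can_issue(preimage, tx_maxtime=999):
    utxos.add(program.flavor_id())  # some node issues the token before T1

print(htlc_resolves(program, utxos))                 # True
print(program.can_issue(preimage, tx_maxtime=1001))  # False
```

Because the ID depends only on the payment hash and T1, every node on the route can compute it and check for the same token, whoever issued it.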

The solution yields a constant 2-interval timeout that scales to any number of hops (e.g. 24 hours, if we assume a 12-hour interval is necessary for reacting to on-chain events). This is the same overhead as a traditional 2-hop route and strictly better for more hops.

A constant-size timeout enables the network to safely use minimal capital lock-up with a binary tree topology, where each node has at most 3 channels: 2 "down" and 1 "up", with log2(n) hops required to reach any node.
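As a rough comparison, the sketch below takes the log2(n) hop estimate and the 12-hour interval from the text and computes the sender's worst-case lock-up under both schemes. The function name is illustrative only.

```python
import math


def lockup_hours(nodes: int, delta_hours: int = 12) -> tuple[int, int, int]:
    """Return (hops, classic sender timeout, constant-size timeout) in hours."""
    hops = math.ceil(math.log2(nodes))  # hop estimate used in the text
    classic = hops * delta_hours        # N*delta with classic HTLCs
    constant = 2 * delta_hours          # two intervals: reveal by T1, use by T2
    return hops, classic, constant


print(lockup_hours(1_000_000))  # (20, 240, 24)
print(lockup_hours(4))          # (2, 24, 24)
```

For 1 million nodes the classic scheme locks the sender's funds for up to 10 days, against 24 hours with the constant-size timeout; at 2 hops the two are equal.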
