lightningdevkit / rust-lightning

A highly modular Bitcoin Lightning library written in Rust. It's rust-lightning, not Rusty's Lightning!
Other
1.16k stars 366 forks source link

`Scorer` Next Steps #1170

Closed TheBlueMatt closed 2 years ago

TheBlueMatt commented 2 years ago

In 0.0.103 we landed a pretty naive scorer to avoid re-routing over the same channel over and over again on retries. #1166 adds a bit better scoring, but we've still got a ways to go. We should probably, at least

renepickhardt commented 2 years ago

While you address some questions that are not related to liquidity (e.g. temporary channel failure can mean peer offline) I think it is reasonable to look at the liquidity case first. The updates of the probabilities in the liquidity case are so stark that they should cover the peer offline case in most cases reasonably well (unless we really run into sending all the sats that the network might support which is a very hard edge case anyway in which we probably just open a new channel)

Thus I think most of the questions and wishes that you are addressing here (but in particular your second bullet point) are explained in the general case for an arbitrary probability function in the probabilistic payment delivery paper and spilled out explicitly for the uniform case that we observed on the network in the optimally reliable and cheap payment flows paper . As papers like those are always written a bit special and technical I will take the time to walk you through the main results that address your concerns

Let's start with a bit of Notation for the sake of shorter text I will do it not as formal as in those papers and refer for the details to the sections. also since this is not a paper and just a quick summeray I don't guarantee that all I will write here is free of typos so if in doubt double check with the papers that have been proof read by more than just my eyes.

Modelling the Liquidity (channel Balance) as a uniform distribution

Chapter 3 of the first paper states that we can model the uncertainty of the liquidity in a channel (aka channel balance) as a probability function with a random variable P(X). We define the Failure probability of a payment of amount a of a channel as P(X<a) (which means the random variable takes any value smaller than a meaning the likelihood that the local balance is smaller than a in which case the payment can't be forwarded). We discuss in future work that of course other features like connectivity, uptime etc could be modeled probabilisticly and one would go to the join distribution.

As mentioned before all mainnet experiments hint that we can assume a uniform distribution for the uncertainty of the liquidity / channel balance in which case the failure probability for a channel of capacity c can be written as P(X<a)=a/(c+1)

Since a success exists if there is no failure we can write the success probability as P(X>=a) = 1 - P(X<a) = 1-a/(c+1) = (c+1-a)/(c+1) Again this is the uniform case which is reasonable to choose unless we have other data, indication or knowledge.

In Section 2.2. of the Second Paper we Introduce a term that we should already have used in the first Paper which is the Uncertainty Network. While Gossip gives us the Channel Graph (which we could and should already understand as a directed network) we can use the Channel Graph to create the Uncertainty network which is certainly a directed graph. For this we Assign the Probability Distribution and Random Variable to each edge of the Channel Graph.

Learning from failed and success full attempts (uniform case) on the Uncertainty Network.

Let's ignore for a moment the fact that we should forget learned information over time and start to ask the question what can we learn from a failed or successful payment attempt.

This is discussed in very general terms with entropy considerations and information gain (Kulback Leibler Divergance) in Section 4.3 of the first paper and very specifically with respect on how to update the Uncertainty Netork at the end of chapter 3 in the Second Paper.

Let's say a payment delivery algorithm selects a route and wants to add an HTLC with amount h to a channel of capacity c there are several cases that we can observe to update our uncertainty about the network

  1. Payment failed at an upstream channel: This means the payment didn't reach our channel. We did not learn anything and the Uncertainty stays the same meaning the success probability for making a payment of size a stays at P(X>=a) = (c+1-a)/(c+1)
  2. Payment failed at the channel: This means the channel of capacity c has at most h-1 satoshi in it. In Terms of Probability theory we can ask ourselves what is the success rate for a subsequent payment of size a given the event X<h (the htlc h has failed) which can be expressed as P(X>=a | X<h). Here we see because of the condition that for a>=h the proabbility has to be 0. Computint the conditional probability for the remainder in the uniform case we can se P(X>=a | X < h) = (h+1 -a) / (h+1). Note that this is the same as the success probability for a channel of capacity h.
  3. Payment failed at a downstream channel: This means that the channel of caapcity c was able to route h satoshi and has at least h satoshis as a local balance. This means that P(X>=a | X>=h) = 1 for all amounts a <=h and in the unform case for all larger values of a the conditional probability materializes to: P(X>=a)/P(X>=h) = ((c+1-a)/(c+1))/((c+1-h)/(c+1)) = (c+1-a)/(c+1-h) Note that this fraction is always smaller than 1 as a >=h
  4. The case that the payment was successfull or did not return an error. Following the same thoughts and the first bullet point at the end of section 3 in the second paper we know that we have to look at P(X>=a + h | X >=h) which in the uniform case materializes to ( (c-h) + 1 - a)/( (c - h) + 1)

Of course all those thoughts can generalize more or less easily (at least in principle it is easy but just technical) to cases where more than 1 htlc is in flight and/or has been returned or not and what so ever. @stefanwouldgo and I have created a more or less stable implementation for this in our mainnet tests of optimally reliable payment flows so I can tell you it is certainly possible.

Provencance scores

I think measuring a history of successful payments similar to what mission control does is reasonable in general. (I think that is what you addressed in your first bullet point) I have a half ready paper on how to count this properly lying in my drawer. The gist is to do the following two things.

  1. Learn a distribtion of provencance scores from the few samples that we get during running the node. This is actually very different to LND as they just learn a simple provenance score which they call probability that measures a statistic of past success and failures.
  2. create the joint distribtion between the provanance score and the uncertainty distribution as described above.

My feeling (!) is that the provenenace scores might be interesting for LSPs but not really for end users wallets. Also I would assume the effect is neglelctable with comparision to the uncertainty rising from the liquidity.

Forgetting the learned information

This is something that we mention as future work in our papers. Basically we need a way to transform the updated probability distribution back to our fully uncertain distribution. This can be down by decying the probed / tested / learned amounts exponentially over time towards zero (the speed of the decay would have to be measured emperically) And obviously one could also do this linearly instead of exponentially. I would have loved to do research about this as I think this is a necessary and an easy apple to pick. In General I suggest to look at Homotopy to learn how to transform the updated probabilities back to the original probabilities.

Conclusion and general thoughts:

Of course it is needless to say that instead of maximizing probablities which are multiplicative along all channels of a (multi)path. we can switch to negative logs of the probabilities and minimize them which are additive across all involved channels.

The main strength of this framework is that it avoids making arbitrarty choices like "introduce a bias if a payment is smaller or larger than a privious attempt" Here we just update the uncertainty in the information theoretic correct sense and the formulars are actually wicket easy for the uniform case.

Also all experiments done indicate that uncertainty of liquidity is by far the most dominant feature. So I strongly suggest to use this as a baseline before going into others. While fees are also an obvious feature that have been discussed in the c-lighting pr at https://github.com/ElementsProject/lightning/pull/4771 I think the CLTV (or riskfactor) is actually not helpful (though c-lightning kept it (I guess for nostalgic reasons)). I am currently investigating how to add latency (however this has similar problems like the base fee in the sense that it has a jump from not using a channel to using the channel and thus makes computations of min cost flows expensive) Yet and as you saw on twitter there is indication that bee line distance might be a good proxy for latency especially the current beeline distance on the network is very poor from a latency perspective with many channels crossing oceans

geodistance beelineDistance

Let me know if you should have more questions.

jkczyz commented 2 years ago

Say we successfully route h_1 msats through a channel with capacity c, corresponding to case (4). My understanding is that the channel's capacity has been effectively reduced by h_1 when computing the success probability, representing less liquidity.

Say we fail to route h_2 msats through the same channel, corresponding to case (2). My understanding is that the channel's capacity has been effectively reduced to h_2 when computing success probability, representing a maximum possible liquidity.

Now after either of those scenarios, say routing h_3 msats fails at a downstream channel, corresponding to case (3). Does the success probability for future payments use h_1 (or h_2, respectively) for c in the formula (c+1-a)/(c+1-h)?

More generally, as more payments are successfully routed (or fail to route) through a channel, how does this affect the success probability calculation for the channel? IIUC, an implementation should keep track of the most liquidity c that it thinks a channel has (corresponding to adjustments of the capacity from cases (2) and (4)) and the largest amount h that it should be able to route through the channel (based on downstream failures from case (3)). And the latter must be reset to 0 whenever it exceeds the former.

Thus, ignoring the 0 and 1 success probability cases, the calculation would be (c+1-a)/(c+1-h), where again c is the adjusted capacity (i.e., the estimated liquidity). Does this sound right?

stefanwouldgo commented 2 years ago

In our code, we have modelled these issues using two extra variables per edge: minBalance and inFlight: https://github.com/renepickhardt/mpp-splitter/blob/a90e134e4bd39bd8496d047851a993b1393a1a9d/code/scala/mincostflow/heuristic/MinCostFlow.scala#L159

minBalance is the minimum balance that we know can be routed through this channel, and inFlight is what we know is already used of the balance (typically by an already locked-in htlc).

From these we derive effective amount and effective capacity values like this:

effAmt = max(0,inFlight+a-minBalance)
effCap = c+1-minBalance

and then calculate the probability as (effCap-effAmt)/effCap and finally take -log of this as cost function.

jkczyz commented 2 years ago

In our code, we have modelled these issues using two extra variables per edge: minBalance and inFlight: https://github.com/renepickhardt/mpp-splitter/blob/a90e134e4bd39bd8496d047851a993b1393a1a9d/code/scala/mincostflow/heuristic/MinCostFlow.scala#L159

minBalance is the minimum balance that we know can be routed through this channel, and inFlight is what we know is already used of the balance (typically by an already locked-in htlc).

Does inFlight include resolved HTLCs, too (i.e., case 4)? If not, are you accounting for them in calculating the effective capacity? By "in-flight", I typically think of an outstanding HTLC that we are waiting on for the preimage from the recipient.

From these we derive effective amount and effective capacity values like this:

effAmt = max(0,inFlight+a-minBalance)
effCap = c+1-minBalance

and then calculate the probability as (effCap-effAmt)/effCap and finally take -log of this as cost function.

What exactly does effAmt represent? And won't subtracting it from effCap cancel out minBalance?

effCap-effAmt
 = c+1-minBalance-(inFlight+a-minBalance)
 = c+1-minBalance-inFlight-a+minBalance
 = c+1-inFlight-a

FWIW, this is the formulation that I came up with:

if a > max_liquidity_msat {
    0.0
} else if a < min_liquidity_msat {
    1.0
} else {
    (max_liquidity_msat + 1 - a) / (max_liquidity_msat + 1 - min_liquidity_msat)
}

Where max_liquidity_msat is initialized to the channel capacity (i.e., case (1)), set to the last failed HTLC (i.e., case (2)), and reduced by resolved HTLCs (i.e., case (4)). And where min_liquidity_msat is set to the largest HTLC that failed downstream (i.e., case (3)) and reset to 0 whenever max_liquidity_msat falls below it. For simplicity, I'm not accounting for in-flight HTLCs as defined earlier.

This formula is the same as the one in case (3) where c is max_liquidity_msat and h is min_liquidity_msat. Comparing this to your code, you don't seem to be accounting for the knowledge learned from cases (2) and (4). Or maybe minBalance is conflating those cases with case (3)? Hard to tell. Let me know where I may be misunderstanding.

renepickhardt commented 2 years ago

Does inFlight include resolved HTLCs, too (i.e., case 4)?

In our code base in does not include resolved HTLCs as we just demonstrated the effectiveness for a single payment attampt using min cost flows. If I where to use such techniques as an LSP I would not forget the knowledge after one payment session and would also include resolved HTLCs to some degree for a certain amount of time.

What exactly does effAmt represent? And won't subtracting it from effCap cancel out minBalance?

look at point 3 below where I tried to address this.

FWIW, this is the formulation that I came up with:

if a > max_liquidity_msat {
    0.0
} else if a < min_liquidity_msat {
    1.0
} else {
    (max_liquidity_msat + 1 - a) / (max_liquidity_msat + 1 - min_liquidity_msat)
}

I think this is the essence of what I propose formulated in a different way. So the entire question boils down to one of accounting. Let me go one step back. In general we have a channel that looks like this:

0---------------------------x---------------------------c

this corresponds to the uncertainty that the actual liquidity is 0 <= x <= c and results in the success probability for an amount a of (c+1 - a)/(c+1)

If we now know something about the channel which can happen due to several circumstances (failed, successful HTLCs or unresolved HTLCs in flight) our uncertainty gets reduces and we can adopt the probabilities. which is exactly what you do by setting the probability to 0 if the payment is above the assumed liquidity and to 1 if it is blow the assumed existing liquidity.

so given some prior knowledge of min and max liquidity our channel looks like this:

0----------min-----------------x------------max---------------c

which corresponds to 0 <= min <= x <= max <= c

Like in your formula we have three cases for a payment of amount a we have three cases (the first two as already discussed)

  1. amount is blow min existing liquidity --> probability = 1
  2. amount is above max existing liquidity --> probability = 0
  3. the amount is somewhere between min and max which means that we have uncertainty again and we have to compute a uniform distribution in the reduced interval where the effectiveCapacity = max - min and the effective amount is only everything that is larger than min. With that setting we can use the typical formular that you have in your thrid case, that @stefanwouldgo provided and that is the title of the issue.

Where max_liquidity_msat is initialized to the channel capacity (i.e., case (1)), set to the last failed HTLC (i.e., case (2)), and reduced by resolved HTLCs (i.e., case (4)). And where min_liquidity_msat is set to the largest HTLC that failed downstream (i.e., case (3)) and reset to 0 whenever max_liquidity_msat falls below it. For simplicity, I'm not accounting for in-flight HTLCs as defined earlier.

As long as you do a single path payment there is no strong necessity to account for in-flight HTLCs. of course if you are an active LSP you might have several concurrent payments that you are handling and you use knowledge from various payment sessions. But if you are in a setting as @stefanwouldgo and I were where we compute a flow it happens quite often that we allocate several HTLCs to the same channel and we have to account for in-flight HTLCs.

From the perspective of an LSP it also makes sense to maintain the uncertainty network for more than one payment session and maintain those upper and lower bounds of where you are certain with respect to the existing liquidity.

This formula is the same as the one in case (3) where c is max_liquidity_msat and h is min_liquidity_msat.

This is also my understanding as I tried to lay out in my above comments

Comparing this to your code, you don't seem to be accounting for the knowledge learned from cases (2) and (4). Or maybe minBalance is conflating those cases with case (3)? Hard to tell. Let me know where I may be misunderstanding.

I think our way of tracking minBalance and inFlight conflates those cases accurately but of course accounting is always tricky and we might have screwed something up.

jkczyz commented 2 years ago

In our code base in does not include resolved HTLCs as we just demonstrated the effectiveness for a single payment attampt using min cost flows. If I where to use such techniques as an LSP I would not forget the knowledge after one payment session and would also include resolved HTLCs to some degree for a certain amount of time.

IIUC, your code base doesn't account for case (2) in addition to case (4). Ignoring in-flight HTLCs, the formula becomes:

(c+1-a) / (c+1-minBalance)

vs

(max_liquidity_msat + 1 - a) / (max_liquidity_msat + 1 - min_liquidity_msat)

In other words, it is using a capacity c that is not adjusted for HTLCs failed at the channel or for resolved HTLCs, cases (2) and (4) respectively, which is what I was asking about modeling.

renepickhardt commented 2 years ago

I discussed this with @stefanwouldgo and yes the linked code base does not take into consideration all of the mentioned cases. The code that we shared was mainly to demonstrate the feasibility of the min cost flow solver and probabilistic approach. We did not share all of our code from our integration of the uncertainty network in our mainnet test as we feared it was very hacky and ignoring many roadblocks that can happen in reality (e.g. other errors, htlc limits, channel reserves,...) and we did not want people to copy poorly written proof of concept code.

So I would suggest to follow the formulas here and the theory to derive the full result / integration for the general case.

Note two things:

  1. The two formulas above are the same if c is the effective capacity and minBalance is the effective min balance.
  2. The four cases that I initially described were just for the scenario where we have seen one event with 1 htlc in the past. In reality - even within a single payment session and especially when we have MPP and flows - we might have several (contradictonary) observations. In our implementation of the uncertainty network we have written code to account for this using the accounting tricks described above.

Let me know if you need more information / help and I am really looking forward to see / test your implementation in work!

jkczyz commented 2 years ago

@renepickhardt @stefanwouldgo I have this implemented in a draft PR (#1227) with an open question on the cost function: https://github.com/lightningdevkit/rust-lightning/pull/1227#discussion_r779693401.