Closed by TheBlueMatt 2 years ago
While you address some questions that are not related to liquidity (e.g. a temporary channel failure can mean the peer is offline), I think it is reasonable to look at the liquidity case first. The updates of the probabilities in the liquidity case are so stark that they should cover the peer-offline case reasonably well in most cases (unless we really run into sending all the sats that the network might support, which is a very hard edge case anyway, in which we would probably just open a new channel).
Thus I think most of the questions and wishes that you are addressing here (in particular your second bullet point) are explained for an arbitrary probability function in the probabilistic payment delivery paper, and spelled out explicitly for the uniform case that we observed on the network in the optimally reliable and cheap payment flows paper. As papers like those are always written in a somewhat special and technical way, I will take the time to walk you through the main results that address your concerns.
Let's start with a bit of notation. For the sake of shorter text I will not be as formal as in those papers and refer to the respective sections for details. Also, since this is not a paper but just a quick summary, I don't guarantee that everything I write here is free of typos, so if in doubt double-check with the papers, which have been proofread by more than just my eyes.
Chapter 3 of the first paper states that we can model the uncertainty of the liquidity in a channel (aka the channel balance) as a random variable X with a probability distribution P. We define the failure probability of a payment of amount a over a channel as P(X < a) (the probability that the random variable takes a value smaller than a, i.e. the likelihood that the local balance is smaller than a, in which case the payment can't be forwarded). We note as future work that of course other features like connectivity, uptime, etc. could be modeled probabilistically, and one would then go to the joint distribution.

As mentioned before, all mainnet experiments hint that we can assume a uniform distribution for the uncertainty of the liquidity / channel balance, in which case the failure probability for a channel of capacity c can be written as P(X < a) = a/(c+1). Since a success occurs exactly if there is no failure, we can write the success probability as P(X >= a) = 1 - P(X < a) = 1 - a/(c+1) = (c+1-a)/(c+1). Again, this is the uniform case, which is reasonable to choose unless we have other data, indications, or knowledge.
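As a quick sanity check, the uniform success probability can be sketched in a few lines of Python (the function name is mine, not from the papers):

```python
def uniform_success_probability(a: int, c: int) -> float:
    """P(X >= a) for a channel of capacity c, assuming the unknown
    liquidity X is uniformly distributed on the integers 0..c."""
    if a > c:
        return 0.0
    return (c + 1 - a) / (c + 1)
```

For example, a payment of half the capacity succeeds with probability just over one half, and tiny payments over huge channels succeed almost surely.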
In Section 2.2 of the second paper we introduce a term that we should already have used in the first paper: the Uncertainty Network. While gossip gives us the Channel Graph (which we could and should already understand as a directed graph), we can use the Channel Graph to create the Uncertainty Network, which is certainly a directed graph. For this we assign a probability distribution and random variable to each edge of the Channel Graph.
Let's ignore for a moment the fact that we should forget learned information over time and start to ask the question: what can we learn from a failed or successful payment attempt? This is discussed in very general terms with entropy considerations and information gain (Kullback-Leibler divergence) in Section 4.3 of the first paper, and very specifically with respect to how to update the Uncertainty Network at the end of Chapter 3 in the second paper.
Let's say a payment delivery algorithm selects a route and wants to add an HTLC with amount h to a channel of capacity c. There are several cases that we can observe to update our uncertainty about the network:

1. We have no knowledge about the channel yet. The success probability of a payment of amount a stays at the prior P(X >= a) = (c+1-a)/(c+1).

2. The HTLC failed at this channel. We learn that the channel of capacity c has at most h-1 satoshi in it. In terms of probability theory we can ask ourselves: what is the success rate for a subsequent payment of size a given the event X < h (the HTLC of amount h has failed)? This can be expressed as P(X >= a | X < h). Here we see, because of the condition, that for a >= h the probability has to be 0. Computing the conditional probability for the remainder in the uniform case we see P(X >= a | X < h) = (h+1-a)/(h+1). Note that this is the same as the success probability for a channel of capacity h.

3. The HTLC was forwarded by this channel but failed further downstream. We learn that the channel of capacity c was able to route h satoshi and has at least h satoshi as local balance. This means that P(X >= a | X >= h) = 1 for all amounts a <= h, and in the uniform case for all larger values of a the conditional probability materializes to P(X >= a)/P(X >= h) = ((c+1-a)/(c+1)) / ((c+1-h)/(c+1)) = (c+1-a)/(c+1-h). Note that this fraction is never larger than 1, as a >= h.

4. The HTLC of amount h is in flight (or has settled) on this channel. For an additional payment of amount a over the same channel we ask for P(X >= a + h | X >= h), which in the uniform case materializes to ((c-h)+1-a)/((c-h)+1), i.e. the channel behaves like one of capacity c-h.
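To make the four cases concrete, here is a small Python sketch (function names are mine; the formulas are the uniform-case ones derived above, with case (2) mirroring the success probability of a channel of capacity h as stated):

```python
def p_prior(a, c):
    # Case (1): no knowledge yet, P(X >= a) = (c+1-a)/(c+1).
    return max(0, c + 1 - a) / (c + 1)

def p_after_local_failure(a, c, h):
    # Case (2): an HTLC of amount h failed at this channel, i.e. X < h.
    # For a >= h the probability is 0; otherwise the channel behaves
    # like one of capacity h: (h+1-a)/(h+1).
    if a >= h:
        return 0.0
    return (h + 1 - a) / (h + 1)

def p_after_downstream_failure(a, c, h):
    # Case (3): the HTLC of amount h was forwarded but failed downstream,
    # i.e. X >= h. Certain success for a <= h, otherwise (c+1-a)/(c+1-h).
    if a <= h:
        return 1.0
    return max(0, c + 1 - a) / (c + 1 - h)

def p_with_htlc_in_flight(a, c, h):
    # Case (4): an HTLC of amount h is in flight; for a further amount a,
    # P(X >= a+h | X >= h) = ((c-h)+1-a)/((c-h)+1).
    return p_prior(a, c - h)
```

Note how case (4) literally reduces to the prior of a channel whose capacity is c - h.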
Of course all those thoughts generalize more or less easily (at least in principle it is easy, just technical) to cases where more than one HTLC is in flight and/or has been returned or not, and so on. @stefanwouldgo and I have created a more or less stable implementation of this in our mainnet tests of optimally reliable payment flows, so I can tell you it is certainly possible.
I think measuring a history of successful payments similar to what mission control does is reasonable in general (I think that is what you addressed in your first bullet point). I have a half-finished paper on how to count this properly lying in my drawer. The gist is to do the following two things.
My feeling (!) is that the provenance scores might be interesting for LSPs but not really for end users' wallets. Also I would assume the effect is negligible in comparison to the uncertainty arising from the liquidity.
This is something that we mention as future work in our papers. Basically we need a way to transform the updated probability distribution back to our fully uncertain distribution. This can be done by decaying the probed / tested / learned amounts exponentially over time towards zero (the speed of the decay would have to be measured empirically), and obviously one could also do this linearly instead of exponentially. I would have loved to do research about this, as I think it is necessary and an easy apple to pick. In general I suggest looking at homotopy to learn how to transform the updated probabilities back to the original probabilities.
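For illustration only, one such decay step might look like this in Python (the exponential form and the half-life parameter are assumptions to be tuned empirically, not results from the papers):

```python
def decay_bounds(min_liq, max_liq, capacity, elapsed, half_life):
    """Relax learned liquidity bounds back towards full uncertainty
    [0, capacity]. The learned offsets shrink exponentially: after one
    half-life, half of the learned information is forgotten; after many
    half-lives, the fully uncertain prior is restored."""
    f = 0.5 ** (elapsed / half_life)
    return min_liq * f, capacity - (capacity - max_liq) * f
```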
Of course it is needless to say that instead of maximizing the probabilities, which are multiplicative along all channels of a (multi)path, we can switch to the negative logarithms of the probabilities and minimize those, as they are additive across all involved channels.
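A quick Python check of that equivalence (the per-channel probabilities are made-up values):

```python
import math

channel_probs = [0.9, 0.8, 0.5]  # made-up per-channel success probabilities

# Multiplying probabilities along the path...
path_prob = math.prod(channel_probs)

# ...is equivalent to summing the per-channel costs -log(p),
# which is the additive form that shortest-path / min-cost solvers need.
path_cost = sum(-math.log(p) for p in channel_probs)
```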
The main strength of this framework is that it avoids making arbitrary choices like "introduce a bias if a payment is smaller or larger than a previous attempt". Here we just update the uncertainty in the information-theoretically correct sense, and the formulas are actually wickedly easy for the uniform case.
Also, all experiments done so far indicate that the uncertainty of liquidity is by far the most dominant feature, so I strongly suggest using this as a baseline before going into others. While fees are also an obvious feature that has been discussed in the c-lightning PR at https://github.com/ElementsProject/lightning/pull/4771, I think the CLTV delta (or riskfactor) is actually not helpful (though c-lightning kept it, I guess for nostalgic reasons). I am currently investigating how to add latency; however, this has a similar problem to the base fee, in the sense that there is a jump from not using a channel to using the channel, which makes computations of min cost flows expensive. Yet, as you saw on Twitter, there is indication that beeline distance might be a good proxy for latency, especially since the current beeline distance on the network is very poor from a latency perspective, with many channels crossing oceans.
Let me know if you have more questions.
Say we successfully route h_1 msats through a channel with capacity c, corresponding to case (4). My understanding is that the channel's capacity has been effectively reduced by h_1 when computing the success probability, representing less liquidity.

Say we fail to route h_2 msats through the same channel, corresponding to case (2). My understanding is that the channel's capacity has been effectively reduced to h_2 when computing the success probability, representing a maximum possible liquidity.

Now after either of those scenarios, say routing h_3 msats fails at a downstream channel, corresponding to case (3). Does the success probability for future payments use h_1 (or h_2, respectively) for c in the formula (c+1-a)/(c+1-h)?

More generally, as more payments are successfully routed (or fail to route) through a channel, how does this affect the success probability calculation for the channel? IIUC, an implementation should keep track of the most liquidity c that it thinks a channel has (corresponding to adjustments of the capacity from cases (2) and (4)) and the largest amount h that it should be able to route through the channel (based on downstream failures from case (3)). And the latter must be reset to 0 whenever it exceeds the former.

Thus, ignoring the 0 and 1 success probability cases, the calculation would be (c+1-a)/(c+1-h), where again c is the adjusted capacity (i.e., the estimated liquidity). Does this sound right?
In our code, we have modelled these issues using two extra variables per edge: minBalance and inFlight: https://github.com/renepickhardt/mpp-splitter/blob/a90e134e4bd39bd8496d047851a993b1393a1a9d/code/scala/mincostflow/heuristic/MinCostFlow.scala#L159
minBalance is the minimum balance that we know can be routed through this channel, and inFlight is what we know is already used of the balance (typically by an already locked-in htlc).
From these we derive effective amount and effective capacity values like this:

```
effAmt = max(0, inFlight + a - minBalance)
effCap = c + 1 - minBalance
```

and then calculate the probability as `(effCap - effAmt)/effCap`, and finally take -log of this as the cost function.
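Restating that in Python (same variable names as the Scala code; just a sketch of the formulas above):

```python
import math

def channel_cost(a, c, min_balance, in_flight):
    """-log of the success probability for adding an HTLC of amount a to a
    channel of capacity c, given a known lower bound min_balance on its
    liquidity and an amount in_flight already committed to earlier HTLCs."""
    eff_amt = max(0, in_flight + a - min_balance)
    eff_cap = c + 1 - min_balance
    return -math.log((eff_cap - eff_amt) / eff_cap)
```

With no knowledge (min_balance = in_flight = 0) this reduces to -log((c+1-a)/(c+1)), the cost under the fully uncertain prior.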
> In our code, we have modelled these issues using two extra variables per edge: minBalance and inFlight: https://github.com/renepickhardt/mpp-splitter/blob/a90e134e4bd39bd8496d047851a993b1393a1a9d/code/scala/mincostflow/heuristic/MinCostFlow.scala#L159
>
> minBalance is the minimum balance that we know can be routed through this channel, and inFlight is what we know is already used of the balance (typically by an already locked-in htlc).
Does inFlight include resolved HTLCs, too (i.e., case 4)? If not, are you accounting for them in calculating the effective capacity? By "in-flight", I typically think of an outstanding HTLC that we are waiting on for the preimage from the recipient.
> From these we derive effective amount and effective capacity values like this:
> `effAmt = max(0, inFlight + a - minBalance)`, `effCap = c + 1 - minBalance`
> and then calculate the probability as `(effCap - effAmt)/effCap`, and finally take -log of this as the cost function.
What exactly does effAmt represent? And won't subtracting it from effCap cancel out minBalance?

```
effCap - effAmt = c + 1 - minBalance - (inFlight + a - minBalance)
                = c + 1 - minBalance - inFlight - a + minBalance
                = c + 1 - inFlight - a
```
FWIW, this is the formulation that I came up with:

```rust
if a > max_liquidity_msat {
    0.0
} else if a < min_liquidity_msat {
    1.0
} else {
    (max_liquidity_msat + 1 - a) / (max_liquidity_msat + 1 - min_liquidity_msat)
}
```
Where max_liquidity_msat is initialized to the channel capacity (i.e., case (1)), set to the last failed HTLC (i.e., case (2)), and reduced by resolved HTLCs (i.e., case (4)). And where min_liquidity_msat is set to the largest HTLC that failed downstream (i.e., case (3)) and reset to 0 whenever max_liquidity_msat falls below it. For simplicity, I'm not accounting for in-flight HTLCs as defined earlier.
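In Python (with float division made explicit, since in an integer-typed language the division above would truncate), my reading of that formulation is:

```python
def success_probability(a, min_liquidity_msat, max_liquidity_msat):
    # Impossible above the learned upper bound, certain below the learned
    # lower bound, otherwise the case (3) formula with c = max and h = min.
    if a > max_liquidity_msat:
        return 0.0
    if a < min_liquidity_msat:
        return 1.0
    return (max_liquidity_msat + 1 - a) / (max_liquidity_msat + 1 - min_liquidity_msat)
```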
This formula is the same as the one in case (3) where c is max_liquidity_msat and h is min_liquidity_msat. Comparing this to your code, you don't seem to be accounting for the knowledge learned from cases (2) and (4). Or maybe minBalance is conflating those cases with case (3)? Hard to tell. Let me know where I may be misunderstanding.
> Does inFlight include resolved HTLCs, too (i.e., case 4)?
In our code base it does not include resolved HTLCs, as we just demonstrated the effectiveness for a single payment attempt using min cost flows. If I were to use such techniques as an LSP, I would not forget the knowledge after one payment session and would also include resolved HTLCs to some degree for a certain amount of time.
> What exactly does effAmt represent? And won't subtracting it from effCap cancel out minBalance?
Look at point 3 below, where I tried to address this.
> FWIW, this is the formulation that I came up with:
>
> `if a > max_liquidity_msat { 0.0 } else if a < min_liquidity_msat { 1.0 } else { (max_liquidity_msat + 1 - a) / (max_liquidity_msat + 1 - min_liquidity_msat) }`
I think this is the essence of what I propose, formulated in a different way. So the entire question boils down to one of accounting. Let me go one step back. In general we have a channel that looks like this:

```
0---------------------------x---------------------------c
```

This corresponds to the uncertainty that the actual liquidity is 0 <= x <= c, and results in the success probability for an amount a of (c+1-a)/(c+1).

If we now know something about the channel, which can happen due to several circumstances (failed HTLCs, successful HTLCs, or unresolved HTLCs in flight), our uncertainty gets reduced and we can adapt the probabilities. This is exactly what you do by setting the probability to 0 if the payment is above the assumed liquidity and to 1 if it is below the assumed existing liquidity.
So given some prior knowledge of min and max liquidity, our channel looks like this:

```
0----------min-----------------x------------max---------------c
```

which corresponds to 0 <= min <= x <= max <= c. Like in your formula, we have three cases for a payment of amount a (the first two as already discussed). The effective capacity is effectiveCapacity = max - min, and the effective amount is only everything that is larger than min. With that setting we can use the typical formula that you have in your third case, that @stefanwouldgo provided, and that is the title of the issue. Where
> max_liquidity_msat is initialized to the channel capacity (i.e., case (1)), set to the last failed HTLC (i.e., case (2)), and reduced by resolved HTLCs (i.e., case (4)). And where min_liquidity_msat is set to the largest HTLC that failed downstream (i.e., case (3)) and reset to 0 whenever max_liquidity_msat falls below it. For simplicity, I'm not accounting for in-flight HTLCs as defined earlier.
As long as you do a single-path payment, there is no strong necessity to account for in-flight HTLCs. Of course, if you are an active LSP you might have several concurrent payments that you are handling, and you use knowledge from various payment sessions. But if you are in a setting like @stefanwouldgo and I were, where we compute a flow, it happens quite often that we allocate several HTLCs to the same channel, and then we have to account for in-flight HTLCs.
From the perspective of an LSP it also makes sense to maintain the uncertainty network for more than one payment session and maintain those upper and lower bounds of where you are certain with respect to the existing liquidity.
> This formula is the same as the one in case (3) where c is max_liquidity_msat and h is min_liquidity_msat.
This is also my understanding, as I tried to lay out in my comments above.
> Comparing this to your code, you don't seem to be accounting for the knowledge learned from cases (2) and (4). Or maybe minBalance is conflating those cases with case (3)? Hard to tell. Let me know where I may be misunderstanding.
I think our way of tracking minBalance and inFlight conflates those cases accurately, but of course accounting is always tricky and we might have screwed something up.
> In our code base it does not include resolved HTLCs, as we just demonstrated the effectiveness for a single payment attempt using min cost flows. If I were to use such techniques as an LSP, I would not forget the knowledge after one payment session and would also include resolved HTLCs to some degree for a certain amount of time.
IIUC, your code base doesn't account for case (2) in addition to case (4). Ignoring in-flight HTLCs, the formula becomes:

`(c + 1 - a) / (c + 1 - minBalance)`

vs

`(max_liquidity_msat + 1 - a) / (max_liquidity_msat + 1 - min_liquidity_msat)`

In other words, it is using a capacity c that is not adjusted for HTLCs failed at the channel or for resolved HTLCs, cases (2) and (4) respectively, which is what I was asking about modeling.
I discussed this with @stefanwouldgo, and yes, the linked code base does not take into consideration all of the mentioned cases. The code that we shared was mainly meant to demonstrate the feasibility of the min cost flow solver and the probabilistic approach. We did not share all of our code from our integration of the uncertainty network in our mainnet tests, as we feared it was very hacky and ignored many roadblocks that can happen in reality (e.g. other errors, HTLC limits, channel reserves, ...), and we did not want people to copy poorly written proof-of-concept code. So I would suggest following the formulas here and the theory to derive the full result / integration for the general case.
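One possible shape for that bookkeeping, as a hypothetical Python sketch (this is not the code from our mainnet tests; the names and the reset policy are my own simplifications of the cases discussed above):

```python
class ChannelEstimate:
    """Liquidity bounds for one channel direction of a given capacity."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.min_liquidity = 0          # raised by downstream failures, case (3)
        self.max_liquidity = capacity   # lowered by local failures, case (2)
        self.in_flight = 0              # locked-in HTLCs, case (4)

    def on_local_failure(self, h):
        # The channel could not forward h: new upper bound on liquidity.
        self.max_liquidity = min(self.max_liquidity, h)
        if self.min_liquidity > self.max_liquidity:
            self.min_liquidity = 0      # stale lower bound, reset it

    def on_downstream_failure(self, h):
        # The channel forwarded h but a later hop failed: new lower bound.
        self.min_liquidity = max(self.min_liquidity, h)

    def on_htlc_locked_in(self, h):
        self.in_flight += h

    def success_probability(self, a):
        # In-flight amounts shift both bounds down before applying the
        # (max + 1 - a) / (max + 1 - min) formula.
        lo = max(0, self.min_liquidity - self.in_flight)
        hi = self.max_liquidity - self.in_flight
        if a > hi:
            return 0.0
        if a < lo:
            return 1.0
        return (hi + 1 - a) / (hi + 1 - lo)
```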
Note two things: c is the effective capacity, and minBalance is the effective min balance. Let me know if you need more information / help. I am really looking forward to seeing / testing your implementation at work!
@renepickhardt @stefanwouldgo I have this implemented in a draft PR (#1227) with an open question on the cost function: https://github.com/lightningdevkit/rust-lightning/pull/1227#discussion_r779693401.
In 0.0.103 we landed a pretty naive scorer to avoid re-routing over the same channel over and over again on retries. #1166 adds a bit better scoring, but we've still got a ways to go. We should probably, at least, … `Scorer`s across instances, which may lead to wanting to use the same channel in opposite directions. Ideally we'd consider the reason before deciding whether to punish both sides of a channel or only one, but sadly "temporary_channel_failure" can mean both "peer offline" and "amount too large", and those are the two most common reasons. Not sure how to progress here, but it may require some experimentation.