lightningdevkit / rust-lightning

A highly modular Bitcoin Lightning library written in Rust. It's rust-lightning, not Rusty's Lightning!

Some thoughts on payment splitting heuristics for multi part payments #2276

Open renepickhardt opened 1 year ago

renepickhardt commented 1 year ago

This is based on some out-of-band communication with @TheBlueMatt:

In #1276 there was the demand to revisit the splitting heuristic of payments in LDK. Back then I argued that the splitting problem is exactly what would be solved through optimally reliable payment flows, which are proposed to be added in #1668. However, I think I have to withdraw or at least weaken some of my claims from before, as I will lay out in this issue.

The following originated from the comments by @TheBlueMatt on the mailing list:

While it's much less capital-efficient, the ability to over-commit upfront and then only allow the recipient to claim a portion of the total committed funds would substantially reduce the impact of failed HTLCs on payment latency. Of course the extra round-trip to request the "unlock keys" for the correct set of HTLCs adds a chunk to total latency, so senders will have to be careful about deciding when to do this or not.

Still, now that we have onion messages, we should do (well, try) this! It's not super complicated to implement (like everything it seems, the obvious implementation forgoes proof-of-payment, and like everything the obvious solution is PTLCs, I think). It's not clear to me how we get good data from trials, though; we'd need a sufficient set of the network to support this that we could actually test it, which is hard to get for a test.

Maybe someone (anyone?) wants to do some experiments doing simulations using real probing success rates to figure out how successful this would be and propose a concrete sender strategy that would improve success rates.

While digging into redundant overpayments I figured out several noteworthy things:

  1. Minimum cost flows may not yield the maximum expected value to be delivered. A counterexample can be seen at: https://github.com/renepickhardt/Maximum-Expected-Value-Flows-For-Redundant-Overpayments/blob/main/img/counterExample.png
  2. Redundant overpayments can be understood as: "At least k out of n payment parts need to successfully reach the recipient". Thus the method of Bernoulli trials can be applied (as originally suggested in the probabilistic path finding paper for non-redundant payments).
  3. While in the above mentioned paper I made poor / strange assumptions about the cost function and channel size to develop the theory, I have now observed that the only necessary condition for a Bernoulli trial is that the success probability of each path is (roughly) the same. This can be controlled for by defining a target probability and assigning the sats for each onion / path such that its estimated success probability reaches the target probability (a rough sketch follows below this list).
  4. These observations lead to a natural splitting heuristic which seems reasonable (in particular, but not only, for redundant overpayments) and useful in simulations, but I don't have the infrastructure or resources to test this on mainnet.
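
To make points 2 and 3 concrete, here is a minimal Python sketch. It assumes a uniform liquidity model, where the probability of an amount a fitting through a channel of capacity c is (c + 1 - a) / (c + 1); the helper names and the binary search are purely illustrative and not taken from pickhardtpayments:

```python
from math import comb

def path_success_probability(amount_sat, path_capacities_sat):
    """Uniform-liquidity estimate: P(amount fits) = prod((c + 1 - a) / (c + 1))."""
    p = 1.0
    for c in path_capacities_sat:
        if amount_sat > c:
            return 0.0
        p *= (c + 1 - amount_sat) / (c + 1)
    return p

def amount_for_target_probability(path_capacities_sat, target_p):
    """Largest amount whose estimated success probability still reaches target_p
    (binary search; illustrative only)."""
    lo, hi = 0, min(path_capacities_sat)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if path_success_probability(mid, path_capacities_sat) >= target_p:
            lo = mid
        else:
            hi = mid - 1
    return lo

def prob_at_least_k_of_n(n, k, p):
    """Bernoulli trials: probability that at least k of n equally likely parts succeed."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Example: three candidate paths, each tuned to ~90% success probability.
paths = [[5_000_000, 2_000_000], [3_000_000, 4_000_000], [1_000_000, 6_000_000]]
target = 0.9
amounts = [amount_for_target_probability(p, target) for p in paths]
print(amounts)
# If we send n = 3 redundant parts and need k = 2 to arrive:
print(prob_at_least_k_of_n(3, 2, target))
```

Because every part is tuned to the same target probability, the "at least k out of n" computation reduces to a single binomial sum, which is the whole point of point 3 above.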

Summary of the splitting algorithm

This is currently WIP and many choices are ad hoc, but it shows the principle. The following is a summary of: https://github.com/renepickhardt/pickhardtpayments/pull/36

We use Dijkstra to compute candidate paths as shortest paths with respect to our cost function:

  1. Use a cost function for a unit that is the combination of linearized routing cost and linearized uncertainty cost.
  2. Smooth the routing cost (currently Laplace smoothing with +100; the exact value needs to be found through tests and feature engineering).
  3. Thus: cost = (ppm + 100) * 1/capacity
  4. After a path is planned, increase the cost of each edge on it by a multiplier (see the sketch after this list).
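
A rough Python sketch of that candidate-path loop (not the pickhardtpayments code itself), using networkx for Dijkstra. The +100 Laplace smoothing comes from the list above; the multiplier of 2 and the example channels are placeholder assumptions:

```python
import networkx as nx

def build_graph(channels):
    """channels: iterable of (src, dst, ppm, capacity_sat).
    Edge cost combines linearized routing cost and linearized uncertainty cost:
    cost = (ppm + 100) / capacity."""
    g = nx.DiGraph()
    for src, dst, ppm, capacity in channels:
        g.add_edge(src, dst, cost=(ppm + 100) / capacity, capacity=capacity)
    return g

def candidate_paths(g, source, target, num_paths, multiplier=2.0):
    """Repeatedly take the cheapest path, then penalize its edges so the
    next Dijkstra run prefers (partially) disjoint alternatives."""
    paths = []
    for _ in range(num_paths):
        path = nx.shortest_path(g, source, target, weight="cost")  # Dijkstra
        paths.append(path)
        for u, v in zip(path, path[1:]):
            g[u][v]["cost"] *= multiplier
    return paths

channels = [
    ("A", "B", 50, 10_000_000), ("B", "D", 200, 5_000_000),
    ("A", "C", 10, 2_000_000), ("C", "D", 30, 8_000_000),
]
g = build_graph(channels)
print(candidate_paths(g, "A", "D", num_paths=2))
```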

With respect to the allocation of funds to paths:

The cost function favors paths with low routing cost, so the only question is how many sats to allocate. The following idea shall be used for motivation:

Comparison to existing splitters

Existing splitters operate by the following ideas

None of the above mentioned ideas start with reliability, and none can be used to derive service level objectives. With the method that I proposed, one can start with a clear service level objective and get single parts that all have the same success probability. Obviously the single parts can be derived with cost functions that focus more on fees, more on reliability, or any other feature that one may find desirable.

While this is not directly an issue, I thought I'd drop this here, as there was previously a discussion about splitting heuristics and, as mentioned, I have to withdraw some of my previous, too strong claims.

TheBlueMatt commented 1 year ago

Mmm, interesting, yea, that's nifty, thanks! We could probably do this by pushing the decision of how much to send over a channel into the scorer and then the router just does whatever the scorer says. From there we could configure the scorer to do the above calculation, or something else based on whatever its probability knob says.

One related thing is I want to move to a nonlinear probability distribution soon. Been playing with it a bit and absent a better idea will probably do a distribution of something like (a/c - 0.5)**4 (or 6 or 8, dunno which exponent yet), mostly because it's super easy to calculate the integral of in nice fast fp math (the ^4 one implies 3 fp multiplies to get a ^5 for the integral; the ^6 or ^8 ones are 4 fp multiplies). I'm not quite sure how to fit that into the above in a way that's fast but I also haven't thought deeply about it, and we could change to a different function as long as we can calculate it fast.
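
For concreteness, a quick Python sketch of that pdf and its closed-form integral (just the math above, not LDK's scorer code; the normalization constant and the helper names are mine):

```python
# pdf(x) ∝ (x - 0.5)**4 on x in [0, 1], where x = liquidity / capacity.
# The antiderivative of (x - 0.5)**4 is (x - 0.5)**5 / 5, so after normalizing
# the total mass to 1: CDF(x) = ((x - 0.5)**5 + 0.5**5) / (2 * 0.5**5).

def cdf(x):
    """CDF of the (x - 0.5)**4 pdf on [0, 1]; (x - 0.5)**5 via 3 fp multiplies."""
    t = x - 0.5
    t2 = t * t       # 1st multiply
    t4 = t2 * t2     # 2nd multiply
    t5 = t4 * t      # 3rd multiply
    return (t5 + 0.5**5) / (2 * 0.5**5)

def success_probability(amount_sat, capacity_sat):
    """P(channel liquidity >= amount) = 1 - CDF(amount / capacity)."""
    return 1.0 - cdf(amount_sat / capacity_sat)

print(success_probability(1_000_000, 4_000_000))  # amount = 25% of capacity
```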

renepickhardt commented 1 year ago

One related thing is I want to move to a nonlinear probability distribution soon. Been playing with it a bit and absent a better idea will probably do a distribution of something like (a/c - 0.5)**4

Dear @TheBlueMatt, while the question of the nonlinear probability function is unrelated to the main point of this issue, I thought I'd ask you for a review of ongoing work that I have just pushed to my git repo: https://github.com/renepickhardt/Maximum-Expected-Value-Flows-For-Redundant-Overpayments/blob/main/code/A%20Collection%20of%20Probabilistic%20Channel%20Models%20for%20Augmenting%20Maximum%20Expected%20Value%20Paths.ipynb In there I investigate 4 different probabilistic channel models for the liquidity distribution. While this is all early and ongoing work, I am pretty sure I can show arguments for why a bimodal model for the probability function (as suggested to be used by LND) is probably not too useful (even if it were closest to matching reality). I write this here as you suggested using even polynomials as bimodal nonlinear models for the pdf. I have suggested other models with drain that seem to have more desirable properties.

Disclaimer: While the notebook seems large, most of it is boilerplate code. The 4 models in sections 1.1 to 1.4 are each less than 30 lines of code without dependencies. So I'd be curious about your feedback and thoughts.

TheBlueMatt commented 1 year ago

Cool! That's really related to #2547 (which is indeed exponential), which I'd love to iterate on if you have a suggestion for a great function. One further constraint here is that we really want a function we can very efficiently compute the cdf for, which is most of the argument for exponential (which just requires a few doublings) rather than it being some optimally selected pdf. In incredibly informal tests it appears to work quite well when compared to a constant pdf, though I don't have a formal analysis to provide. I believe the Mutiny folks are starting to gather a bunch of data which we can use to analyze prospective pdfs.

renepickhardt commented 1 year ago

To be frank, I don't understand why quick computation of the function is important. Let me elaborate:

In my experience it is sufficient to know the function on a normalized channel of 100 sats capacity. Channels of any size can use a coordinate transformation to map to the 100 sats standard channel. Meaning I could have a lookup table for the 100 sats channel as long as my coordinate transformation is easy to compute (which should literally just be division by channel capacity).
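
Roughly what I mean, as an illustrative Python sketch (the table size, the placeholder uniform model, and the function names are just examples; any fitted model could be plugged in):

```python
# Precompute success probabilities on a normalized channel with 100 "slots".
TABLE_SIZE = 100

def build_table(success_probability_fn):
    """success_probability_fn(fraction) -> P(payment succeeds) for a payment
    using `fraction` of the channel capacity."""
    return [success_probability_fn(i / TABLE_SIZE) for i in range(TABLE_SIZE + 1)]

def lookup(table, amount_sat, capacity_sat):
    """Coordinate transformation: divide by capacity, then index the table."""
    fraction = min(amount_sat / capacity_sat, 1.0)
    return table[round(fraction * TABLE_SIZE)]

# Example model: uniform liquidity, P = 1 - fraction.
table = build_table(lambda f: 1.0 - f)
print(lookup(table, 2_500_000, 10_000_000))  # 0.75
```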

With respect to iterations and knowing a great function: I think with real world data one can easily test which of the models is most suitable and fit the model parameters. In fact, I would not fix a model but, with the ongoing stream of observations, always adapt and readjust the fitted model parameters.

TheBlueMatt commented 1 year ago

Sure, a lookup table is a perfectly reasonable way to make most functions fast, as long as we're normalizing it against the bounds of the liquidity and not doing something which involves the min+max bounds and the channel's capacity, which could start to blow up the parameters :).
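
Roughly what I mean, as a hypothetical Python sketch (not LDK's actual scorer, which keeps its own representation of the bounds): normalize the amount against the known lower/upper liquidity bounds before hitting the table, rather than against the raw capacity.

```python
def success_probability_with_bounds(table, amount_sat, min_liquidity_sat, max_liquidity_sat):
    """Normalize against the known liquidity bounds rather than the capacity.
    Illustrative only."""
    if amount_sat <= min_liquidity_sat:
        return 1.0  # we know at least this much liquidity is available
    if amount_sat > max_liquidity_sat:
        return 0.0  # we know the amount cannot fit
    fraction = (amount_sat - min_liquidity_sat) / (max_liquidity_sat - min_liquidity_sat)
    return table[round(fraction * (len(table) - 1))]

table = [1.0 - i / 100 for i in range(101)]  # placeholder model, as above
print(success_probability_with_bounds(table, 3_000_000, 1_000_000, 9_000_000))
```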

With respect to iterations and knowing a great function: I think with real world data one can easily test which of the models is most suitable and fit the model parameters. In fact, I would not fix a model but, with the ongoing stream of observations, always adapt and readjust the fitted model parameters.

Yep, I just don't have real world data today, at least not in a reasonable format. I think the Mutiny folks are starting to gather such data and we can fit to that later.

renepickhardt commented 1 year ago

In the above mentioned notebook I have, in contradiction to what I explained in the initial post, extended the algorithm to allocate sats on a path such that the allocated amount maximizes the expected number of delivered sats (instead of having each partial payment reach a certain probability threshold, as initially suggested).

While LDK with default settings seems to select 25% of the minimum capacity along the candidate path, and Eclair seems to choose randomly between 20% and 100% of that bottleneck (which statistically is 60% on average), preliminary results in a simulated setting indicate that LDK splits a bit too aggressively while Eclair does not split aggressively enough.
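
For concreteness, here is a Python sketch comparing the three allocation rules under the same uniform liquidity assumption used earlier (the probability model, the brute-force argmax, and the example path are simplifications for illustration, not the actual LDK, Eclair, or notebook code):

```python
import random

def path_success_probability(amount_sat, path_capacities_sat):
    """Uniform-liquidity estimate, as in the earlier sketch."""
    p = 1.0
    for c in path_capacities_sat:
        if amount_sat > c:
            return 0.0
        p *= (c + 1 - amount_sat) / (c + 1)
    return p

def ldk_style_allocation(path_capacities_sat):
    return min(path_capacities_sat) // 4            # 25% of the bottleneck

def eclair_style_allocation(path_capacities_sat):
    return int(min(path_capacities_sat) * random.uniform(0.2, 1.0))

def max_expected_value_allocation(path_capacities_sat, step_sat=10_000):
    """Amount maximizing a * P(a); brute force over a coarse grid for clarity."""
    best_amount, best_ev = 0, 0.0
    bottleneck = min(path_capacities_sat)
    for a in range(step_sat, bottleneck + 1, step_sat):
        ev = a * path_success_probability(a, path_capacities_sat)
        if ev > best_ev:
            best_amount, best_ev = a, ev
    return best_amount

path = [4_000_000, 6_000_000, 3_000_000]
print(ldk_style_allocation(path), eclair_style_allocation(path),
      max_expected_value_allocation(path))
```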

In the very preliminary diagram you can see how the three allocation methods for splitting payments (LDK, Eclair, and maximum expected value allocation) compare in terms of the cumulative number of successful payments. The experiments are on the same set of payment pairs, amounts, and liquidity graph. (The exact way they were chosen is described in the soon to be published paper; payment amounts were in the 100k to 10M sats range.)

[Screenshot 2023-11-22: cumulative number of successful payments for the three allocation methods (LDK, Eclair, maximum expected value)]

We can see that Eclair has the most successful payments when payment success with 4 or fewer partial payments is possible, which was true for roughly 60% of our test set. LDK overall needs more shards to deliver a similar number of payments as Eclair, which makes sense given that LDK allocates smaller amounts. However, the maximum expected value method can overall deliver more payments (and uses on average more shards than LDK but fewer than Eclair) and thus seems to find a sweet spot.

As the method of estimating the maximum expected value of a path is very easy to implement and almost as simple to compute as 25% of the bottleneck, I would be very delighted to collaborate on some mainnet A/B testing for the final publication.

tnull commented 1 year ago

In the above mentioned notebook I have, in contradiction to what I explained in the initial post, extended the algorithm to allocate sats on a path such that the allocated amount maximizes the expected number of delivered sats (instead of having each partial payment reach a certain probability threshold, as initially suggested).

Thanks for looking into this!

While LDK with default settings seems to select 25% of the minimum capacity along the candidate path, and Eclair seems to choose randomly between 20% and 100% of that bottleneck (which statistically is 60% on average), preliminary results in a simulated setting indicate that LDK splits a bit too aggressively while Eclair does not split aggressively enough.

Hm, I think generally we may want to double-check our MPP logic a bit more fundamentally in the future, or at least give other models a shot which we can then A/B test against. However, as a short-term improvement, we could consider reverting https://github.com/lightningdevkit/rust-lightning/pull/1610/commits/ff8d3f7ba454b957a22f80f2ccb1dcef81e470e5, i.e., go back to defaulting to 50% channel saturation?

@renepickhardt Would your simulations indicate that this is preferable over 25%?

renepickhardt commented 1 year ago

I can certainly run the experiments at some point in time with different percentages. But to be fair: I am using synthetic data in a simulated environment anyway, and a lot of this depends on the choice of the probability distribution for the liquidity of the synthetic data.

As you may have seen, I provided a rough patch for Eclair that shows how easy it is to compute the allocation for a partial payment based on the maximum expected value instead of a fixed or random value. That patch is less than 10 lines of code. I assume it would be similarly easy to do in LDK, as you should also have all the necessary data available at the moment when you compute the allocation.

Thus I think it might actually be less effort to just write a similar patch for LDK and confirm the better performance compared to the current scheme on mainnet, instead of guessing / fine-tuning a parameter for a scheme that may be replaced eventually anyway.

TheBlueMatt commented 1 year ago

Nice! After #2551 lands (hopefully this week), we should think about letting the scorer, rather than the router, make MPP splitting decisions. This would both simplify the router, letting us more easily experiment here, and give us potentially more information when we make those decisions, especially around current available liquidity.