lightningdevkit / rust-lightning

A highly modular Bitcoin Lightning library written in Rust. It's rust-lightning, not Rusty's Lightning!

Revisit MPP splitting heuristic #1276

Closed TheBlueMatt closed 1 year ago

TheBlueMatt commented 2 years ago

We tend to be pretty conservative and avoid splitting payments quite a bit. That's fine, but we can probably be a bit smarter about when we split. Once we land #1227 we can maybe also adapt the scoring API to return more information, including things like "you shouldn't send more than N over this channel". One way to accomplish that may be to have a RouterState enum passed to the scorer that is first set to RouterState::ConservativeFirstPass, where the scorer can be very aggressive, refusing to route more than a small % of most channels' guessed capacity. If that pass finds a route we can still take the happy path and return it immediately; if not, we can do the later passes with a different RouterState and then do...something to combine the paths selected across the different conservatism scales.
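A rough sketch of what that two-pass idea could look like. `RouterState`, `max_fraction`, and `allows` are illustrative names for this comment's proposal, not existing rust-lightning API, and the fractions are made up:

```rust
// Hypothetical sketch of the two-pass scoring idea; the names and the
// capacity fractions are illustrative, not actual rust-lightning API.
#[derive(Clone, Copy)]
enum RouterState {
    /// First pass: refuse to route more than a small fraction of a
    /// channel's guessed capacity.
    ConservativeFirstPass,
    /// Later passes: allow larger fractions of capacity.
    AggressiveRetry,
}

/// Maximum fraction of a channel's estimated capacity the scorer
/// would allow in a given pass (illustrative numbers).
fn max_fraction(state: RouterState) -> f64 {
    match state {
        RouterState::ConservativeFirstPass => 0.1,
        RouterState::AggressiveRetry => 0.5,
    }
}

/// Would the scorer let `amount_msat` go over a channel with
/// `capacity_msat` of guessed capacity in this pass?
fn allows(state: RouterState, amount_msat: u64, capacity_msat: u64) -> bool {
    (amount_msat as f64) <= max_fraction(state) * (capacity_msat as f64)
}

fn main() {
    // A 25%-of-capacity payment fails the conservative pass but would
    // be accepted by the aggressive retry pass.
    assert!(!allows(RouterState::ConservativeFirstPass, 250_000, 1_000_000));
    assert!(allows(RouterState::AggressiveRetry, 250_000, 1_000_000));
}
```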

One thing I don't want to have is the state a lot of other nodes appear to have where they just always split all the time, which feels counter-productive and can cause other issues.

cc @jkczyz

renepickhardt commented 2 years ago

This is a very interesting issue and I will watch closely to see what results you arrive at. Obviously the best split is computed via a min cost flow, but I assume you want to avoid this, as min cost flow computations are more expensive than candidate path generation via Dijkstra... That being said, I will leave you with a few high-level observations and links to papers that might be helpful.

According to formula (8) in our paper on probabilistic path finding, the expected number of attempts E[n] necessary to deliver k parts is E[n] = k/s, where s is the path success probability. Unfortunately s is not constant across paths and also changes with the amount you allocate to a path, which seems to make the formula unusable. (This is actually the reason why we arrived at the formulation via min cost flows in the follow-up paper.) However, there are still a few useful things we can learn from it.

  1. If you take the full amount, compute the shortest path on the uncertainty network, and compute its probability s, you know that you will need E[n] = 1/s attempts without splitting.
  2. Assuming you can identify the two most likely candidate paths (again, this can be tricky as they might change with the amount) with probabilities s_1(a_1) and s_2(a_2), where a_1 + a_2 = a is the full amount to be sent, you can compare E[n] = 1/s(a), the expected number of attempts without splitting, with E[n] = 1/s_1(a_1) + 1/s_2(a_2), the expected number necessary with 2 splits (consistent with E[n] = k/s when s_1 = s_2 = s). Similarly you can do it with 3 splits and so on. We have done this in formulas (16) and (17) of the probabilistic path finding paper under strongly simplified assumptions (e.g. all channels are of equal size and we split the payment into equally sized chunks) to arrive at a closed analytical formula that tells us into how many parts we wish to split.
  3. Using the method of Lagrange multipliers, with two candidate paths you can actually compute the exact split allocation between the most likely and the second most likely path. This also works with 3, 4, and more candidate paths. The only problem is that you have to compute zeros of polynomials of increasing degree, which cannot be done analytically and also very quickly becomes numerically infeasible/unstable. The problem with this approach, besides the numerical instability, is that the actual optimal splits will also not be disjoint paths. And if you formulate the problem with non-disjoint paths, you have actually written down exactly the min cost flow problem you are trying to avoid, which can be solved without looking at those polynomials.
  4. Our mainnet experiments with actual min cost flow solvers showed that we split into amounts that vary over several orders of magnitude, meaning some payments over large channels will be larger than some payments via longer routes or smaller channels. This indicates that any splitting heuristic restricted to disjoint paths is actually pretty far away from the optimal solution.
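To make point 2 concrete, here is a toy calculation under the simplified assumptions above: equal-sized channels, equal-sized chunks, the uniform-balance approximation that a channel of capacity c delivers amount a with probability roughly 1 - a/c, and each chunk retried independently so E[n] is the sum of 1/s_i. All of this is my own illustrative sketch, not the papers' exact model:

```rust
// Toy expected-attempts comparison for splitting vs. not splitting,
// under strong simplifying assumptions (see lead-in).
fn channel_success(amount: f64, capacity: f64) -> f64 {
    // Uniform-balance approximation: p ~ 1 - a/c, clamped at zero.
    (1.0 - amount / capacity).max(0.0)
}

/// Success probability of a path of `hops` equal channels.
fn path_success(amount: f64, capacity: f64, hops: u32) -> f64 {
    channel_success(amount, capacity).powi(hops as i32)
}

/// Expected attempts to deliver `parts` equal chunks, retrying each
/// chunk independently until it succeeds: E[n] = parts / s(chunk).
fn expected_attempts(parts: u32, amount: f64, capacity: f64, hops: u32) -> f64 {
    let s = path_success(amount / parts as f64, capacity, hops);
    parts as f64 / s
}

fn main() {
    // Sending half a channel's capacity over 3-hop paths:
    let unsplit = expected_attempts(1, 0.5, 1.0, 3); // 1 / 0.5^3 = 8
    let split2 = expected_attempts(2, 0.5, 1.0, 3);  // 2 / 0.75^3 ~ 4.74
    let split8 = expected_attempts(8, 0.5, 1.0, 3);  // 8 / 0.9375^3 ~ 9.71
    // Splitting in two helps, but over-splitting costs more attempts
    // again -- matching the "don't always split" concern above.
    assert!(split2 < unsplit);
    assert!(split8 > split2);
}
```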

Given all that, I guess the two most promising approaches are the following.

  1. Go down the road from the second point of the enumeration above and try to approximate the path probabilities of your splits. Again, as you don't actually know the optimal split amounts, your approximation might be inaccurate, but it might still lead to some heuristic. You could at least start with splits into equal chunks to get a feeling for when the expected number of attempts starts increasing again.
  2. Take a look at the smallest channel capacity of a reasonable path and look at the distance. Now say something like "I always want to have 90% channel success probability", which means you can use 10% of the channel capacity. Given a 5-hop path and at least 90% channel success probability, your overall path probability would be 0.9^5 = 0.59, leading to 1/0.59 = 1.69 attempts on average. Something like this could yield a reasonable first split amount, and the others could be derived from there.
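The second approach fits in a few lines. This is an illustrative sketch with made-up names, again assuming the uniform-balance model where sending amount a over a channel of capacity c succeeds with probability roughly 1 - a/c:

```rust
// Sketch of the "fix a per-channel success probability" heuristic:
// cap the first chunk so every hop meets the target success rate.
// Names and model are illustrative, not rust-lightning API.
fn first_split_amount(smallest_capacity_sat: u64, target_channel_success: f64) -> u64 {
    // With p ~ 1 - a/c, a 90% per-channel target means a <= 10% of
    // the smallest channel's capacity.
    ((1.0 - target_channel_success) * smallest_capacity_sat as f64).round() as u64
}

fn main() {
    // 10% of a 1M-sat smallest channel:
    assert_eq!(first_split_amount(1_000_000, 0.9), 100_000);

    // Over a 5-hop path, 0.9 per channel gives ~0.59 for the whole
    // path, i.e. ~1.69 attempts on average, as in the comment above.
    let path_p: f64 = 0.9f64.powi(5);
    assert!((path_p - 0.59).abs() < 0.005);
    assert!((1.0 / path_p - 1.69).abs() < 0.01);
}
```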

As I said, I will watch closely to see what ideas you come up with. I was trying this for a couple of months until I arrived at a reformulation of the min cost flow problem which leads to the exact solution. I was not able to find any heuristics that I found satisfying, but I will be excited if you can find something that is easy enough and works well.

renepickhardt commented 2 years ago

Cross post of what I wrote in the lnd issue:

I have published code for a piecewise linearized formulation of the problem that runs in less than a second and produces a reasonable approximation to the optimal solution. You can find it and play around with it at:

https://github.com/renepickhardt/mpp-splitter/blob/master/Minimal%20Linearized%20min%20cost%20flow%20example%20for%20MPP.ipynb

I think that approach - if all the engineering details (proper handling of the uncertainty network, channel reserves, HTLC limits, pruning of the piecewise linearized problem, adding fees to the cost function) are included - should be very practical for implementations and, as discussed on the mailing list, allow for large payment amounts to be handled.
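For readers who don't open the notebook, the core linearization trick can be sketched as follows. This is my own illustrative reconstruction, not the notebook's exact construction: the convex per-channel cost -ln(1 - a/c) is replaced by N parallel arcs of capacity c/N whose constant unit costs are the segment slopes, so a plain linear min cost flow solver approximates the convex problem:

```rust
// Piecewise linearization of the convex uncertainty cost -ln(1 - a/c)
// into N arcs with constant unit costs (illustrative sketch).
fn linearize_channel(capacity: f64, segments: u32) -> Vec<(f64, f64)> {
    // Returns (arc_capacity, unit_cost) pairs with increasing costs,
    // so a linear solver naturally fills low-saturation arcs first.
    let seg_cap = capacity / segments as f64;
    let cost = |a: f64| -(1.0 - a / capacity).ln();
    (0..segments)
        .map(|i| {
            let a0 = i as f64 * seg_cap;
            // Cap the last segment just below full saturation so the
            // cost stays finite.
            let a1 = (a0 + seg_cap).min(0.999 * capacity);
            let unit_cost = (cost(a1) - cost(a0)) / (a1 - a0);
            (seg_cap, unit_cost)
        })
        .collect()
}

fn main() {
    let arcs = linearize_channel(1000.0, 5);
    assert_eq!(arcs.len(), 5);
    // Unit costs are strictly increasing, i.e. convexity is preserved
    // by the linearization.
    for w in arcs.windows(2) {
        assert!(w[1].1 > w[0].1);
    }
}
```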

TheBlueMatt commented 2 years ago

Cool! Would love to see a Rust/optimized version. Obviously a second for routing is impractical for most of our users, but providing an option to use a different routing algorithm entirely for larger-value payments may be useful.

renepickhardt commented 2 years ago

There is an implementation in Rust: https://docs.rs/mcmf/latest/mcmf/ though it seems to be based on a simplex algorithm. My preliminary experiments with piecewise linear approximation last summer did not work well, as I used the networkx min cost flow solver in Python, which is also based on the simplex algorithm. At least the simplex-based Python implementation allowed for floating point values, which was comparably fast but messed up the quality of the results. Of course this might be different in the Rust implementation and you could just give it a try.

The Google lib seems to use a cost scaling algorithm. It should be easy enough to reimplement in Rust, as the algorithm itself is pretty small: https://github.com/google/or-tools/blob/86d4c543f717fa8716a9d66c25143660668bf825/ortools/graph/min_cost_flow.cc

Note that our original and slow implementation was a capacity scaling algorithm (while cost scaling has a very similar theoretical runtime to the capacity scaling approach, it seems to be much faster in practical situations).

There is a list of implementations of cost scaling algorithms on Stack Exchange, and a stand-alone version (in C) without dependencies seems to be here: https://github.com/iveney/cs2/blob/master/cs2.c (though the license is not open). I guess you would have to consult a lawyer to find out whether one is allowed to just follow that implementation to create a Rust version.

With respect to your runtime concerns

As discussed on the mailing list: especially for LSPs, one can batch several payments that occur within a second into a single min cost flow computation. Given enough load on the LSP, this approach should be cheaper than the cumulative CPU cycles spent invoking Dijkstra calls for every single payment request.

For a single stand-alone client that makes a payment once in a while, I believe sub-second computational time is still faster than onion round trips, thus the quality of the suggested paths should let it outperform a Dijkstra-based solution on wall-clock time (though not on CPU cycles).

Edit: Also, the tests from Carsten Otto implied that the Java bindings of the lib achieved 230 ms runtime for solving the min cost flow. As I noted on the mailing list, we can probably easily reduce the size of the search space/network by a factor of 10, which reduces the runtime significantly again and should push it far below 100ms. What runtime do you wish to achieve, and how expensive are the Dijkstra calls and splitting logic of this implementation now?

tnull commented 2 years ago

I'd be interested in having a look at this soon(-ish).

TheBlueMatt commented 2 years ago

Especially for LSPs, one can batch several payments that occur within a second into a single min cost flow computation. Given enough load on the LSP, this approach should be cheaper than the cumulative CPU cycles spent invoking Dijkstra calls for every single payment request.

This doesn't improve latency, though.

For a single stand-alone client that makes a payment once in a while, I believe sub-second computational time is still faster than onion round trips, thus the quality of the suggested paths should let it outperform a Dijkstra-based solution on wall-clock time (though not on CPU cycles).

I think you overestimate the number of round-trips required with a good scoring heuristic. Maybe this is reasonable as a "start running this in the background while the first attempt at paying is being tried".

Until we can get things down into the sub-100ms range it doesn't make all that much sense, and doubly not as long as it still needs base fees to be diminutive for the search to work right.

TheBlueMatt commented 2 years ago

Until then, we should still look at the original topic of this issue, but getting into some crazy routing rewrite is not the topic of this issue :)

renepickhardt commented 2 years ago

Until then, we should still look at the original topic of this issue, but getting into some crazy routing rewrite is not the topic of this issue :)

I am still puzzled that you actually made that statement. The entire research for the optimally reliable and cheap payment flows paper was not centered around routing but around the question of how we should split, as one can clearly see from the original single-line README.md and the name of the repository, which is literally called mpp-splitter and not crazy-route-rewrite. So all of my statements have been very much on (the original) topic and not about some crazy routing rewrite, and to be frank I am disappointed that you still missed that. Anyway, you do you. I can't do more than present you the solution on a silver platter. I guess I should start charging for my consulting; maybe it would be appreciated more then.

Despite your (what I consider to be very, very rude and disrespectful) comments, others like @tnull might still be interested in the progress. That is why I will note that @c-otto has posted on Twitter (https://twitter.com/c_otto83/status/1506017622747488259) that he has implemented a basic version of our results on lnd at https://github.com/C-Otto/lnd-manageJ/issues/6, and he seems to be hacking slightly faster than I am with my progress on a c-lightning integration.

So far it is only the splitting logic and not the actual payment loop and sending of onions. He computes a split of 0.5 BTC from his node to mine into 18 onions, each of which has a probability higher than 50% of being delivered, meaning on the first attempt we can expect far fewer than half the onions to fail. His split can be seen at: http://web.archive.org/web/20220322080201/https://c-otto.de/rene.txt. As he indicated before, the runtime of the solver was some 230 ms, but I guess that is a factor of 2 off from your boundary.

{
  "amountSat" : "50000000",
  "probability" : 2.1549649398107696E-4,
  "routes" : [ {
    "amountSat" : "1000000",
    "channelIds" : [ "726473x703x1", "723312x509x1", "695077x1924x0" ],
    "probability" : 0.7806365608992074
  }, {
    "amountSat" : "1240000",
    "channelIds" : [ "726473x703x1", "725453x2192x0", "693476x1211x1", "607678x1509x1" ],
    "probability" : 0.7832342855787678
  }, {
    "amountSat" : "150000",
    "channelIds" : [ "726473x703x1", "725453x2192x0", "700081x698x1", "695339x624x1" ],
    "probability" : 0.9765214420387514
  }, {
    "amountSat" : "3600000",
    "channelIds" : [ "726473x703x1", "725453x2192x0", "701584x2083x1", "558577x633x1" ],
    "probability" : 0.5628769427691509
  }, {
    "amountSat" : "1400000",
    "channelIds" : [ "726473x703x1", "725453x2192x0", "704776x2087x1", "558568x1324x1" ],
    "probability" : 0.782495149685495
  }, {
    "amountSat" : "2470000",
    "channelIds" : [ "726473x703x1", "725453x2192x0", "727359x1244x0", "633778x1065x0" ],
    "probability" : 0.721747937413449
  }, {
    "amountSat" : "2000000",
    "channelIds" : [ "726473x703x1", "727788x1249x1", "728428x605x1", "704834x1737x4" ],
    "probability" : 0.6813327353327034
  }, {
    "amountSat" : "1530000",
    "channelIds" : [ "726473x703x1", "727904x1129x0", "727359x1244x0", "633778x1065x0" ],
    "probability" : 0.8251278610944093
  }, {
    "amountSat" : "4000000",
    "channelIds" : [ "726473x703x1", "727904x1129x0", "676072x1060x0", "686282x1219x0", "691162x595x0" ],
    "probability" : 0.5740134573960525
  }, {
    "amountSat" : "4860000",
    "channelIds" : [ "726473x703x1", "727904x1129x0", "693476x1211x1", "689644x832x0", "691162x595x0" ],
    "probability" : 0.5006147390405348
  }, {
    "amountSat" : "2200000",
    "channelIds" : [ "726473x703x1", "727904x1129x0", "696023x1445x0", "703341x959x1", "703310x168x0" ],
    "probability" : 0.6664443121338184
  }, {
    "amountSat" : "1710000",
    "channelIds" : [ "726473x703x1", "727904x1129x0", "696023x1445x0", "717839x1372x2", "673116x1599x3" ],
    "probability" : 0.6797308387358896
  }, {
    "amountSat" : "3200000",
    "channelIds" : [ "726473x703x1", "727924x2119x1", "727724x1365x0", "711388x1010x1", "695339x624x1" ],
    "probability" : 0.5872281922328147
  }, {
    "amountSat" : "3350000",
    "channelIds" : [ "726473x703x1", "727904x1129x0", "725718x1671x0", "681034x1665x1", "723869x1660x0", "723869x1660x1" ],
    "probability" : 0.5780281931840187
  }, {
    "amountSat" : "3350000",
    "channelIds" : [ "726473x703x1", "727904x1129x0", "725718x1671x0", "693136x794x1", "701684x1234x1", "695339x624x1" ],
    "probability" : 0.5944870013390141
  }, {
    "amountSat" : "10090000",
    "channelIds" : [ "726473x703x1", "727904x1129x0", "725718x1671x0", "726475x755x1", "700845x2390x1", "550923x1215x0" ],
    "probability" : 0.31021424023863053
  }, {
    "amountSat" : "1000000",
    "channelIds" : [ "726473x703x1", "727904x1129x0", "727359x1244x0", "701970x2117x1", "701584x2093x1", "694267x2038x2" ],
    "probability" : 0.776652821615024
  }, {
    "amountSat" : "2710000",
    "channelIds" : [ "726473x703x1", "727904x1129x0", "725718x1671x0", "681034x1665x1", "706688x1385x0", "723785x2343x1", "723869x1660x1" ],
    "probability" : 0.5892691058206754
  }, {
    "amountSat" : "140000",
    "channelIds" : [ "726473x703x1", "727904x1129x0", "725718x1671x0", "689644x832x0", "691162x595x0" ],
    "probability" : 0.9827758955245893
  } ]
}

renepickhardt commented 2 years ago

Especially for LSPs, one can batch several payments that occur within a second into a single min cost flow computation. Given enough load on the LSP, this approach should be cheaper than the cumulative CPU cycles spent invoking Dijkstra calls for every single payment request.

This doesn't improve latency, though.

Yes, it very much does (see below), since round-trip times of onions are by now, from a time perspective, more expensive than the planning of the onions to be sent. Also, I am very surprised and of course delighted that latency suddenly becomes important to you. Last time, when I suggested adding latency considerations to the cost function, you didn't seem too keen on that. At least the issue was closed without a comment from you, and the PR that supposedly resolved that issue did, to the best of my knowledge, not do that yet. (To be fair, the latency considerations for the cost function were kind of off topic in that issue.)

For a single stand-alone client that makes a payment once in a while, I believe sub-second computational time is still faster than onion round trips, thus the quality of the suggested paths should let it outperform a Dijkstra-based solution on wall-clock time (though not on CPU cycles).

I think you overestimate the number of round-trips required with a good scoring heuristic. Maybe this is reasonable as a "start running this in the background while the first attempt at paying is being tried".

No, I do not overestimate this at all! It's literally one of the things I do most: evaluate the accuracy of our results and how it differs.

Let's assume some very simple numbers here. Median onion round-trip time is x seconds (I assume you know much better than I which x to choose), and the min cost flow computation takes 230 ms, which is y times more than your heuristic + Dijkstra. Now: how much worse, in number of failed attempts, can the quicker computation be in order to still statistically deliver payments faster? I am sure you can do that 9th-grade textbook exercise yourself, but I am also sure you will realize that your heuristic has to be very close to the optimal solution, because failed attempts are expensive in runtime and for the network, or else the onion round-trip time would have to be sub-second, which I believe it isn't.
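The back-of-the-envelope exercise reads roughly like this. All numbers below are illustrative placeholders, not measurements from either implementation:

```rust
// Toy break-even calculation: expected wall-clock time to deliver a
// payment is roughly compute time + expected attempts * round-trip
// time. Numbers are illustrative, not measurements.
fn expected_wall_clock(compute_ms: f64, expected_attempts: f64, rtt_ms: f64) -> f64 {
    compute_ms + expected_attempts * rtt_ms
}

fn main() {
    // With slow (~1s) onion round trips, a 230ms min cost flow needing
    // 1.5 expected attempts beats a 10ms heuristic needing 2.0:
    let mcf = expected_wall_clock(230.0, 1.5, 1000.0);       // 1730 ms
    let heuristic = expected_wall_clock(10.0, 2.0, 1000.0);  // 2010 ms
    assert!(mcf < heuristic);

    // With fast (~200ms) round trips, the cheaper computation wins
    // despite the extra expected attempt:
    let mcf_fast = expected_wall_clock(230.0, 1.5, 200.0);       // 530 ms
    let heuristic_fast = expected_wall_clock(10.0, 2.0, 200.0);  // 410 ms
    assert!(heuristic_fast < mcf_fast);
}
```

Which side wins therefore depends heavily on the assumed round-trip time and on how many attempts each planner actually needs, which is exactly the disagreement in this thread.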

Just to make this more concrete, here is an output from my log file:

First I optimize solely for reliability:

Round 1
Success: True    fee: 19935.000msat  p = 63.42%   amt: 5000000sats path hops: 4
Onions sent: 1
total_fee: 19935.000 msat

Number of mcf computations: 1
Total number of onions sent to deliver the payment : 1

In contrast, when I optimize for fees and reliability:

Round 1
Success: True    fee: 3340.800msat  p = 52.06%   amt: 1200000sats path hops: 5
Success: True    fee: 4277.770msat  p = 26.03%   amt: 1610000sats path hops: 5
Success: True    fee: 2099.820msat  p = 58.49%   amt:  790000sats path hops: 6
Success: True    fee: 3894.800msat  p = 51.87%   amt: 1400000sats path hops: 6
Onions sent: 4
total_fee: 13613.190 msat

Number of mcf computations: 1
Total number of onions sent to deliver the payment : 4

Of course I also compared it with optimizing mainly for fees:

Round 1
Success: True    fee:  595.400msat  p = 44.92%   amt:  260000sats path hops: 4
Success: True    fee: 3380.480msat  p = 24.75%   amt: 1390000sats path hops: 5
Success: True    fee: 1641.510msat  p = 11.85%   amt:  690000sats path hops: 6
Success: True    fee:  498.330msat  p = 68.77%   amt:  210000sats path hops: 6
Success: False   fee: 3723.200msat  p =  5.27%   amt: 1600000sats path hops: 6
Success: True    fee:   23.570msat  p = 98.84%   amt:   10000sats path hops: 7
Success: True    fee: 1862.820msat  p = 32.93%   amt:  790000sats path hops: 8
Success: True    fee:  118.800msat  p = 91.37%   amt:   50000sats path hops: 9
Onions sent: 8
total_fee: 11844.110 msat

Round 2
Runtime of flow computation: 1.36 sec 
Success: False   fee:  778.600msat  p = 32.20%   amt:  340000sats path hops: 4
Success: True    fee: 3161.600msat  p = 28.80%   amt: 1300000sats path hops: 5
Success: True    fee: 2444.440msat  p = 53.12%   amt:  920000sats path hops: 5
Success: True    fee: 1641.510msat  p = 11.85%   amt:  690000sats path hops: 6
Success: True    fee:  498.330msat  p = 68.77%   amt:  210000sats path hops: 6
Success: False   fee: 1605.630msat  p = 41.68%   amt:  690000sats path hops: 6
Success: True    fee:  628.560msat  p = 72.88%   amt:  270000sats path hops: 7
Success: True    fee: 1226.160msat  p = 50.25%   amt:  520000sats path hops: 8
Success: True    fee:  142.560msat  p = 89.71%   amt:   60000sats path hops: 9
Onions sent: 9
total_fee: 12127.390 msat

Round 3
Success: True    fee:  595.400msat  p = 44.92%   amt:  260000sats path hops: 4
Success: True    fee: 3161.600msat  p = 28.80%   amt: 1300000sats path hops: 5
Success: True    fee: 4277.770msat  p = 26.03%   amt: 1610000sats path hops: 5
Success: True    fee: 1641.510msat  p = 11.85%   amt:  690000sats path hops: 6
Success: True    fee:  498.330msat  p = 68.77%   amt:  210000sats path hops: 6
Success: True    fee:  814.800msat  p = 65.92%   amt:  350000sats path hops: 7
Success: True    fee: 1037.520msat  p = 56.39%   amt:  440000sats path hops: 8
Success: True    fee:  332.640msat  p = 77.24%   amt:  140000sats path hops: 9
Onions sent: 8
total_fee: 12359.570 msat

Number of mcf computations: 3
Total number of onions sent to deliver the payment: 26 

TheBlueMatt commented 2 years ago

Until then, we should still look at the original topic of this issue, but getting into some crazy routing rewrite is not the topic of this issue :)

I am still puzzled that you actually made that statement.

Please don't over-read tone out of text, that's a recipe for constantly being insulted on the internet without anyone intending anything.

To clarify what was meant here (apologies if it used some secondary meanings of casual English instead of strict meanings, which may have been misleading): by "the original topic" I meant a heuristic by which we split in a Dijkstra-based router, which is something we'd like to revisit in our router; by "crazy routing rewrite" I meant a rewrite of the router using a completely different algorithm, where "crazy" meant "large", not "bad".

I can't do more than presenting you the solution on a silver platter

That said, please stop claiming your solution here is "the" solution - it is not. It is a solution, one that performs well in certain dimensions and poorly in others.

TheBlueMatt commented 2 years ago

As he indicated before the runtime of the solver was some 230 ms but I guess that is a factor 2 off for your boundary.

That's awesome to hear! That's much more realistic than the one second you'd quoted above. We should open an issue to track potentially providing a non-Dijkstra router in the future.

Yes, it very much does (see below), since round-trip times of onions are by now, from a time perspective, more expensive than the planning of the onions to be sent.

Yep, but it's important to note that in many cases a second may exceed the time to send, if the path is short/through "good" nodes, which is why I highlighted one second being problematic for ~most payments, at least for well-connected nodes paying other well-connected nodes.

Also I am very surprised and of course delighted that latency suddenly becomes important to you.

Please don't complain about feeling insulted and then turn around and make comments like this.

No, I do not overestimate this at all! It's literally one of the things I do most: evaluate the accuracy of our results and how it differs.

My point was that a common lightning use case today is a well-connected node (e.g. a user behind an LSP or a large provider) paying a well-connected node (i.e. a large merchant processor), in which context round trips are generally much rarer. Just looking at numbers from pleb-net-style nodes is a bit misleading/just one deployment model of lightning, and far from the only one (but of course one that's important and that may thus want a different router!).