think about when to send routes without curves

interledgerjs / ilp-connector

Reference implementation of an Interledger connector.

Other

133 stars 59 forks source link

think about when to send routes without curves #344

Open michielbdejong opened 7 years ago

michielbdejong commented 7 years ago

now that we have remote quoting, it can make sense for connectors to sometimes send out routes without curves. the current network is probably still small enough not to need this, but it may be something to think about longer-term

michielbdejong commented 7 years ago

probably, actually, currently all routes are broadcasted with curves, and these curves are always used to determine which route to use, but never to determine what quote to give. So that's a slightly weird situation, I think. Each ilp-kit has local copies of all the curves, but doesn't use them for quoting.

momerath42 commented 7 years ago

Indeed, it is a weird (and I hope very temporary) situation. I'm not sure what small steps we can take to greatly improve things; I think a qualitative jump in routing features/quality (as opposed to bug-fixing and optimization) will require a pretty major redesign.

michielbdejong commented 7 years ago

what if we just disable remote quoting?

michielbdejong commented 7 years ago

/me ducks ;)

emschwartz commented 7 years ago

Each ilp-kit has local copies of all the curves, but doesn't use them for quoting.

Is this true? I thought it does use the local liquidity curve for quoting when it has it and only does a remote quote when it does not have the curve.

Even if we are not currently using remote quoting because everyone is broadcasting their curves to one another, I don't see any reason to take this functionality out. We know it will be useful if someone wants a more complicated fee structure that they don't want to broadcast to the entire network. If this isn't causing problems now I don't think we should spend time discussing it.

michielbdejong commented 7 years ago

Is this true? I thought it does use the local liquidity curve for quoting

let me have a look at the code, that's the only way to know for sure, I guess :)

michielbdejong commented 7 years ago

RouteBuilder#getQuote uses Ledgers#quote uses ilp-core quote, which is remote.

After that, slippage is added and the curve from the remote quote response is added to the connector's "own" quote response, but it doesn't seem to look at its locally cached curves at all in the getQuote method.

It was changed in this commit, 9 months ago, you can see the 'find next hop' code was removed there. I added a question about the title of that PR, as it seems misleading to me (but maybe I misunderstood what the code does!).

Remote quoting can be used in 3 situations: 1) source-side, where a "default route" is configured. This just makes the connector act as a sender for remote routes, so that's an easy case which doesn't affect route quality or quote quality, and although it's not currently used on by any live ilp-kit nodes, you could easily imagine someone would want to try this out. 2) destination-side, broadcast routes without curves. This is currently neither supported for broadcast-sending, nor for broadcast-receiving. Also, we probably don't need this yet since each ilp-kit only have one local FiveBells ledger, and not many complex subledgers. If we do decide to start using this (hence the title of this issue), then we should think about the impact of that - maybe we need some other measure to compare which one of two curve-less routes is really cheapest. 3) to cover up errors in the curves that were cached from route broadcasts. If the locally cached curve is wrong, then doing a remote quote will cover up the errors in it. IMHO if this is really what we are currently doing (whether by design or due to a misconfiguration), then we should stop doing it.

emschwartz commented 7 years ago

I think I understand why it works this way. It assumes that you're sending to an address on the periphery (illustrated in the diagram below). This model assumes that all nodes in the core have all the routes to everyone else in the core. This means you need to get a "tail quote" for the leg from the far end of the core to the destination account. It joins the tail curve with the local one and it looks like it uses the first hop from the local table to figure out the source amount for the quote.

Right now the ILP kit network is all core, no periphery, because all nodes broadcast all routes to one another and save everyone else's routes. Not sure though what this means about whether in practice the quotes are determined by the local tables or the remote quote.

emschwartz commented 7 years ago

It doesn't sound to me like remote quoting is the problem. It sounds like it may be using remote quoting when it shouldn't or doesn't need to. The ability to do remote quoting should not be disabled but we should make sure it is being used in the correct circumstances.

michielbdejong commented 7 years ago

sorry, by 'disable', I did not mean 'remove', I just meant 'use only in the correct circumstances' (so only 1. and 2. and not 3.)

justmoon commented 7 years ago

It doesn't sound to me like remote quoting is the problem. It sounds like it may be using remote quoting when it shouldn't or doesn't need to. The ability to do remote quoting should not be disabled but we should make sure it is being used in the correct circumstances.

I think it's a very good idea for now to have it always use remote quoting. By keeping routing and quoting separate, the system becomes a good bit simpler and more robust. If they are separate, you can't get a quoting failure because we changed something about routing and vice versa.

Right now, we often get an error that there is no path found at all. So conceptually we should at least aim to design a system that will find some valid, resonably short path if one exists.

From that baseline, we can add optimizations:

The first optimization is to think about how we can prefer cheaper paths over more expensive ones (in circumstances where doing so is safe.)

How important this optimization is depends largely on whether there is enough of an incentive to have low fees otherwise. It's plausible that the choices people make while peering and when manually configuring routing tables could be good enough to create competition.

Due to the difficulty we've had getting the current routing protocol working reliably, we should consider a simplified protocol that does not broadcast liquidity curves at all, but rather routes purely on a shortest-path basis. We would still want to ensure that connections are removed from the graph if they have run dry, but this can be a local connector function: If liquidity < $10, stop advertising this route.
Another optimization is to use information that has already been exchanged via the routing protocol to skip quoting or improve quoting latency.

I think this is not something we should attempt until the routing protocol is pretty stable and fleshed out. If quoting expects information in a certain format (e.g. a liquidity curve), it puts constraints on our design of routing which in turn makes this harder to iterate on. Note that quoting is supposed to be the more stable of the two, so any dependency of quoting on routing is going to be a PITA to maintain as we iterate routing.

Finally, it is not clear to me that this optimization is a good idea at all. In IP, there is something quite similar to quoting: Path MTU Discovery The MTU is a property of a path between two endpoints that the sender might wish to discover, just like an ILP sender would want to discover the liquidity curve of a given path. The IETF could have tried to include MTUs in routing tables, but this would have created the exact nasty dependency I've outlined in the previous paragraph. Interestingly, even when they had a chance with IPv6 to redesign it and move it into routing (as an optimization), they went the opposite direction and made it a fully end-to-end functionality. This isn't the most efficient way of doing it, but the simplest and most general.

It's plausible to me that this is simply not worth it as an optimization. Note that caching quotes would have nearly the same performance benefit at a fraction of the complexity cost. If my users are making enough payments to a certain destination ledger that it is worth it for me to have the liquidity curve in memory, there will also be enough payments to that destination that most quote requests are cache hits. As a connector, I can even decide to actively keep the cache warm. Doing it that way creates no dependency on routing and no dependency on IPR etc. (like tail quotes would have) and my intuition is it gives you 99.9% of the performance benefit.

In summary:

Quoting should NOT use liquidity curves from the routing tables for the foreseeable future because we don't want to create a dependency between quoting and routing.
We should consider a simplified protocol that does not take liquidity curves into account at all. (Credit to @momerath42 and @michielbdejong who I believe suggested this before I did.)

adrianhopebailie commented 7 years ago

I like the separation in IP between what are sometimes called the routing and routed protocols. i.e. Management protocols between nodes as opposed to the protocols responsible for actually delivering traffic (or in the case of ILP, value). Exchanging of route data is a routing protocol whereas quoting is not really, it is something new (like doing a traceroute performed before sending a packet) that probably qualifies more as a routed protocol.

Perhaps its worth differentiating between the protocols and the functions of the nodes/gateways. Because quoting and routing are both functions performed by a node but those functions could be achieved in a number of ways.

When a node receives an incoming transfer it must route it, which it does based on its routing tables. How the data got into the routing table is a separate concern. As @michielbdejong and I discussed the other day, IP has numerous routing protocols and I expect we'll end up having more than 1 too. Some may have optimizations like attaching liquidity curves to the routing data but those should not be seen as a core requirement of a routing protocol. Perhaps we should be wary of labelling what we come up with today as "The Routing Protocol"?

Likewise, when a node receives an incoming quote request it must calculate a response either by first routing the request down one or more routes to other connectors and then adding a spread to the response amount or by using some internal heuristics (like cached liquidity curves). Again, the protocol for exchanging the data that might be used as part of this function is a new protocol(s), or an extension to another connector-to-connector (routing) protocol.

So, with regards to this, and looking at things purely from the perspective of defining the protocol specifications:

Quoting should NOT use liquidity curves from the routing tables for the foreseeable future because we don't want to create a dependency between quoting and routing.

I don't see using liquidity curves as a protocol feature but more of an implementation optimization. The core protocol requires that nodes can request a quote from their peers and we shouldn't make it more complex than that.

The protocol only says that a node that gets a quote request must return a response. It doesn't say how it gets that response so if nodes find that they can effectively respond to quote requests based on cached liquidity data then that will drive the development of a mechanism to exchange that data.

michielbdejong commented 7 years ago

Curves in route broadcasts is what we currently have, and it's a direct result of the "complexity is in the connectors" design choice. We should only switch to curve-less route broadcasts it if we think we're on a dead end with the current design.

In the current network topology of the open Interledger, it's very easy to let each node hold all the routing tables including curves. So I think we should stick with curves in route broadcasts.

Advantages of keeping curves in route broadcasts:

less work, because it's the code we have
if we keep refactoring all the time, bugs will not go away, they will be replaced by new ones
curve-less route broadcasts may be more efficient for announcing exotic ledgers in bulk, but remote quoting adds a cost at the time that the user has already clicked the button and is waiting; better keep the network in an always-ready-for-action state
algorithmically, building all multi-hop routes for a network is less work if you build them all at the same time; when new information becomes available in one node, it is wise to forward this information immediately, because it will save other nodes work if they are better informed, sooner.

adrianhopebailie commented 7 years ago

We should only switch to curve-less route broadcasts it if we think we're on a dead end with the current design.

In terms of the ILP Kit design that's fine, your points are valid.

In terms of defining and documenting routing protocols I'd favor documenting a protocol that excludes them and is very simple and efficient.

momerath42 commented 7 years ago

I've been thinking of the curves-in-route-broadcast as a vestigial organ, which would be painful to remove, out of proportion to the gain of removal. Regardless of whether we want to remove it from the implementation in the near future, I agree with @adrianhopebailie that the specification shouldn't include it, but leave the question of how a connector maintains and makes use of its routing table to implementations and/or later specifications.

If we do want to remove them from the implementation, for clarity and (eventual) robustness gains, the first thing we should do is replace the use of them in choosing who to get a quote from. That component could range from dead-simple-and-not-really-any-better-than-what-we-have to ensemble-of-ML-algos-produced-over-man-years. It's not clear to me where the knee of that curve is.

justmoon commented 7 years ago

Curves in route broadcasts is what we currently have, and it's a direct result of the "complexity is in the connectors" design choice. We should only switch to curve-less route broadcasts it if we think we're on a dead end with the current design.

Yes, that's my understanding as well. Ideally, we would fix the routing protocol with minimal changes. Right now I'd give it about a 60% probability that we can make that work. We know the current least-cost routing will not work, because a single bad node could blackhole the entire network's traffic by publishing a single unrealistically good rate. At a minimum we have to change it to shortest-least-cost routing, i.e. out of the set of shortest (fewest AS hops) paths, which one has the least cost? But if that works out, we would still have liquidity curves for non-aggregated routes.

But even if the routing protocol can provide liquidity curves locally, I'd still not want to use them for quoting. For a couple of reasons:

It puts additional constraints on the routing protocol - for example:
- For routing, it might be fine if the liquidity curves are updated once every couple of hours. Sure, there may be a cheaper connector, but it may be more important to dampen route fluctuations. But for quoting, liquidity curves have to be up-to-date.
- For routing, we don't have to send an updated liquidity curve at all if we don't think that it will affect the paths taken. For quoting, we do.
- For routing, curves don't need to be nearly as accurate as for quoting, so we might be able to get away with far more compressed curves (or even simple ratios) if we don't use the curves for quoting also.
- For routing, we don't necessarily need to keep the curves in memory. If we don't expect to receive a competing curve to compare to the one we currently use (e.g. because there is only one path for a given destination), then we may be able to discard them as soon as we've calculated and broadcast the effects on our routing table. (Not saying we should do this, just giving examples to illustrate that adding quoting as a use case impacts many decisions regarding the routing protocol.)
What is a ledger for routing purposes and what is a ledger for quoting purposes may not coincide. For instance, we were talking about using ledger prefixes to distinguish transaction types with different fee policies. Suppose we have a bank with three transaction types and three different fee levels. During quoting, it will respond with different liquidity curves for each fee type and set appliesToPrefix accordingly. However, for routing, these distinctions are irrelevant - all payments go through the bank's connector, so the bank would likely just broadcast its entire prefix with a liquidity curve representing its currency, which does not contain those fees.
If we make providing curves to the quoting system a use case for our routing protocol, we have to consider quoting in all discussions about the routing protocol. Designing the routing protocol is difficult enough. If we can avoid bringing in the cross-cutting concern of using the same curves for quoting, we should do so.
If there is a connection between the two, every change to the routing risks introducing a new bug by changing the data that the routing protocol supplies to the quoting code in some subtle way.
This optimization will become worthless once we start aggregating routes.
Caching quotes (possibly with active prefetching) is available as a more robust alternative.

momerath42 commented 7 years ago

Yes, that's my understanding as well. Ideally, we would fix the routing protocol with minimal changes. Right now I'd give it about a 60% probability that we can make that work.

I'm unclear what you mean about this. From my perspective the routing part of things is already quite solid, if the goal is for every node to always have a route to any of the destinations that should be available in reality (I've tested every nonisomorphic topology of 7 or fewer connectors, and we're now 39% of the way through the 8 node graphs). My uncertainty about the achievement of this goal is in the realm of failure-modes caused by network connections coming up and going down with problematic timing.

The original goal of the implementation I hacked up (in a minimally-invasive way, hence the vestigial curves in route-broadcast messages) went beyond this, into multi-criteria pathfinding. And I think, ultimately, routes will be chosen by such a method (and support for it might be required as far down as route-broadcast). But as per our plan when I started hacking on the js implementation, routing now resembles distance-vector protocols, and quoting is (or is supposed to be) remote-only.

So, what's missing in the short term, from my perspective, is a better, simple route-selection mechanism, that doesn't rely on the broadcast curves (which we can then remove), and a quote-caching system (which might well be a requisite of a route-selection method).

michielbdejong commented 7 years ago

The current situation is based on honesty: each connector broadcasts a curve, and a sender has to trust each connector along the path not to steal a little bit during the remote-quoting phase.

That needs to change, because we want the network to at the same time be open to unvetted participants, and to handle real money.

If routing and quoting are decoupled, then you need to trust each connector along the route; it will become very easy for connectors to always charge high connector fees.

Thus far, I think we agree, right?

Now, the question is, how do we improve the current situation. :)

My suggestion would be: use the liquidity curves from the routing broadcasts for quoting, and add peer metrics. That way, a connector that announces cheap routes will have to route payments at that cheap price, and will either lose money or have a high failure rate, meaning it will be expelled from the network.

Other metrics (apart from price and payment success rate stats from the peer metrics) which I think could be relevant for route selection would be latency, and maybe throughput (although that's probably hard to measure other than through success rate under high traffic).

I do agree that the current two-objective routing tables are not the simplest solution. As a simplification, to get a single-objective cost function, we could route based on "required source amount for sending 1 USD of destination amount within 10 seconds, with 95% success rate".

Another option would be to always remote-quote all possible paths, through the whole network, and not just over one path, but I think we agree that that would lead to a lot of unwanted message sending storms.

We can use # hops and split-horizon as heuristics to avoid loops, but IMHO if you use it as the cost function, connector fees in the network will no longer be subject to competition, and thus become very high.

momerath42 commented 7 years ago

Reminder: the broadcast curves aren't updated, so you shouldn't trust them as a metric for route choice over long time periods anyway, even if everyone was being fair, but rates were changing.

I think we can do pretty well with efficiency and bad-actor tolerance, without having curves be part of routing, and without any algorithmic improvement over path-vector (or even distance-vector). This is coming from someone who'd love to spend all his time reading papers and trying to cobble the best ideas together into greenfield prototypes.

Connectors know when they have multiple next-hop options for a given destination. They don't currently know about downstream topology[1]. At startup, they receive all the quotes they might want, in a push fashion. Whether those quotes are updated on-demand (low-transaction volume and/or high volatility) or by a subscription mechanism, obviously there's a practical limit to how many {next-hop, destination} pairs we want to be keeping an up-to-date picture of (or waiting for a response from for a given payment in the low-volume case). To me, that speaks to the separation of liquidity-curve info from establishment of routing tables.

Without really digging in, I think it's best seen in reinforcement learning terms: trying to balance explore and exploit. In an established network, I would expect the accepted 'fair' exploitation of the system's properties to cross well into what would be disruptive behavior on the current network. More importantly, I don't see any 'ledge' in the slope between a network of trusted peers, and that eventual future. I don't know if there are any purely technical solutions, and I'm doubtful that even given natural incentives, and an acceptable level of loss for any given relationship (did I just define trust lines?), that there is a simple metric, which if it were widely adopted, wouldn't be subject to disruption-for-profit, followed by scrambling to adapt. So, I start thinking about ensemble systems involving PID loops and anomaly detection... and I never write up a simpler design.

I'd argue for some transparency into that, along the lines of BGP, but for network stability reasons: it makes loop detection trivial.

michielbdejong commented 7 years ago

In this network:

           1- C -1
         /         \
A --3-- B  ---3---  D

route ABD has cost 6 and route ABCD has cost 5. But you can only discover that if the cost function you want to optimize is used as the cost function for the routing algorithm. If you use path length as the routing criterion, ABD will look better than ABCD.

I think finding the best route in this case is important.

momerath42 commented 7 years ago

I should have qualified 'distance-vector' - in the classic algo, you throw away routes of greater distance/cost. I'm suggesting that connectors maintain knowledge of the reachability of all destinations for all peers (which they do now). Which of those peers to choose for any given destination+amount, at any given time, would be solved by a separate system, which takes into account (possibly out of date at times) quoting data, as well as a reliability metric, and possibly other considerations.

momerath42 commented 7 years ago

Sorry, was trying to clarify further, decided I wasn't making sense, and clicked close, thinking it was cancel. Hopefully I'll be able to write up something more complete and lucid in the morning.

michielbdejong commented 7 years ago

So what, concretely, would be the cost function of the distance-vector routing step? Just reachability of a destination? (cost of a route is 0 if reachable, infinity if not)?

And what would be the search algorithm for the quoting step? try out all combinations, or some heuristic for exploration/exploitation?

emschwartz commented 7 years ago

It sounds like there are still unanswered questions about a) our medium to long-term quoting/routing strategy and b) how we should improve the implementation in the short term.

Clarifications (this is how I understand the following terms, I'm worried we may be using different definitions of the terms):

Route discovery - how connectors find out about available routes to a given destination.
Route selection - how connectors determine what next hop(s) they should use for a given destination prefix. Different connectors may use different algorithms, that take into account different metrics, but we need to agree on what to implement in the reference implementation.
Routing protocol - the data format connectors use to send one another information relevant for their route discovery and/or route selection algorithms. There is currently a debate about what information should be included in this.
Quoting algorithm - how connectors answer quote requests. Connectors may use information cached locally or remote quotes to other connectors to answer requests. Like the routing algorithm we need to agree on what to implement in the reference implementation.
Quoting protocol - how you ask a connector for the rate or liquidity curve to a specific destination prefix. This is something that must be standardized early and should be very simple.

I would summarize @michielbdejong's perspective as follows:

The route selection algorithm must involve the cost, because that is what users of it will care about and we cannot rely even on an honest connector to consistently maintain the cheapest path. Path MTU is not a good parallel because that does not change the route you want to take, only what fragmentation you use.
If we switched from "routing" to "quoting" with quote caching, we would end up needing push-based curve broadcasts because stale cache entries or paths that have gone bad or gotten too expensive would case failed payments. You would refresh your cache when the payments fail but it would result in a bad user experience.
If we have quote caching and then re-add a push-based update mechanism to allow connectors to proactively notify peers about liquidity or rate changes, we would effectively be back where we are now but we would have renamed "routing" to "quoting".
Right now our implementation has issues but those are the result of bugs that we should fix rather than an indication that the algorithm is fundamentally flawed.
The simplest path forward is to fix the current bugs (like https://github.com/interledgerjs/ilp-connector/issues/338) and answer quotes from the cached curves when they are available.

My understanding of @justmoon's and @momerath42's perspective is:

Routing and quoting in the ilp-connector are currently broken and the routing algorithm is fundamentally flawed.
Connectors should answer all quotes using remote quotes to other connectors. This is a fundamentally simpler starting point and we can add other optimizations later.
When network traffic is a problem connectors can cache liquidity curves from quotes and possibly implement active prefetching (using connector-initiated quote requests).
If the routing protocol includes liquidity curves and the routing algorithm is based on that only, as opposed to number of hops, we will run into accidental and malicious routing black holes.
Even if the routing protocol included liquidity curves, the quoting algorithm still should not use them.

My questions about @justmoon and @momerath42's perspective:

How do we know the current routing/quoting algorithm is fundamentally broken rather than the issues we're seeing being a result of bugs in the implementation?
How would you deal with significant changes in rates or liquidity that happen before caches are invalidated? Would this need push-based updates?
Do you expect connectors to ask for remote quotes from one or multiple connectors?
Would we cache liquidity curves in the same data structures we currently call routing tables?
Is there a substantive difference between the routing protocol including liquidity curves and the quoting protocol including push-based updates?

adrianhopebailie commented 7 years ago

Edited: Route discovery -> route selection below

Great summary @emschwartz!

I would add, and I think @justmoon made this point, that routing data and liquidity data have very different update frequencies. This is further motivation to use different protocols to exchange this information.

It's possible (without curves being exchanged at all) for a node to use a Routing protocol to determine which of it's peers give it access to which areas of the ILP address space. This doesn't mean it's able to do full Route selection yet, because as @michielbdejong points out, cost is an important dimension to consider in deciding where to actually send a payment.

But, with a routing table populated with just physical route data, a sender can then issue quote requests to all of the peers it considers worth asking for the cost to deliver a payment. This may become inefficient and there is a chance that cheaper routes are missed because they are filtered out before quoting but trying to push liquidity curves into the route data seems like a premature optimization.

Rather, with this simple model working we can introduce extensions to the quoting protocol that return curves so that a quote requestor is able (if it chooses) to cache data for future quotes.

TL;DR: Routing protocol should be optimized so that processing, aggregation and rebroadcasting route data can be optimized within connectors. Exchanging liquidity curves should be seen as an optimization of the quoting process, possibly introducing a new connector-to-connector protocol for this purpose.l

momerath42 commented 7 years ago

I object to the characterization of our current Route discovery or Routing protocol as fundamentally broken. The behavior I intended it to have, when I was asked to make it work and scale well enough to do a 30 ledger demo, without it falling over, is in place and working reliably.

I agree that removing cost (and other) information from the route-discovery & routing-protocol layer does punt on the route-selection problem, and that a solution to it at the quoting layer, in some designs, starts looking like a distributed routing algorithm. But, my claim is that we don't have a distributed multiple-metric Route-selection design that would scale. The only designs I see working are heuristic, and whether quotes are push or pull is dependent on the workload of the connector and its peering relationships.

michielbdejong commented 7 years ago

I object to the characterization of our current Route discovery or Routing protocol as fundamentally broken.

Well, the current route selection algorithm (as in, actual current master branch of this repo), from what I understand, can be characterized as 'gratuity-based': it's like choosing a restaurant that includes a gratuity in the bill. You choose the restaurant based on the prices listed on its menu, but the price you end up paying is essentially unrelated.

The current plan, from what I understand, can be characterized as 'cached quotes' route selection: we remove the liquidity curves from the route broadcasts, and we choose a cost function (for instance, number of hops) which is unrelated to monetary cost. Then, routes are not used to decide which path to take; instead, they are only used to decide along which path to quote, and cached quotes essentially become a renamed version of our current routing tables.

IMHO this plan will take a lot of refactor work, and not having push (broadcast) messages will mean information will travel slower and less efficiently through the network.

I don't like this plan. I think, instead, we should revert https://github.com/interledgerjs/ilp-connector/pull/206#issuecomment-300741626 and use the liquidity curves (which are already implemented) both for route selection and for quoting. That way, the 'gratuity' system is removed, and connectors can compete with each other using market forces.

Note: this is under the assumption that our reference connector can be used on open networks, and not only on closed network where network participants are vetted.

momerath42 commented 7 years ago

From my perspective, the "current plan" includes a heuristic route selection mechanism, utilizing quotes from multiple peers (but not every peer every time). This does not mean choosing a cost function unrelated to monetary cost; if a peer's quote is expired, but we use it because they've been the cheapest and most reliable, and it turns out more expensive than expected, the heuristic will take that into account.

Keeping liquidity curves up to date for all possible routes is, I believe, requiring too much work of low-volume connectors, and limiting the scale at which high volume connectors can work (else they'll limit the number of peer,destination pairs they have to deal with).

michielbdejong commented 7 years ago

Talking strictly about the open Interledger, let's try to come up with some numbers of how impossible it would be to maintain perfect routing info: say there are 20 nodes, and each node has 5 trustlines. Say each trustline changes three times a day - maybe once due to fixer.io exchange rate changes, and twice because balances changed so much that the liquidity cap had to be adjusted.

That means each node sends out only 3x20x5=300 messages a day, and receives only 3x20x5 messages a day - so each ilp-kit needs to send and receive roughly one message every 5 minutes? That's still a lot less than the current heartbeats every 30 seconds.

IMHO, let's just try to get ilp-kit working and do a 2.1 release on Wednesday. Then if new bugs or scalabilty problems show up, we'll roll with the punches?

momerath42 commented 7 years ago

Why would there only be 20 nodes? I thought we were talking about something that people adopt. I'm also looking toward a competitive environment, where rates are changing much more frequently.

As far as a 2.1 release, what are you envisioning it including? You make it sound (to me) like we already have what you describe.

michielbdejong commented 7 years ago

Off the top of my head, the current list of things that need fixing, is something like:

[ ] never accept a route to a destination which you know is local to you
[ ] never broadcast a route to that route's own nextHop ("split horizon")
[ ] remove the "a gratuity will be added to your bill" thing
[ ] when adjusting curves to liquidity, adjust them to 50% instead of 100% of total available liquidity; that way, a small decrease in liquidity doesn't immediately trigger a need to change the routing curve
[ ] after each balance change, recalculate the curves that have that ledger as nextHop
[ ] some heuristic to aggregate updates over time, for instance, delay forwarding of a route's updates until at least 30 seconds after the previous update

Maybe we can fix half of these points for 17 May, and then we leave the other half for 31 May. In any case, let's at least try to have all our major routing/quoting bugs fixed before the workshop?

momerath42 commented 7 years ago

My list of things that need doing is shorter:

never broadcast a route to the route's next hop (assuming this isn't already checked; I thought it was)
a (simple) route selection heuristic, to avoid the 'gratuity' exploit, and take into account reliability as well as cost

and then at our leisure:

remove curves from route broadcast
quote caching
including hops in route broadcast to simplify loop-avoidance

justmoon commented 7 years ago

I think, instead, we should revert #206 (comment) and use the liquidity curves (which are already implemented) both for route selection and for quoting.

That is what we used to have. At the time we implemented it we already knew that this approach was not going to be viable (but chose to implement anyway planning to improve it later), because of the following issue:

If the liquidity curve is used as the only routing metric, that means that any connector can take the whole network offline by publishing unrealistically good offers. Since it is not trivial to compare asset value across ledgers, it is very difficult to detect which connector is causing the problem. For instance, supposed you normally route payments from the US to Luxembourg using the following path:

(USD) 100 -> 110 (EUR)

But now a new path is advertised:

(USD) 100 -> 99 -> 5399 -> 22 -> 124 (EUR)

Which of the arrows represents the connector that is giving an unreasonably good rate? It could be the 5399 -> 22 because 22 bitcoins for 5399 XRP is an amazing deal. Or that could be a terrible offer if it is actually 22 JPY for 5399 Swiss francs and the buggy offer is really 5399 Swiss francs for 99 US dollars. Doesn't matter, all payments regardless of source and destination will be routed to the buggy connector and will fail.

When we visited the IETF in Berlin, we sat in on a routing working group meeting. As they were discussing different proposals, one of the evaluation criteria was the "blast radius" of the routing protocol. I.e. How much of the network can a single screwup take down?

If we route purely based on liquidity curves, our blast radius is infinite. One poorly configured connector will take down the entire network until it is depeered by all of its peers.

How does shortest path routing help? With shortest path routing, a poorly configured connector must first be on the shortest path before it can negatively impact the routing. That means connectors that are very central can still do damage, but not as universally as before. And connectors that are peripheral can only damage their local environment. You would expect (and experience with Internet routing has shown) that larger, more central routers tend to screw up less, because they have more business at stake. As a result, this change actually provides a disproportionate benefit in terms of stability.

But isn't it less competitive?

The route selection algorithm must involve the cost, because that is what users of it will care about and we cannot rely even on an honest connector to consistently maintain the cheapest path.

True, cost is one of the metrics users will care about alongside success probability, latency, privacy.

Even without considering blast radius, the length of the path is a great metric to start with, because it is highly correlated with all of these, all else being equal:

Shorter paths will be cheaper.

On the Ripple Consensus Ledger the cheapest path was practically always in the set of shortest liquid paths. Which makes sense when you think about it. Every time you convert assets, there are risks, inventory costs, adverse selection costs etc. So if you want to save money on a payment, reducing the number of hops is one of the most important things you can do.
Shorter paths will have lower latency.

Should be pretty obvious.
Shorter paths will be more reliable.

Fewer hops means fewer systems that have to be online.
Shorter paths will be more private.

Fewer nodes will see your payment.

I'm not saying that a shorter path will always be cheaper, faster, more reliable than any longer path. But the best path will often be in the set of shortest paths and it will almost always be in the set of shortest paths and one hop longer paths.

So one compromise I would propose would be to route based on a cost metric, but considering only the shortest paths and, if we can figure out the right algorithm for that, one hop longer paths as well. That limits the blast radius while still allowing competition over cost.

START INTUITIVE OPINION THAT I DON'T EXPECT TO BE ABLE TO CONVINCE ANYONE OF (YET)

That said, there are also reasons for not including cost in automatic routing at all. As a connector I want as much traffic as possible to be routed through me. So I'll probably try to manipulate the routing protocol to send me more traffic. Which means I might lie about the rate. Now, you may say that that's a reason to use the routing protocol's liquidity curves in lieu of quoting, but that's also a mistake. Yes, payments will fail then, but I still won't know if payments through my peers fail because the rate was wrong or because someone is intentionally sending failing payments to make a certain connector look bad. Either way, I'll have to manually intervene.

So how to create competition to make things cheaper? Well, how did competition happen on the internet? Routing a data packet has a cost just like a money packet does. And the competition to drive down this cost to near zero came because people continued to peer with cheaper and cheaper peers. Over time, this lead to competition and a reduction in cost. So perhaps our protocols need only be designed to maximize the chance of payment success and we should avoid adding features that aim to help competitiveness, but only end up adding complexity and creating attack vectors.

Part of this could be that you charge peers for liquidity and not for individual payments. I.e. I grant you a certain "bandwidth" for doing micropayments and that's how I make money instead of charging a fee on individual transactions.

END INTUITIVE OPINION THAT I DON'T EXPECT TO BE ABLE TO CONVINCE ANYONE OF (YET)

Path MTU is not a good parallel because that does not change the route you want to take, only what fragmentation you use.

But that's exactly my point - quoting should not affect the route you take.

If we switched from "routing" to "quoting" with quote caching, we would end up needing push-based curve broadcasts because stale cache entries or paths that have gone bad or gotten too expensive would case failed payments. You would refresh your cache when the payments fail but it would result in a bad user experience.

That's a strawman. I never suggested a push-based quoting system and in fact such a system would be a terrible idea. But let's roll this back a bit and discuss.

We started out with a routing system that prepopulates liquidity curves in the routing table. These curves can in theory be totally up-to-date. But that creates a serious problem: The system becomes insanely chatty. Every time the rate changes anywhere, we flood the network with thousands of route updates. Even though the routes haven't changed. Many updates that affect the rate don't affect the route.

If we are purely concerned with routing, we can even make further trade-offs. We care about least-cost routing, but it doesn't have to be perfect. If a new connector comes along that offers cheaper rates, it's fine if it takes a while for traffic to be directed her way. It's only in the opposite case when a route becomes unavailable, that we MUST send an update right away. So we can reduce the number of required updates even further.

Now let's talk about quoting. Before we introduce caching, let's first consider the base case: We ask for a quote every time. There is no staleness issue, every payment is preceded by a quote request. This works fine, but there are a few cases where it's suboptimal. For example, if I make a hundred payments in a row very quickly, probably the rate doesn't change that much, so it's quite noisy to quote every time.

So we introduce a caching feature. But of course we don't want to bring back the staleness problems that we had when quoting was tied in with routing. So we ask connectors to include an expiry date in their quote responses. This is the amount of time that they promise to honor this quote. For some paths with volatile liquidity, the expiry date may be short. For others it can be longer. In any case, it's chosen to offer the greatest amount of caching while preventing staleness.

If we have quote caching and then re-add a push-based update mechanism

Quote caching would be purely caching, not push-based.

Right now our implementation has issues but those are the result of bugs that we should fix rather than an indication that the algorithm is fundamentally flawed.

That's incorrect, right now our algorithm can be trivially exploited to take down the whole network, simply by advertising a very good rate.

Switching to shortest-path-first routing is necessary to solve this problem, but not necessarily sufficient. Malicious nodes could still lie about the path behind them. They couldn't lie about the path in front of them, so just switching to shortest-path-first may be enough - if not, we would need further changes like authenticated routes.

How do we know the current routing/quoting algorithm is fundamentally broken rather than the issues we're seeing being a result of bugs in the implementation?

If I had the time, I'd demonstrate it by taking down the network. But it's fairly easy, just comment out the actual payment processing code in your connector and hard code it to use a highly negative spread, e.g. -1000000%. The routes your connector puts out will be so good that everyone will want to route everything through you. Even if I'm peered with a connector that has a direct connection to ledger X, it'll still be worth going through the subsidized broken connector first, so I'll route almost everything that way.

Do you expect connectors to ask for remote quotes from one or multiple connectors?

One connector. Routing (finding the shortest + cheapest path) is the job of the routing protocol.

Would we cache liquidity curves in the same data structures we currently call routing tables?

I don't understand the question.

A routing table is a mapping prefix -> [next-hop, next-ledger, is-local?, expiry-date]. A cached quote is a mapping prefix -> [liquidity-curve, expiry-date]. So yes, there are obviously similarities, but no, they are not the same. For one, I may use a lower granularity of prefixes for routing than for quoting as I mentioned in my previous post.

Is there a substantive difference between the routing protocol including liquidity curves and the quoting protocol including push-based updates?

Not really. I'd expect both to be too chatty as to be viable options.

momerath42 commented 7 years ago

Do you expect connectors to ask for remote quotes from one or multiple connectors?

One connector. Routing (finding the shortest + cheapest path) is the job of the routing protocol.

This seems to contradict earlier statements about the routing protocol. My view is that the routing protocol doesn't particularly care about "shortest" paths, except to prevent loops. Instead of the classic *-vector algorithm, where longer routes are discarded, I suggest (as we do now) that connectors maintain knowledge of the reachability of all destinations through each of their peer connectors, and that on occasion, depending on the heuristic, more than one connector is asked for a quote.

michielbdejong commented 7 years ago

one compromise I would propose would be to route based on a cost metric, but considering only the shortest paths and, if we can figure out the right algorithm for that, one hop longer paths as well. That limits the blast radius while still allowing competition over cost.

I agree blast radius is indeed an important concern. Just practically though, when you get a route broadcast that includes a field num_hops: 1, how do you know whether this is actually really true or not? Maybe this is a lie, in an attempt to obtain traffic, and the number of hops is really much higher?

quoting should not affect the route you take

Ah ok, reading this as "the price of a route should not automatically affect whether or not you take it", that changes everything. I'll think more about this tomorrow!

michielbdejong commented 7 years ago

I suggest (as we do now) that connectors maintain knowledge of the reachability of all destinations through each of their peer connectors

What would the internet-equivalent of the "reachability" of a given IP address from a given switch? Wouldn't reachability (almost) always be 100%?

momerath42 commented 7 years ago

What would the internet-equivalent of the "reachability" of a given IP address from a given switch? Wouldn't reachability (almost) always be 100%?

For this analogy, the question would be reachable on which port? For the vast majority of hosts on the net, and the routers that serve them, there is a clear distinction between upstream and downstream. Multi-homing is the analogous situation to a connector with more than 2 peers, and routing protocols generally don't play a large role in multi-home situations (traffic is usually balanced across them, or the extra links are there for failover).

michielbdejong commented 7 years ago

On the plane yesterday, Ben and Evan explained to me how the num_hops can work: when forwarding a route, you know the number of hops will be at least you and the next connector. Therefore, a connector can only lie about the number of hops behind it, not about the number of hops in front of it, and that's why the measure still makes sense.

one compromise I would propose would be to route based on a cost metric, but considering only the shortest paths and, if we can figure out the right algorithm for that, one hop longer paths as well. That limits the blast radius while still allowing competition over cost.

So then I agree with Stefan's compromise proposal. If I may try to recap what I think the plan for this would be:

We add num_hops to the broadcasted routes, to limit blast radius
We change the connector back so that it uses the curves from broadcasts again
We see how this performs on the open Interledger, fix bugs, and iterate

Let's discuss it more in the planned meeting, 4pm Luxembourg time today.

dappelt commented 7 years ago

one compromise I would propose would be to route based on a cost metric, but considering only the shortest paths and, if we can figure out the right algorithm for that, one hop longer paths as well. That limits the blast radius while still allowing competition over cost.

As @justmoon already pointed out, selecting a route only amongst the shortest paths potentially limits competition for exchange rates amongst connectors (e.g. think of ILP kit, where direct peers could charge an arbitrary rate). Therefore, we should come up with a (simple) heuristic that considers the tradeoff between cost and number of hops. For example, a weighted sum of cost and hops. Cost and hops would need to be normalized (i.e. in the interval [0;1]) w.r.t. to an optimal route and assigned a weight. Optionally, we could add a hard constraint that discards any routes that have more than x hops than the shortest route.

Eventually, I think we will see custom connector implementations being developed that will optimize this heuristic to maximize profit and they will likely have a rich data corpus to do so. At the current stage, IMO we should focus on stability and prevent attacks related to the blast radius (e.g. blackholing).

michielbdejong commented 7 years ago

I think # hops only limits the blast radius if it's an absolute criterion; if you still allow more hops for "very" cheap routes, then attackers would just adapt to that, and make their blackhole routes so incredibly cheap that it blows the # hops criterion out of the water.

So I agree with @justmoon's proposed compromise: route based on cost, but don't use routes that are more than 1 hop longer than the shortest route.

emschwartz commented 7 years ago

@justmoon @momerath42 @michielbdejong @adrianhopebailie @dappelt @sharafian and I had a call about this and framed the debate in the following way.

TL;DR: There is little debate about the ILQP and CCP protocols but still debate about how the reference connector implementation should use CCP, answer quote requests, and route payments.

Interledger Quoting Protocol (ILQP)

This includes:

Quote for fixed source amount
Quote for fixed destination amount
Quote for liquidity curve

The quoting protocol should not change from this and we should consider the binary format @sentientwaffle has recently been implementing pretty much done.

For clarity’s sake we should only talk about “quoting” as the process of asking a connector for the rate to a given destination. How a connector decides to answer a quote request is not part of the quoting protocol.

Connector-to-Connector Protocol (CCP)

This is a protocol that connectors use to communicate with one another.

Based in part on @justmoon's conclusions and a suggestion from @bbock, this protocol should probably remain JSON or at least a format that allows arbitrary key-value pairs to be added. Connectors will likely have different policies for routing payments and the communication protocol should just make it easy to include whatever information connectors want in there.

ilp-connector implementation

Even if in the future everyone will be running custom connector implementations with proprietary logic, we need something that works out of the box now. Given the above conclusions about the protocols, most of the current debate is about how the reference implementation of the connector should work.

Options for what our implementation puts into CCP and expects to find in there:

Reachable destination prefixes
Number of hops
List of autonomous systems to destination (ledgers or connectors?)
Liquidity curves to destination prefixes
Max liquidity (bandwidth)

Options for when CCP broadcasts happen:

When liquidity curve changes
When routes (e.g. number of hops) change

Options for how the connector answers quote requests and routes payments:

Remote quote to a single connector with the shortest path to the destination
Remote quote to a single connector marked as the best in the routing table by a statistics-based heuristic or machine learning algorithm
Remote quotes to multiple connectors with either the shortest path or one hop more
Cached remote quotes
Liquidity curves broadcasted by peers
Pluggable module that can manipulate the routing table so everyone can implement their own prefered solution (this still leaves the question of what information is included in CCP)

Options for how routes and curves are stored:

Current routing table design that is a map of destination prefixes and liquidity curves
Routing table only stores destination prefixes, next hop, and number of hops and there may be a separate remote quote cache

Scalability considerations

Assume 200 total nodes, 5 connections per node
Blast radius of attractive rates
Blast radius of liquidity curve updates
Network chattiness
Malicious connector attracting all traffic
Malicious connector making another connector look bad
Design for micropayments (maybe don’t need better paths for different size payments or liquidity curve combinations)

bbock commented 7 years ago

I’m new to Interledger, so please take my comments with a grain of salt ;)

Metric

The routing protocol should first and foremost be concerned with maximizing reachability while minimizing cost to do so. Reading through this thread, it may very well be that the metric for “cost” is not yet clearly defined, whether it is based on liquidity, path length or some combination of both.

The metric is the primary goal we want our network topology to optimize for. In a payment network, I guess one wants to optimize for best exchange rates. Path length is a secondary factor which one mainly cares for due to the fees each connector will add. Therefore, I’d use the liquidity curve as routing metric, although it increases the complexity of the routing (not just a single number but a curve, frequent updates).

I’m unsure whether @justmoon’s statement about the correlation of cost and shortest path will hold true in the future, given that there may be different notes specializing on certain aspects like success rate and latency (premium service), cost or other requirements like conformance to specific regulations etc.

Policy

As I discussed with @emschwartz during the Munich meetup, the routing protocol should be extendable with further information, much like BGP communities. Keeping the protocol open for further development seems to be essential in this early stage.

Conveying policy information together with routing information will enable connectors to develop specialized products. Given that this is future development, I’d rather not specify this now in detail but keep the protocol open for policy extensions. An extendable format like JSON seems to cover this.

Regarding local policy: Local policy is everything that diverges from the ideal state of the network. Possibly a connector needs some kind of policy due to legal reasons. Commercial connectors will have needs for maximizing profit by proprietary “optimized” forwarding.

Long term, you probably want a plug-in mechanism for modification of forwarding information in the connector.

Chattiness

Of course, due to exchange rates being more volatile than pure connectivity, the number of routing updates is a challenge when compared to the internet. As already discussed, minor updates of the liquidity curve shall not change the routing table to keep the amount of routing updates under control. So, if you take the liquidity curve as the metric for routing, you will want some mechanism to prevent a minor update from propagating through the whole network.

This will always be a trade-off between chattiness and stale routing information. The consequence of course is that the routing information is stale / unreliable for quoting. You can either

not use routing information for quoting, or
optimize the frequency of routing updates equally for the number of messages and the information not being stale, while making sure this is not exploitable by a connector.

As option 2 sounds hard to me, I’d go for option 1. ;)

BGP handles this issue with a mechanism called route dampening to limit overly chatty neighbors. A similar feature should also take care of flapping routes due to bandwidth / liquidity problems.

If we need to cache the number of quotes done by the connectors, this can and should be solved independently from routing.

Blast radius

I wouldn’t care too much about malicious connectors, because you need to have an agreement with your neighbours anyway, probably including some legal contract. In BGP, there’s no peering without a commercial agreement either.

Accidents are much more likely to occur than malicious connectors.

Any connector should enforce the agreement with appropriate route filters. This is best practice in BGP for a long time. If the connector isn’t enforcing that, it is risking its own reputation. If routes are properly filtered on the ingress, misbehaving connectors are much more unlikely to have a huge blast radius. I’d add route filtering to the reference implementation, it’ll increase the stability of the network a lot.

Filtering is of course most important on the edges of the network, as @justmoon already said, because core routers/connectors are likely to be much more stable than edge connectors.

Also, another building block to manage this might be local policy again. High quality connectors may choose to evict routes automatically when they have an increased failure rate.

adrianhopebailie commented 7 years ago

@bbock really useful feedback thanks. Filters will be a great addition to the connector functionality.