interledger / rfcs

Specifications for Interledger and related protocols
https://interledger.org

Add idempotence/retry behavior to ILP-over-HTTP #567

Closed kincaidoneil closed 3 years ago

kincaidoneil commented 4 years ago

ILP-over-HTTP has a known issue: if the response containing a Fulfill packet is dropped (e.g. due to a network error), there's no mechanism to retry it, so the balances between the two peers get out of sync and require manual reconciliation. Also, if the request with an ILP Prepare is dropped, it won't be rejected until the packet expires, which is undesirable since it delays payments.

This proposes a backwards-compatible change to ILP-over-HTTP that separates the Prepare and the Fulfill/Reject into two separate HTTP request/response pairs. Each packet carries a corresponding idempotency key, enabling clients to safely retry packets without triggering the same side effects multiple times.

The flow looks roughly like:

  1. Alice sends POST with the ILP Prepare to Bob.
  2. Bob begins processing the ILP Prepare and responds with a 200 status. The response is dropped due to a network error.
  3. Alice doesn't receive any response after 5 seconds (for example), and retries sending the Prepare.
  4. Bob ignores the Prepare since the request has an idempotency key he's already seen, and responds again with a 200.
  5. Alice gets the response, and knows no further retries are necessary.
  6. Bob gets the corresponding ILP Fulfill, and sends a POST to Alice with the Fulfill packet and same idempotency key as the ILP Prepare.
  7. Alice gets the request and responds with a 200 status.
  8. Bob receives the 200 response and knows no retry is necessary.
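
A minimal sketch of Alice's side of this flow, assuming a TypeScript implementation (the header name, URL, retry interval, and timeout below are illustrative, not part of the spec): she re-sends the Prepare with the same idempotency key until Bob acknowledges it with a 200, and Bob deduplicates on that key, so retries are safe.

```typescript
import { randomUUID } from "crypto";
import { setTimeout as sleep } from "timers/promises";

const RETRY_INTERVAL_MS = 5_000;

async function sendPrepare(prepare: Uint8Array): Promise<void> {
  // One key per Prepare; reused verbatim on every retry so Bob can dedupe.
  const idempotencyKey = randomUUID();
  for (;;) {
    try {
      const res = await fetch("https://bob.example/ilp", {
        method: "POST",
        headers: {
          "Content-Type": "application/octet-stream",
          "Idempotency-Key": idempotencyKey, // hypothetical header name
        },
        body: prepare,
        signal: AbortSignal.timeout(RETRY_INTERVAL_MS), // step 3: give up on this attempt
      });
      if (res.ok) return; // step 5: Bob acknowledged; no further retries needed
    } catch {
      // Timed out or network error: fall through and re-send with the same key.
    }
    await sleep(RETRY_INTERVAL_MS);
  }
}
```

Bob's reply (step 6) then arrives as a separate POST from Bob to Alice carrying the Fulfill/Reject and the same idempotency key, which Alice acknowledges with a 200 (steps 7-8).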
matdehaast commented 4 years ago

Async Request/Response!

That is how the Mojaloop API handles all REST API calls. I will say it does make life trickier for development and debugging, but it alleviates all the statefulness of the actual request, which allows for better load balancing, queuing, and generally a more scalable architecture IMO.

I am really in favour of this, because when a packet goes 3-4 hops, you essentially have one long chained synchronous HTTP connection over multiple hops that could be the source of any number of errors.

kincaidoneil commented 4 years ago

One caveat to the backwards compatibility: AFAIK, there's no safe way to retry sending ILP Prepares to a node supporting the old version of the spec without them processing the Prepare multiple times (right?). If the Prepare is retried as a result of a network error, this is probably fine, since it's no worse than the current version (money could be lost)! However, if the client is retrying sending the Prepare e.g. every 5 seconds as long as it hasn't received any response to the original request, the older node would process the same Prepare multiple times.

I added a note that nodes must know whether their peer supports draft 3 or later before retrying sending Prepares, but how should this be implemented? A per-account flag?

https://github.com/interledger/rfcs/compare/060732f4f52f286c020c9ebf085b453e91e51951...f4fc0020d82b46b6fdfde537a8f251906d95120c#diff-ccfda9b7df50a9cfd8618bb9c08e360aR116

sharafian commented 4 years ago

If we want to smoothly upgrade this, we could add some version negotiation: you can carry an Accept header on the Prepare that indicates whether you're OK with an async response, and the response will come back with a Content-Type that communicates whether it's just an empty ACK or whether it's carrying the fulfillment.
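
To make this concrete, here's one possible shape of that negotiation (a sketch only; the media type names `application/ilp-reply` and `application/ilp-ack` are invented for illustration and not defined anywhere):

```typescript
async function sendNegotiated(prepare: Uint8Array): Promise<Uint8Array | null> {
  const res = await fetch("https://peer.example/ilp", {
    method: "POST",
    headers: {
      "Content-Type": "application/octet-stream",
      // Hypothetical media types: "reply inline if you can, otherwise just ack".
      Accept: "application/ilp-reply, application/ilp-ack",
    },
    body: prepare,
  });
  const contentType = res.headers.get("Content-Type") ?? "";
  if (contentType.startsWith("application/ilp-reply")) {
    return new Uint8Array(await res.arrayBuffer()); // synchronous Fulfill/Reject
  }
  return null; // empty ACK: the Fulfill/Reject will arrive in a separate request
}
```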

I do think we should be careful about retrying packets at the link layer though. Imagine there's a network error preventing your fulfills from getting back to me (like my load balancer isn't accepting traffic from you but yours is OK). My connector is going to be accumulating all of these prepare packets that it continually retries until they time out. And the original sender is probably going to be sending more packets too. Your systems will be trying to send back an increasing number of fulfills. It could lead to a big slowdown where everyone gets mucked up with packets that keep getting rejected.

It might be good to investigate the prevalence of this error before we go about solving it. If you can measure how many packets you never got a fulfill/reject for, it would inform what we should be solving for. If it never happens in practice, then maybe it's not worth adding the memory footprint of tracking idempotence IDs for hundreds to thousands of packets per second. And if it happens a lot, maybe there are some network configuration upgrades we can do instead of a protocol change.

All that criticism is separate from splitting Prepare from Fulfill and Reject. I think an async flow will probably have benefits separate from idempotence (like Matt said, it leaves far fewer hanging connections). It would break Coil's setup, though, because we horizontally scale our connector, and having the fulfill go to a different shard than the one the prepare came from would make it stop working. Does the Rust implementation have that issue too when sharded?

sentientwaffle commented 4 years ago

If no HTTP response is received within a given timeout, clients SHOULD retry sending the packet with the same idempotency key.

When the sender has gotten a 200 from the receiver (but has not received a corresponding Fulfill or Reject), why is it the sender's responsibility to retry? The receiver already has the Prepare, and could keep retrying their fulfill/reject request until getting a 200 back from the original sender.

Aside: as both parties of a payment are simultaneously HTTP clients and servers it would be clearer to refer to them in the spec as "sender"/"receiver" rather than "client"/"server".

kincaidoneil commented 4 years ago

When the sender has gotten a 200 from the receiver (but has not received a corresponding Fulfill or Reject), why is it the sender's responsibility to retry? The receiver already has the Prepare, and could keep retrying their fulfill/reject request until getting a 200 back from the original sender.

@sentientwaffle Yep! (The behavior you describe was how I intended it to be read, but I agree that "client" and "server" were unclear). I updated the language with "sender of ILP Prepare," "recipient of ILP Fulfill/Reject," etc. to try to better clarify this.

Until there's broad consensus to deprecate Draft 2, I think this RFC should support both sync and async modes, or if there's disagreement there, perhaps we should define this async behavior in a new specification.

@sappenin I reorganized and clarified the doc to better differentiate between the two modes and that they're both supported.

It would break Coil's setup, though, because we horizontally scale our connector, and having the fulfill go to a different shard than the one the prepare came from would make it stop working. Does the Rust implementation have that issue too when sharded?

Good points @sharafian. Definitely agree it's worth measuring how prevalent this is, though even a very low failure rate could be costly in the time required for manual reconciliation. Re: horizontal scaling: the async mode does seem to complicate this. There's been a little discussion about adding crash recovery to Rust; if in-flight packets were written to e.g. a Redis store shared between instances, that might make it possible to support.
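
As a rough sketch of that idea (in TypeScript rather than Rust, and purely illustrative: the key naming, encoding, and expiry are assumptions), in-flight Prepares could be written to a shared Redis store keyed by idempotency key, so whichever instance receives the asynchronous reply can look up the original packet:

```typescript
import { createClient } from "redis";

const redis = createClient({ url: "redis://shared-store:6379" }); // hypothetical shared instance
await redis.connect();

// Record the Prepare when any instance forwards it, expiring alongside the packet.
async function recordInFlight(idempotencyKey: string, prepare: Uint8Array): Promise<void> {
  await redis.set(`in-flight:${idempotencyKey}`, Buffer.from(prepare).toString("base64"), {
    EX: 60, // placeholder expiry; should track the packet's expiresAt
  });
}

// When the Fulfill/Reject arrives (possibly on a different instance), fetch and
// delete the original Prepare so the reply can be correlated and forwarded.
async function resolveInFlight(idempotencyKey: string): Promise<Uint8Array | null> {
  const stored = await redis.getDel(`in-flight:${idempotencyKey}`);
  return stored === null ? null : new Uint8Array(Buffer.from(stored, "base64"));
}
```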

I think this is ready for re-review when/if anyone gets a chance!

adrianhopebailie commented 4 years ago

This is a neat way to solve the issue, and as @matdehaast points out, this is the pattern we've seen used for Mojaloop.

I do wonder whether the async pattern is a requirement for idempotency, though. It seems a pity to lose the one aspect of HTTP that was especially useful: request/response matching.

What if we simply added the idempotency key and expected a client to terminate the request and retry with the same key if it doesn't get a response?

If you use HTTP/2, this would only cost you a new stream, which is very cheap.
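
A minimal sketch of that synchronous variant, assuming a hypothetical Idempotency-Key header and a 5-second cutoff (neither is specified anywhere): the sender aborts a stalled request and re-sends on a fresh stream with the same key, and the response body, when it finally arrives, is the Fulfill/Reject itself rather than an empty ack.

```typescript
import { randomUUID } from "crypto";

async function sendPrepareSync(prepare: Uint8Array): Promise<Uint8Array> {
  const idempotencyKey = randomUUID();
  for (;;) {
    try {
      const res = await fetch("https://peer.example/ilp", {
        method: "POST",
        headers: {
          "Content-Type": "application/octet-stream",
          "Idempotency-Key": idempotencyKey, // hypothetical header name
        },
        body: prepare,
        signal: AbortSignal.timeout(5_000), // terminate the request if no response...
      });
      return new Uint8Array(await res.arrayBuffer()); // ...or return the Fulfill/Reject
    } catch {
      // Retry on a new stream with the same key; the receiver dedupes, so the
      // Prepare is never processed twice.
    }
  }
}
```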

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

If this issue is important, please feel free to bring it up on the next Interledger Community Group Call or in the Gitter chat.

kincaidoneil commented 4 years ago

I made some changes to this proposal:

Added a Callback-Url header to the request

Previously, the idea was that the callback URL for asynchronous replies would be statically configured for each peer. However, that complicates the implementation if the ILP Prepare sender operates multiple nodes behind a load balancer: they must cache which node handled the original ILP Prepare in order to direct the ILP reply to the correct node. Including a callback URL in each individual HTTP request requires no static configuration and enables the ILP reply to be sent directly to the node instance that sent the ILP Prepare and has its associated state available (they can still choose to use a load balancer).
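
For illustration, the sender side might look like this (a sketch: the URLs and error handling are made up, and only the Callback-Url header name comes from the proposal):

```typescript
async function sendPrepareWithCallback(prepare: Uint8Array): Promise<void> {
  const res = await fetch("https://peer.example/ilp", {
    method: "POST",
    headers: {
      "Content-Type": "application/octet-stream",
      // Points at this specific node instance, not the load-balanced hostname,
      // so the ILP reply lands on the instance that holds the packet's state.
      "Callback-Url": "https://node-7.sender.example/ilp/replies",
    },
    body: prepare,
  });
  if (!res.ok) throw new Error(`Prepare not acknowledged: ${res.status}`);
  // The Fulfill/Reject arrives later as a POST to the Callback-Url above.
}
```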

Removed idempotent ILP Prepare retries (for now)

Retrying Prepares with idempotence requires the ILP Prepare receiver to operate a cache of idempotence keys, and is a breaking change for existing peering relationships (both peers must enable ILP Prepare retries for it to be safe). Note, however: this doesn't apply to idempotent retries on the ILP reply leg, which are still supported here.

I do still think we should add this feature eventually. The downside of not retrying Prepares is that if a request fails due to a transient error, the sender must wait the full timeout before they can retry it. For retail payments, the payment will take an additional 30 seconds (since holds prevent unsafe amounts in-flight). Also, for example, the pay STREAM implementation would terminate the payment because it took too long -- which I believe is the right behavior.

kincaidoneil commented 3 years ago

@matdehaast For backwards compatibility, if the receiver doesn't support async, they just handle the request synchronously. The sender can identify which mode the receiver chose based on the status code of the response (200 vs 202) or whether the body is non-empty.
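
In other words, the sender-side check could be as simple as this sketch (the 200/202 distinction comes from the proposal; everything else is illustrative):

```typescript
async function interpretResponse(res: Response): Promise<Uint8Array | null> {
  const body = new Uint8Array(await res.arrayBuffer());
  if (res.status === 200 && body.length > 0) {
    return body; // synchronous mode: the body is the Fulfill/Reject packet
  }
  return null; // async mode (e.g. 202 Accepted): the reply arrives in a later request
}
```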