ietf-wg-httpapi / idempotency

Repository for "The Idempotency-Key HTTP Header Field"

Clarify whether the idempotency service should record requests for which the backing service returns error #41

Open jmileham opened 3 months ago

jmileham commented 3 months ago

Per this comment, beginning a new issue to track this discussion.

@mikekistler's reading of the spec is that the idempotency key should be recorded for all errors on the service side.

Mike proposed a clarification following @richsalz's prompt.

@richsalz noted there could still be ambiguity

This issue can hopefully be used by contributors to pin down any desired language changes to clarify the spec. I'll also follow up in a standalone comment below advocating for why, in my team's experience, it would be valuable for the spec to recommend not recording an idempotency key when the resource server returns 4xx.

jmileham commented 3 months ago

Case for not recording idempotency keys when the resource server returns 4xx:

The basic principle driving this case is distributed-system eventual consistency and self-healing. If a server caches a client error, subsequent retries of the request can never succeed. Client errors may be due to ephemeral factors. In an eventually consistent system, a necessary privilege might not have propagated yet by the time a request comes through. Or, in a financial context, the balance needed to complete a transaction might not yet be available.

Well-designed stateful distributed systems typically have durable job runners with automatic retry, to ensure that unsuccessful operations are eventually completed successfully while avoiding unintentional denial of service on dependency services via exponential backoff. Caching client errors would require the client to take on a new concern in order to automatically achieve eventual consistency, even after an ephemeral client error has resolved: upon receiving a 4xx response, the client would have to re-enqueue an equivalent job with a fresh idempotency key in order to succeed. This would in turn likely interfere with the exponential backoff semantics of a job runner and require manual/framework intervention.

Conversely, if we can take a 4xx error to mean that no side effect took place, the idempotency service can safely decline to persist the idempotency key (or delete a provisionally persisted one). This should be semantically safe and takes cognitive/framework load off the client.
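To make the proposal concrete, here's a minimal sketch of that behavior (Python; `store` and `proxy_to_origin` are hypothetical interfaces, not anything from the draft or a real library):

```python
# Hypothetical sketch of an intermediary idempotency service that does NOT
# record 4xx responses, per the proposal above. Assumptions: `store` maps
# idempotency keys to responses; `proxy_to_origin` forwards the request and
# returns an object with a `.status` attribute.

def handle_request(store, proxy_to_origin, key, request):
    cached = store.get(key)
    if cached is not None:
        return cached  # replay the recorded response for this key

    response = proxy_to_origin(request)
    if 400 <= response.status < 500:
        # Client error: assumed side-effect-free, so leave no record.
        # A later retry with the SAME key can then still succeed.
        return response

    # 2xx, and ambiguous 5xx, responses are recorded so the request is
    # never re-executed under the same key.
    store.put(key, response)
    return response
```

Under this sketch, a 404 followed by a retry can still reach the origin, while a 201 or a 504 is pinned to the key forever.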

jmileham commented 3 months ago

Also, to a lesser extent, it might make sense to consider resource server 500 errors as side-effect-free and potentially ephemeral/retryable (e.g. due to a bug or server configuration issue that could be resolved), though due to the nature of 500 errors, there's no affirmative indication that a side effect did not take place before the error. It would get even dicier to extend such a view over the whole 5xx space, which includes things like 504 Gateway Timeout, which seems squarely in "who knows what happened" territory and couldn't be safely retried. But I think 4xx is a pretty uncontroversial place to start.

asbjornu commented 3 months ago

Having worked in fintech on payment systems for a few years, the semantics we wanted for idempotent requests were:

  1. Requests should be repeatable without causing duplicate transactions,
  2. All requests should be safely repeatable until a final response is received. A "final" response would typically be anything in the 2xx range, plus 409 Conflict (in case the request had already been received and successfully processed without the client receiving the response).
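The client behavior described in point 2 could be sketched roughly like this (Python; `send` is a hypothetical transport function returning a status code, not a real API):

```python
import time

# Hypothetical client retry loop for the semantics above: repeat the request
# (carrying the same Idempotency-Key) until a "final" response is received,
# taken here to mean any 2xx or 409 Conflict.

def request_until_final(send, request, max_attempts=5, backoff=1.0):
    for attempt in range(max_attempts):
        status = send(request)
        if 200 <= status < 300 or status == 409:
            return status  # final: stop retrying
        # Transient response: back off exponentially and try again.
        time.sleep(backoff * (2 ** attempt))
    raise TimeoutError("no final response within retry budget")
```

Note this only works as intended if transient responses aren't pinned by the idempotency service, which is exactly the point under debate in this issue.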

If https://github.com/ietf-wg-httpapi/idempotency/issues/31#issuecomment-2018295187 results in something like 504 to be recorded as the final response to a request, what is the client expected to do? As a client developer, I would not be confident that a transient error is a guarantee that the original request is not processed, and repeating the request would make me none the wiser.

jmileham commented 3 months ago

Requests should be repeatable without causing duplicate transactions,

Agree! I think we all want this quality. :)

If https://github.com/ietf-wg-httpapi/idempotency/issues/31#issuecomment-2018295187 results in something like 504 to be recorded as the final response to a request, what is the client expected to do? As a client developer, I would not be confident that a transient error is a guarantee that the original request is not processed, and repeating the request would make me none the wiser.

I believe the Stripe-style contract currently in the spec, and proposed to be clarified further, is that the idempotency server records the idempotency key and replays the initial response on subsequent requests regardless of the resource server's response code, including all 2xx-5xx responses. Reconciling any "non-final" responses in the vein you're describing, including 504, would be an exercise left to the client.

My pitch doesn't actually differ from the current spec or proposed clarification with regard to 5xx handling (though I did go on to make a half-hearted case for 500 being special, but it's not on very solid ground). In fact 504 for a mutative endpoint is sort of the poster child for "some kind of out-of-band reconciliation is probably necessary here if we want to ensure this happens once but don't want to double-submit."

So to clarify, I'm just proposing that a 4xx response should not be cached, because no request resulting in a client error should have side effects, making it safe to retry. Neither the existing spec nor my proposed change provides leverage on what to do in response to 504, but both ensure that the request is not processed twice with the same idempotency key.

asbjornu commented 3 months ago

So to clarify, I'm just proposing that a 4xx response should not be cached…

I think that's fine, but not caching 5xx responses is more important, IMHO.

Neither the existing spec nor my proposed change will provide leverage on what to do in response to 504, but they'd both ensure that the request not be processed twice with the same idempotency key.

That's the exact kind of safety I think the spec needs to be clear about. Any 5xx response, timeout, or other lack of a clear response from the server that the request has been received and processed can be assumed to be unreceived, and thus safe to repeat.

jmileham commented 3 months ago

I'm confused. In my mental model, making 504s not cached by the idempotency service would make the system unsafe against double transactions.

In my mental model there are three parties to the distributed system: the client, the intermediary idempotency service, and the resource (origin) server.

Here's my chain of reasoning: when the idempotency service receives a 5xx from the resource server (or no response at all), the side effect may or may not have occurred, so the only safe behavior is to record the idempotency key and never re-execute the request under that key.

This safety is provided for all 5xx response codes in both variants of the spec. My pitch proposes removing the caching/idempotency key persistence only for 4xx response codes because they should be inherently side-effect-free, and it enables automatic self-healing under ephemeral client errors.

Does that make sense?

jmileham commented 3 months ago

The dissonance we're experiencing here may arise from an assumption that the idempotency service is transactional with the underlying resource service. I just opened another issue #43 to discuss whether/how the spec should grapple with a transactional variant of the implementation vs. a standalone-service variant, because a transactional idempotency service in a well-designed application can avoid caching ambiguous outcomes without risking double transacting, which may be what you're looking for, @asbjornu?

asbjornu commented 3 months ago

Thank you for making that distinction explicit, @jmileham. I've made my comments over in #43. Regarding the title of this issue, I would vote strongly against the recording of transient errors. If the Idempotency service has no application-level domain knowledge of what constitutes a transient vs. final response status code, I don't see how it can provide a guarantee against repeated requests. See the below sequence diagram for an example:

Sequence diagram of a generic Idempotency service

This is only one of many possible scenarios where the recording of transient failures, due to the generic nature of the Idempotency Service, leads to an ambiguous, unresolvable state. I don't see how this can be reconciled in the specification without explicitly specifying one of the following:

  1. Which status codes are considered final.
  2. A header that signifies whether a response is final.
  3. That only successful status codes can be recorded.

However, the generic nature of the Idempotency Service would still pose issues if it abstracts the Idempotency-Key away from the origin server as a "convenience", making it impossible for the origin server to detect duplicate retries when the Idempotency Service cannot, due to its lack of application-level domain knowledge.

jmileham commented 3 months ago

Thanks for diagramming that out!

To be clear, I agree that there is no generic frameworked solution to ensuring exactly-once execution in the non-transactional idempotency server case. The only options are at-most-once execution or at-least-once execution. And I propose that an idempotency service that is incapable of exactly-once execution semantics (which we agree describes a non-transactional intermediary idempotency service) must choose at-most-once execution semantics to be useful. The example you diagrammed appears to show the client participating in a protocol that converts an at-most-once-execution idempotency service into an at-least-once-execution outcome, which obviously violates that guarantee, and I wouldn't advise it.

It is, however, possible for a domain-aware client (at the application layer) to work with a non-transactional intermediary idempotency service to resolve the ambiguity, assuming it can determine out of band (e.g. via another API call to the resource service) whether the transaction it initially requested completed successfully, and then decide whether to try again with a new idempotency key.

This is what Stripe is relying on. My hypothesis is that the bulk of their clients (customers) operate in batch and have semi-manual, or iteratively developed, bespoke resolution paths if they care deeply about exactly-once delivery. I don't love the tradeoffs either, but it's the information-theoretic limit if you choose not to have a transactional idempotency service.

My basic point in this issue is that you can reduce the scope of these cached unknowns if you decide that requests returning 4xx are reliably side-effect-free. But you can't eliminate unknowns by naive frameworking. You'll merely end up converting the system to at-least-once execution, or implementing a 2-phase-commit protocol (which is also not guaranteed to resolve under network partition).

Therefore I'm a proponent of not caching 4xx for an intermediary service to increase the self healing properties of the system.

Does that check out?

To bring it back to your newly edited post, the problem cannot be fully reconciled by any of the three options you presented, either. You must be willing to cache some non-final states in order to operate a non-transactional idempotency service that provides any safety against double transacting.

Your final paragraph implies that the resource server might have internal idempotency capabilities based on reading passed-through idempotency keys, which would definitely solve the problem, but would also obviate the intermediary idempotency service. Having the resource server operate the idempotency service (co-transactionally with the side effects) would be my preference as well. That is what my team has implemented for our own use case, and we do not cache non-final states. I'm simply offering the thoughts I have on this thanks to having gone all the way down this rabbit hole recently.

awwright commented 3 months ago

I see a very simple solution here: The reason you use Idempotency-Key is to prevent a second request from changing the server state again, and an error didn't change the server state to begin with. So Idempotency-Key should not apply to any errors, because they do not change the server state.

Or to put it another way: There are a subset of POST requests that are idempotent, including all of the ones that respond 4xx/5xx.

This would solve the vast majority of the problems associated with a gateway "idempotency service", but not all of them: if the response from the server is interrupted mid-message, then there's no way to recover. Idempotency-Key is best understood as an application header, the same way HTTP is an application protocol.

jmileham commented 3 months ago

@awwright it appears you are stating an oversimplification of what I proposed at the top of the thread as a satisfactory solution. But, as discussed above, the oversimplification isn't sound.

So Idempotency-Key should not apply to any errors, because they do not change the server state.

Or to put it another way: There are a subset of POST requests that are idempotent, including all of the ones that respond 4xx/5xx.

Thinking it through, you will see that 504 Gateway Timeout especially, but practically speaking also 500, cannot be relied upon not to represent a change in resource server state (e.g. for 500, the write may have succeeded but there was a bug in the response renderer). Depending on how much you trust resource service authors to comply with HTTP status code semantics, you could probably consider much of the rest of the currently specified 5xx space to be side-effect-free, but you can't safely include 500 or 504 in that list.

The set of non-success response codes is not the same as the set of reliably side-effect-free response codes. The difference between these sets is what I refer to as ambiguous response codes. An actual network timeout received by the idempotency service itself when making the request to the resource service is also an ambiguous response. A non-transactional intermediary idempotency service must record ambiguous responses and not retry in order to uphold the only guarantee it can offer: at-most-once execution.
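The three sets distinguished in this argument could be sketched as a classification function (Python; the category names and the treatment of all 4xx as side-effect-free are this comment's argument, not spec text):

```python
def classify(status):
    """Hypothetical classification of origin status codes, per the
    argument above. Not normative; 4xx-as-side-effect-free is the
    proposal under discussion, not something the draft guarantees."""
    if 200 <= status < 300:
        return "success"           # record the key and replay the response
    if 400 <= status < 500:
        return "side_effect_free"  # safe not to record; retryable as-is
    # 5xx, including 500 (the write may have happened before the error
    # surfaced) and 504 (the origin's outcome is simply unknown), is
    # ambiguous: the service must record it and never re-execute under
    # the same key, preserving at-most-once execution.
    return "ambiguous"
```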

awwright commented 3 months ago

A 504 (Gateway Timeout) is the same thing as not receiving any response at all, so the same advice applies: don't save the response, though if the request was made with Idempotency-Key, it is safe to retry.

As I pointed out, a gateway service can help reduce errors, particularly if it's very close to the origin, "but not all of them". This is impossible to solve without application integration.

As for bugs in the server: this is by definition a violation of the specification, about which all we can say is "don't do that".

asbjornu commented 3 months ago

though if the request was made with Idempotency-Key, it is safe to retry.

While a retry is safe, it only provides value if an erroneous response isn't saved. It has been suggested to always persist the response, regardless of its status code. That effectively makes retries entirely useless – and as such, a generic Idempotency Service rather superfluous.

My suggestion is therefore to never persist a response unless it's successful.

This is impossible to solve without application integration.

We don't need a very deep integration if we came up with a response header that indicates successful processing of the provided Idempotency-Key. Something like Idempotency-Key-Processed: <key> (a bit long, perhaps we should look into the naming of both headers if we go down this path). Encountering such a header would instruct the Idempotency Service to persist the response and always return it for the same key.
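A rough sketch of what a gateway honoring such a header might do (Python; the header name is tentative per the comment above, and `store`/`proxy` are hypothetical interfaces):

```python
# Hypothetical gateway behavior for the proposed Idempotency-Key-Processed
# header: persist a response only when the origin affirmatively signals
# that it processed the key. Everything here is a sketch of the proposal,
# not of the current draft, which records responses unconditionally.

def gateway_handle(store, proxy, key, request):
    cached = store.get(key)
    if cached is not None:
        return cached  # replay the persisted final response

    response = proxy(request)
    # Only an affirmative echo from the origin makes the response final;
    # anything else is left unrecorded and therefore retryable.
    if response.headers.get("Idempotency-Key-Processed") == key:
        store.put(key, response)
    return response
```

Note the tradeoff debated later in this thread: under this scheme an unrecorded ambiguous response (e.g. a lost reply after a successful write) would be retried, trading at-most-once safety for retryability.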

As for bugs in the server—this is by definition a violation of the specification, which all we can say is "don't do that".

Sure, but given HTTP's layered architecture, we can't rely on just the origin and Idempotency Service to operate according to spec; we need every intermediary to do so as well.

It's not uncommon for such intermediaries to do crazy things like HTTP redirects to an error.html page served with 200 OK for all upstream error codes, for instance. Although that's an obvious misconfiguration of said intermediary, not all application firewalls and load balancers are set up and configured by members of the HTTP WG.

Stuff like this happens all the time, and for a header that is sold as a protection mechanism, I think we would do good to reduce the impact of such erroneous configurations as much as possible.

jmileham commented 3 months ago

@awwright

A 504 (Gateway Timeout) is the same thing as not receiving any response at all, so the same advice applies: don't save the response, though if the request was made with Idempotency-Key, it is safe to retry.

This is the logical opposite of correct. Both the idempotency service receiving 504 from the resource service and the idempotency service not receiving a response at all in response to a mutative request means it is not safe to retry. The side effect may have happened. I think we need to all get clear on this point in order to continue this discussion, otherwise we'll just spin around in circles.

jmileham commented 3 months ago

Also to be super clear about my meaning, because I think there's a bit of fast and loose interpretation going on here:

Both the idempotency service receiving 504 from the resource service and the idempotency service not receiving a response at all in response to a mutative request means it is not safe to retry.

It "means it is not safe" for the idempotency service to retry the request against the origin server, which means that the only correct behavior for the idempotency service would be to record the idempotency key and respond with (according to the spec) a cached 504 to the end user.

That in turn means that, yes, there is no way without additional resource-server-domain-aware reconciliation software, to cause the side effect to reliably happen even once with one idempotency key. That is an information theoretical limit. It is meh behavior, but them's the breaks.

It would fundamentally break the safety of the system if the idempotency server didn't cache 504s.

awwright commented 3 months ago

That effectively makes retries entirely useless – and as such, a generic Idempotency Service rather superfluous.

Exactly.

Something like Idempotency-Key-Processed: <key> (a bit long, perhaps we should look into the naming of both headers if we go down this path).

I like this idea. This could serve a purpose similar to the Content-Location header. While this header could also be named Idempotency-Key (so the server mirrors back the header, if it was honored), I don't really like this pattern, as I think field names should be asymmetrical if their meaning is asymmetrical. Perhaps Idempotency-Key-Applied (mirroring Preference-Applied)?

Sure, but given HTTP's layered architecture, we can't rely on just the origin and Idempotency Service to operate according to spec, we also need every intermediary to do it as well.

I broadly agree, implementing a feature shouldn't exacerbate any existing bugs with other parties. Though to be clear on one related point, this doesn't eliminate the problem either, as Idempotency-Key can't (by itself) be a guarantee the request won't be executed twice—it's just a request for that behavior. (That would need a new method.) However, within the scope of what HTTP headers are capable of, yes, I think storing anything but a success will introduce new classes of errors and error handling that didn't exist before, and that is no good.

jmileham commented 3 months ago

Before y'all get too far ahead of things with the agreement can we address the core logic error that has pervaded both of your posts on this thread?

awwright commented 3 months ago

Before y'all get too far ahead of things with the agreement can we address the core logic error that has pervaded both of your posts on this thread?

Could you be more specific? The answer to the question in the title is an unqualified "no."

It "means it is not safe" for the idempotency service to retry the request against the origin server, which means that the only correct behavior for the idempotency service would be to record the idempotency key and respond with (according to the spec) a cached 504 to the end user.

Yes, I acknowledged this. I mentioned that your idempotency service can solve many errors, but not all of them, and that if the response from the server is interrupted mid-message, then there's no way to recover. We may be talking about two different situations, let me elaborate.

First consider the behavior with an idempotency caching gateway, which does not add idempotency but merely caches responses. With a cache, the origin (ideally) never sees the same idempotency key twice, as long as the origin responded 2xx. The only purpose is to reduce load on the origin; it doesn't change the behavior of the application. With this behavior, you should definitely never cache a 504 response, or most errors; these are handled by the origin server, which also implements Idempotency-Key. With application-specific knowledge, or with knowledge of HTTP caching rules, it might be possible to cache some errors, like 404 Not Found. But in general, Idempotency-Key does not add the ability to cache errors (beyond what HTTP caching permits), only 2xx; any other behavior would be outside the scope of a cache.

Now consider your application-level idempotency service gateway, which attempts to add support for Idempotency-Key in as many situations as it can, even without support by the origin. In short, you cannot make this safe. You're saying that if the response from the origin is lost before the service, then we may have a mutated state but no confirmation of the change and no response body. I acknowledged this: "if the response from the [origin] server is interrupted mid-message, then there's no way to recover"

The first way you could recover from this is you read the state of the server, determine where you left off, and resume from there. This is the traditional way to recover from a break, but this can be very complicated. Presumably, the point of Idempotency-Key is to not have to worry about this situation, and to make this recovery logic generic instead of application specific.

Second, suppose you have no way of recovering because this information isn't available to the client. Suppose you use a POST request to count every time a door opens, but no other information is provided. Even if the server makes this log available to the user-agent, how do you know which log item was yours, assuming the server received it? There is no way to recover from these situations. (Now, I would suggest this is bad API design; even if you lack Idempotency-Key, you should still implement a field for unique user data in your POST body, though this will be application specific instead of generic.) Saying we permanently can never send that request because we're not sure if the server received it is not really an option.

Now the idempotency service gateway is not without value. It can still offer all of the functions of an idempotency cache. It can also potentially rectify communication errors between the user-agent and the service gateway, which is probably the vast majority of errors (i.e. deploying an idempotency cache gateway in front of your application will provide some value even if your application doesn't recognize Idempotency-Key at all!). But the same potential dangers of POST without Idempotency-Key are not completely eliminated.

There is one more thing. Most Web browsers these days will automatically retry POST if they don't get a response from the server. So if the service gateway automatically retries a request from a Web browser, even to an origin server known not to support Idempotency-Key, this probably won't add any additional problems that the Web browser isn't already causing.

Further, most people will probably not use an idempotency service gateway without configuration. I think a more typical usage will be a service configured to rewrite the request body to include the Idempotency-Key in a field in the body which is known to cause a conflict (e.g. a UNIQUE database column for user data).

(Also, I'm of the opinion that caching responses should be a feature of HTTP caching proper, not specific to Idempotency-Key; and that 412 Precondition Failed or maybe 409 Conflict is the only appropriate response if the operation has previously executed and the response is not cachable. But for the sake of argument, let's assume all these POST responses are cachable somehow.)

jmileham commented 3 months ago

I'm going to stop here for now:

First consider the behavior with an idempotency caching gateway, which does not add idempotency but merely caches responses. With a cache, the origin (ideally) never sees the same idempotency key twice, as long as the origin responded 2xx. The only purpose is to reduce load on the origin.

I'd encourage you to read the idempotency draft in detail, and also read the documentation on the stripe idempotency implementation upon which it is based. Nobody besides you two folks on this thread has implemented or argued for an "idempotency service" that offers no guarantee of safety. You cannot meaningfully engage in a nuanced discussion of this topic until you have command of the material of the idempotency draft, its motivation, and pragmatically developed history in industry.

It's unproductive to engage in a discussion of alternative approaches that don't satisfy the needs served by the real systems that this draft documents.

awwright commented 3 months ago

Nobody besides you two folks on this thread has implemented or argued for an "idempotency service" that offers no guarantee of safety.

Who else is arguing your point? I'm perfectly familiar with the document and several concrete implementations; the question is about how this generalizes. If the header is going to function in any application, we are pointing out that an "idempotency service" cannot guarantee safety, this feature requires application-level integration; as this is a variant of the Byzantine Generals Problem.

Either the server sometimes receives the same message again (which is why Idempotency-Key is being proposed in the first place), or, important messages may become lost and can never be retransmitted, which is not acceptable in many applications. Imagine if two databases are replicating and the second never receives a message because the first can't be sure the first attempt wasn't received. That's tantamount to data loss; so not only is your proposal unsafe, but it creates a new class of errors that must be handled.

jmileham commented 3 months ago

Your personal definition of safety conflates safety with solving generalized distributed systems synchronization, which nobody would suggest this or any other protocol does. You then make the logical leap that one might as well relax the very real safety offered because it can't provide a guaranteed ability to advance, creating a strawman of the spec and existing implementations which you then propose alternative solutions for. We will not be able to get anywhere if you are not able to be rigorous in your reasoning.

jmileham commented 3 months ago

Hint: the inability of a non-domain-aware client to advance the consensus automatically under an ambiguous response in the non-transactional model is a feature, not a bug. It's not perfect, but it's effective. I don't even implement the non-transactional model in my work, but the safety guarantee it does provide is sound.

awwright commented 3 months ago

Your personal definition of safety conflates safety with solving generalized distributed systems synchronization, which nobody would suggest this or any other protocol does.

No, that's not the point I'm making. There's the Byzantine Generals Problem, where two parties cannot coordinate action over an unreliable communication channel. But there are slight variations you can coordinate, e.g. "If I send you $400, will you send me a gift?" This class of action can be coordinated over any minimally reliable communication channel, and this is the sort that Idempotency-Key uses.

Could you please comment on my example? I'm not sure you understood the point I was making. A replicated database is a perfectly reasonable use case for Idempotency-Key. As a more concrete example, suppose I have a physical door that keeps an access/usage log, by making a POST request to record usage to a central database. What happens when the "idempotency service" loses the network connection with the origin in the middle of the request/response cycle? You said "resolving the unknown state of the request is impossible". This is data loss.

jmileham commented 3 months ago

If you reread carefully, I described the very property of the non-transactional idempotency service not being able to resolve in the case of an ambiguous response multiple times myself, before you even arrived on this thread, so I very much understand that property of the system. I even described it in issue #43, which I set up to track the discussion that you've threadjacked this issue to be about, and described the issue there as well.

In your engagement on this thread, you've misunderstood, threadjacked, made false statements, explained properties already described as though you were introducing new insight, and jumped ahead of understanding to attempt solutioning from that intellectually precarious posture. You've avoided engaging with the examples I've provided to substantiate my points and provided your own examples, sometimes with different properties, and then expected me to engage with your framings. You've not earned the standing in this conversation to expect me to answer questions about your examples. If you seek understanding and consensus, consider attempting to understand your conversation partner enough to ask questions about theirs instead. I stand by the logical coherence of everything I've said in this thread and across all issues on this repo.

I will reiterate that nobody said that this protocol is sufficient to allow automatic advancement. But that is not tantamount to data loss. It would only be data loss if the party responsible for resolving the unresolved outcome (the human or system operating the client) decides to forget that it's unresolved and decides not to reconcile it out of band (it's not hard! Find out if the thing you requested happened via an out-of-band origin GET and cycle the idempotency key if it didn't). In the case where the client is retrying a durable job, the failing job itself serves as a ticket that the state is unresolved. That job code can be augmented to include reconciliation logic, and the whole class of failing job will self heal on next run.
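The out-of-band reconciliation described above could be sketched like this (Python; `get_resource` and `post_with_key` are hypothetical client functions standing in for an origin GET and a keyed POST):

```python
import uuid

# Hypothetical reconciliation step for a durable job whose previous attempt
# is pinned to an ambiguous cached outcome: find out via an out-of-band GET
# whether the side effect happened, and only if it did NOT, cycle to a
# fresh idempotency key and retry.

def reconcile_and_retry(get_resource, post_with_key, request, old_key):
    if get_resource(request) is not None:
        return "already-applied"  # the original attempt succeeded after all

    # Confirmed not applied: safe to retry under a NEW key, since the old
    # key's ambiguous outcome remains recorded by the idempotency service.
    new_key = str(uuid.uuid4())
    return post_with_key(request, new_key)
```

As the comment notes, when this logic lives inside the failing durable job itself, the whole class of stuck jobs self-heals on the next run.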

Stripe's API idempotency service behaves exactly this way and has for years: After they perform whatever non-domain-specific request validation they can at the idempotency server, they store the idempotency key, proxy the request to the origin server and cache every origin server response, regardless of its response code.

If true, your argument would dictate that it is impossible for a client to use Stripe's API without incurring data loss, and that's not correct.

handrews commented 3 months ago

@awwright does some of your example fit better in #43, and if so, what points remain that do not depend on that standalone idea? It seems that you and @jmileham are talking past each other (although I can't quite unravel the how of it), and maybe considering that as a separate topic will help narrow it down?

I also feel like there may be a bit of a gap between some of your more theory-problem references (Byzantine Generals) and @jmileham's assertion that lack of data loss in Stripe's API proves that there are no problems. I may be crossing two argument streams with those examples, but perhaps it would help to work backwards from the assertion that Stripe works and see if you can find a contradiction from that real-world example? (I know @awwright is probably laughing at me, of all people, advocating for practical examples 😅 )

@jmileham I've known (and had plenty of contentious but ultimately productive GitHub issue discussions with) @awwright for nearly a decade. While he often approaches topics from an unexpected perspective that can be a bit challenging to resolve, he's not the sort of bad-faith threadjacker that you seem to be alleging. I think a methodical approach of trying to isolate specific contradictions between your examples and Austin's more theoretical descriptions, while agreeing to chunk off anything that fits better in other issues to those issues, can resolve this on the merits.

jmileham commented 3 months ago

Understood, thanks for the push @handrews. Happy to engage differently and extend more benefit of the doubt. Imbuing the web with more safety so we can collectively move faster with fewer harmful outcomes is something I'm very passionate about, so I've maybe been pushing a little hard to make sure the useful insight at the heart of this effort (which I admit I'm showing up to very late) doesn't get short shrift. Even though I prefer the transactional approach to solving this problem server-side (as I gather @awwright does), the larger opportunity to get more safety rolled out iteratively and more broadly, to the point of eventual ubiquity, makes me very optimistic about what this draft represents.