Design for unreliable communication (HTTPS)

esear commented 11 years ago

I wanted to float an idea for how to design for unreliable communications. Here are the conditions we need to design for:

REQUEST message is lost between client and server
RESPONSE is lost between server and client

In either instance the client needs to be able to duplicate the request without creating a duplicate transaction on our side.

The design pattern I propose is to leverage idempotent calls (where appropriate) with unique IDs. For example, use a /transaction service to set up every interaction with Mercury Web Services. (Note: I am not sure this name will work, because Travis wants to use transactions for "orders".)

To initiate a /payment, /gift, or /loyalty action, the client would first call a /transaction service to get a transaction ID (MWS transID).

Request: PUT /transaction with the client providing a unique transaction ID from their system
Response: MWS transID

If this request fails, the client times out and simply makes another PUT, and we return a new MWS transID (the server never saw the first attempt).
If this response fails, the client times out and makes another PUT, but we return the same MWS transID as for the original request, because we can match it to the POS's unique transaction ID.
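The lookup described above can be sketched as follows. This is only a minimal in-memory illustration, not a proposed implementation: `TransactionRegistry`, `put_transaction`, and the use of a random UUID as the MWS transID are all hypothetical names and choices made for the example.

```python
import uuid


class TransactionRegistry:
    """Hypothetical server-side store mapping POS transaction IDs to MWS transIDs."""

    def __init__(self):
        self._by_pos_id = {}

    def put_transaction(self, pos_transaction_id: str) -> str:
        # First time we see this POS ID: mint a new MWS transID.
        # Any repeat of the same PUT returns the transID minted earlier,
        # so a retried request never creates a second transaction.
        if pos_transaction_id not in self._by_pos_id:
            self._by_pos_id[pos_transaction_id] = uuid.uuid4().hex
        return self._by_pos_id[pos_transaction_id]


registry = TransactionRegistry()
first = registry.put_transaction("pos-0001")
retried = registry.put_transaction("pos-0001")  # response was lost, client retries
other = registry.put_transaction("pos-0002")
```

Here `first` and `retried` are the same transID, while `other` differs. A real service would need a persistent store and a rule for how long POS IDs are remembered.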

On future API calls that are part of the same sequence/workflow (for example, completing a credit sale) the client includes the transID in their requests. This way we can identify duplicate requests.

For example, a /payment/sale workflow would look something like this:

PUT /transaction providing a POS transaction ID
Response from MWS provides a Mercury transID
PUT /payment/sale with sale object associated to MWS transID
    If the request fails, the client simply tries again (no record on server).
Response from MWS provides SUCCESS message (200) plus batchID associated to sale (or something like that)
    If the response fails, the client tries again and, because of the transID, the server sees it is a duplicate and returns the same response as the first call.
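As a rough sketch of the duplicate handling in this workflow, the server could remember the response it produced for each transID and replay it on a repeat. Everything here is an assumption for illustration: the `SaleService` class, the `put_sale` method, the response shape, and the sequential `batchID` are hypothetical.

```python
import itertools


class SaleService:
    """Hypothetical handler for PUT /payment/sale, keyed by MWS transID."""

    def __init__(self):
        self.sales = {}        # transID -> sale object actually recorded
        self._responses = {}   # transID -> response sent the first time
        self._batch_ids = itertools.count(1)

    def put_sale(self, trans_id: str, sale: dict) -> dict:
        # Duplicate request: return the stored response, record nothing new.
        if trans_id in self._responses:
            return self._responses[trans_id]
        # First request for this transID: record the sale and remember the reply.
        self.sales[trans_id] = sale
        response = {"status": 200, "batchID": next(self._batch_ids)}
        self._responses[trans_id] = response
        return response


service = SaleService()
original = service.put_sale("mws-123", {"amount": "12.50"})
duplicate = service.put_sale("mws-123", {"amount": "12.50"})  # client retry
```

Only one sale is recorded, and `duplicate` carries the same `batchID` as `original`.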

mozvat commented 11 years ago

Good topic. It could touch upon another API tenet (failover?) and a POS workflow discussion, which we are slowly discovering with the prototype. But we are still initially defining nomenclature and object modeling ('Transaction', etc.).

Maybe we can first identify some SLAs surrounding this topic, though. What is the SLA for failed requests, or even a generic SLA for transaction response time? After identifying the SLAs we can come up with a 'how'.

The other topic that this touches on is changing the number of times a POS dev is required to touch the Mercury Platform. Making a unique call to get an initial 'PaymentID' or TransactionID is a pattern we put in place for Hosted Checkout Web Service, but this was for the reason of mitigating MITM attacks.

I don't yet fully understand the value of doing this for a credit transaction other than MITM attacks. I really don't see a 'problem' felt today that this is solving. We don't really have communication reliability problems, and the workflow today enables this process fairly well...

The other concern is that there is an 'understood' workflow that is accepted within the payments industry. If we want to change this workflow, we have to be careful, as we will be taking on another challenge. I think understanding why we don't do this today is good fundamental information.

I would like to see some research on others using (or not using) this pattern for reasons other than MITM attacks.

Let's discuss this with the group and Bill when we cross this bridge; it is a good topic.

esear commented 11 years ago

Excellent. Yes, my intent for posting it here was to have the discussion with you, the group and Bill. The value of doing this is that REST uses HTTP(S), which is inherently unreliable from the client's point of view: a request or its response can be lost, and on a timeout the client cannot tell which. As a result you need to design your system to account for the loss of either one. You can do this if you can create idempotent calls (make the same call more than once and not end up with two instances of the "resource").
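To make the retry argument concrete, here is a small simulation, with no real networking, of a response being lost after the server has already processed the request. All names (`FlakyChannel`, `send_with_retry`, `idempotent_handler`) are hypothetical, and `TimeoutError` stands in for whatever timeout the client's HTTP library would raise.

```python
resources = {}  # stands in for server-side state, keyed by request ID


def idempotent_handler(request_id, payload):
    # Creating the same resource twice leaves exactly one instance.
    resources.setdefault(request_id, payload)
    return {"status": 200, "id": request_id}


class FlakyChannel:
    """Simulated link that loses the first `drop_responses` responses."""

    def __init__(self, handler, drop_responses):
        self._handler = handler
        self._remaining_drops = drop_responses

    def send(self, request_id, payload):
        response = self._handler(request_id, payload)  # server does the work
        if self._remaining_drops > 0:
            self._remaining_drops -= 1
            raise TimeoutError("response lost in transit")
        return response


def send_with_retry(send, request_id, payload, attempts=3):
    # Re-send the identical request (same ID) until a response arrives.
    # `attempts` is assumed to be at least 1.
    last_error = None
    for _ in range(attempts):
        try:
            return send(request_id, payload)
        except TimeoutError as exc:
            last_error = exc
    raise last_error


channel = FlakyChannel(idempotent_handler, drop_responses=1)
result = send_with_retry(channel.send, "req-1", {"amount": "5.00"})
```

The server handles the request twice, but because the handler is idempotent only one resource exists afterwards.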

Yeah, I hadn't thought of this approach also protecting against MITM and replay attacks - but good point. I think you are also protecting against MITM by using SSL. You also protect against replay events by validating the timestamp of the request.
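A timestamp check of that kind might look like the sketch below. The five-minute window and the function name are assumptions made for the example, not anything agreed in this thread. Such a check only helps if the timestamp itself is integrity-protected (for example, covered by a request signature).

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(minutes=5)  # assumed tolerance, not a decided value


def is_timestamp_acceptable(request_time, now=None, max_age=MAX_AGE):
    """Accept a request only if its timestamp is within max_age of server time."""
    if now is None:
        now = datetime.now(timezone.utc)
    # abs() also rejects timestamps too far in the future (clock skew or forgery).
    return abs(now - request_time) <= max_age


server_now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
fresh = is_timestamp_acceptable(server_now - timedelta(minutes=1), now=server_now)
stale = is_timestamp_acceptable(server_now - timedelta(minutes=10), now=server_now)
```

With these inputs `fresh` is accepted and `stale` is rejected. A replay inside the window would still need the transID duplicate check to catch it.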

Failover: should happen on the server (Mercury) side and be automatic based on unavailability of primary. Agreed, I think we (I, product) need to identify the SLAs.

mozvat / WSPAPIPrototype

Design for unreliable communication (HTTPS) #26