A question about resiliency and retry.

gzp79 commented 4 months ago

If I understand correctly, chorus could be used to control complex interactions, "transactions" between microservices. How do you handle (or plan to handle) channel issues, retries and similar things? For example, using chorus for something like a payment where you have to update the balance of a user, reduce the number in stock, send email notifications, etc. There are so many places where this can break: a message is sent and processed locally, but the result never gets "broadcast" to the other parties due to some network error. Could/should the parties store the completed status and make each step idempotent on retry? How should this retry be implemented? Or is this crate not designed for handling such a "saga"? Thanks.

shumbo commented 3 months ago

Hi @gzp79, thanks for starting this discussion.

Unfortunately, handling faults is something choreographic programming is not good at (yet), and there are many things we'd want to implement in production that can't be expressed as a choreography.

It should be fairly easy to handle retries with a custom transport. That is, choreographies don't deal with retries, but the underlying transport will retry in case of a network error. It should also be possible to extend this idea to make requests idempotent and deal with duplicated messages: the sender transport can attach an idempotency key, and the receiver transport can check whether the request has already been processed (and if it has, respond with the saved response). This may be sufficient to tolerate occasional network omissions.
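To make the idea concrete, here is a minimal sketch of the retry and idempotency-key logic on its own, using only the standard library. It does not use or implement any ChoRus API: `Envelope`, `DedupReceiver`, `send_with_retry`, and `demo` are hypothetical names, and the `link` closure merely stands in for a network that loses the first reply.

```rust
use std::collections::HashMap;

/// Hypothetical message envelope: a payload plus an idempotency key
/// chosen once by the sender and reused on every retry.
#[derive(Clone, Debug)]
struct Envelope {
    key: u64,
    payload: String,
}

/// Hypothetical receiver side: remembers the response for every key
/// it has already processed.
struct DedupReceiver {
    seen: HashMap<u64, String>,
    handled: usize, // how many times the real handler actually ran
}

impl DedupReceiver {
    fn new() -> Self {
        DedupReceiver { seen: HashMap::new(), handled: 0 }
    }

    /// Process a message at most once per key; duplicates get the saved response.
    fn receive(&mut self, msg: &Envelope) -> String {
        if let Some(saved) = self.seen.get(&msg.key) {
            return saved.clone();
        }
        self.handled += 1;
        let response = format!("processed:{}", msg.payload);
        self.seen.insert(msg.key, response.clone());
        response
    }
}

/// Hypothetical sender side: resend the same envelope (same key) until the
/// link returns a reply or `max_attempts` is exhausted.
fn send_with_retry<F>(msg: &Envelope, max_attempts: u32, mut link: F) -> Result<String, String>
where
    F: FnMut(&Envelope) -> Result<String, String>,
{
    let mut last_err = String::from("no attempts made");
    for _ in 0..max_attempts {
        match link(msg) {
            Ok(reply) => return Ok(reply),
            Err(e) => last_err = e,
        }
    }
    Err(last_err)
}

/// Simulate a link that delivers the request but drops the first reply.
fn demo() -> (Result<String, String>, usize) {
    let mut receiver = DedupReceiver::new();
    let mut attempts = 0;
    let msg = Envelope { key: 42, payload: "debit 10".to_string() };
    let result = send_with_retry(&msg, 3, |m| {
        attempts += 1;
        let reply = receiver.receive(m); // the receiver always sees the request
        if attempts == 1 {
            Err("reply lost".to_string()) // ...but the first reply never arrives
        } else {
            Ok(reply)
        }
    });
    (result, receiver.handled)
}
```

In this simulation the receiver sees the request twice but runs its handler only once, so the retried "debit" is not applied a second time. A real transport would also need to persist the `seen` map and expire old keys, which this sketch leaves out.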

However, I understand that other things could happen. For example, we might want atomicity and abort the transaction if a node crashes. As far as I know, there is no good way to describe that kind of complicated error handling within a choreography. The choreographic programming research community is actively exploring different approaches to overcome those limitations, though.

gzp79 commented 3 months ago

Thanks. I implemented a similar saga, and idempotency was a critical part of it too. We never managed to cover all the edge cases, but in practice it worked quite well. I was just hoping there was a more sophisticated solution, but we'll have to wait for the researchers a bit longer 😄

lsd-ucsc / ChoRus

A question about resiliency and retry. #28