hasura / 3factor-example

Canonical example of building a 3factor app : a food ordering application
https://3factor.app

Preventing cloud event recursion #29

Open tohagan opened 3 years ago

tohagan commented 3 years ago

Because the 3 factor pattern is inherently cyclic, you run the risk of inadvertently triggering an infinite event loop (DB -> Function -> DB ad infinitum). On auto-scaling cloud services, that kind of bug can get costly real fast, so I think it's a significant risk of this pattern. Apart from budget alerts, are there recommended methods for avoiding this in 3 factor apps?

tohagan commented 3 years ago

One solution comes to mind ...

tohagan commented 3 years ago

Event threading is an old trick but you might consider it worth documenting as part of 3 factor examples or perhaps even supporting in the framework.

Here's the backstory that you've probably already read.

tirumaraiselvan commented 3 years ago

Great question!

I can tell you how you can address this in Hasura GraphQL Engine.

With Hasura Event Triggers, what you can do is choose the type of operation (INSERT, UPDATE, DELETE) and also choose the "LISTEN" columns for updates: https://hasura.io/docs/1.0/graphql/core/event-triggers/create-trigger.html#listen-columns-for-update
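To make the listen-column idea concrete, here is a toy model in Python. It is not Hasura's implementation or API; the function name `should_fire_update` and the row/column names are made up for illustration. It only shows the rule that an UPDATE event fires when at least one listened column actually changes, so a function that writes back to a column outside the listen set does not re-trigger itself.

```python
from typing import Any, Iterable, Mapping


def should_fire_update(
    old_row: Mapping[str, Any],
    new_row: Mapping[str, Any],
    listen_columns: Iterable[str],
) -> bool:
    """Toy rule: fire only if some listened column differs between old and new row."""
    return any(old_row.get(col) != new_row.get(col) for col in listen_columns)


# Hypothetical "order" row; the trigger listens only on "status".
listen = ["status"]
before = {"id": 1, "status": "placed", "validated_at": None}
after_user_update = {"id": 1, "status": "paid", "validated_at": None}
after_function_write = {"id": 1, "status": "paid", "validated_at": "2021-01-01T00:00:00Z"}

# The user's change touches a listened column, so the function is invoked.
fires_on_user_update = should_fire_update(before, after_user_update, listen)

# The function's own write touches only "validated_at", so no new event fires.
fires_on_function_write = should_fire_update(after_user_update, after_function_write, listen)
```

This only breaks the cycle when the function's writes stay outside the listened columns; if the function updates `status` itself, the event fires again.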

We are also planning conditional triggers: https://github.com/hasura/graphql-engine/issues/1241 which will invoke a function only if some boolean expression (on the old or new row) is satisfied.

tohagan commented 3 years ago

Having conditional logic as you've described to filter triggers does not prevent this kind of bug. Adding a condition may help you fix it but it won't detect or prevent it. The problem is that just looking at the Event Triggers (and their conditions) it's not obvious that the loop even exists. That's because the programmer is only seeing the 1st part of the loop code. The 2nd part is buried in the logic of the serverless function that performs a database update that then fires a new Event trigger. This kind of distributed execution path is inherently non-obvious as the programmer can easily miss seeing the complete call graph because the "code" is split between the two systems. The programmer is also likely to only unit test the function logic (possibly even stubbing database updates) and thus may miss detecting the distributed event loop.

A safe solution needs to behave similarly to the way that a compiler/interpreter runtime or CPU hardware detects "stack overflow", except in our case the "calls" are distributed. To detect and prevent this you need a generic method (ideally supported by the framework) that computes the nested call depth between distributed execution threads. That's the solution I've proposed above. Identifying distributed threads has many other diagnostic benefits, especially for long-running (workflow) processes.
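A minimal sketch of what such a depth check could look like, in Python. This is one possible reading of the idea, not an existing Hasura or 3factor feature: `EventContext`, `EventDepthExceeded`, `thread_id`, `depth` and `max_depth` are all hypothetical names. It assumes each event carries a small context that the function copies onto any row it writes, so the next event in the chain can tell how deep it is.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


class EventDepthExceeded(RuntimeError):
    """Raised when a chain of events nests deeper than the allowed limit."""


@dataclass(frozen=True)
class EventContext:
    """Hypothetical metadata propagated DB -> function -> DB along one chain."""

    thread_id: str  # same for every event in the chain (useful for tracing)
    depth: int = 0  # how many function-caused writes precede this event

    @staticmethod
    def new_root() -> "EventContext":
        """Start a new chain, e.g. for a write that comes from a user action."""
        return EventContext(thread_id=uuid.uuid4().hex, depth=0)

    def child(self, max_depth: Optional[int] = 10) -> "EventContext":
        """Context for a write made while handling this event.

        Raises EventDepthExceeded when the new depth would pass max_depth.
        Passing max_depth=None disables the check for deliberate chains.
        """
        next_depth = self.depth + 1
        if max_depth is not None and next_depth > max_depth:
            raise EventDepthExceeded(
                f"thread {self.thread_id}: depth {next_depth} exceeds {max_depth}"
            )
        return EventContext(thread_id=self.thread_id, depth=next_depth)


def simulate_cycle(max_depth: int) -> int:
    """Simulate a function whose write always re-triggers it; return invocations."""
    ctx = EventContext.new_root()
    invocations = 0
    try:
        while True:
            invocations += 1                 # the function runs for this event
            ctx = ctx.child(max_depth)       # its write would fire the next event
    except EventDepthExceeded:
        return invocations                   # the runaway write was refused
```

With a limit in place the simulated loop stops after a bounded number of invocations instead of running until the budget alert fires, and the shared `thread_id` gives a handle for tracing the whole chain in logs.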

I can think of one case where an infinite distributed loop is ok. That's where we set up a chain of calls (events) with a computed delay performed by the serverless function between each event. Of course, commonly we'd use a scheduled trigger (cron) for this, but sometimes the delay intervals between these scheduled calls need to be computed each time with hand-crafted code. So to cater for this scenario, you'd need to ensure that the call depth check is optional.