Preventing cloud event recursion

tohagan commented 3 years ago

Because the 3 factor pattern is inherently cyclic, you run the risk of inadvertently triggering an infinite event loop (DB -> Function -> DB ad infinitum). On auto-scaling cloud services, that kind of bug can get costly real fast, so I think it's a significant risk of this pattern. Apart from budget alerts, are there recommended methods for avoiding this in 3 factor apps?

tohagan commented 3 years ago

One solution comes to mind ...

Create command events with thread_id = request_id and thread_level = 0.
Business logic functions that execute command events create new events using the original thread_id and with thread_level = thread_level + 1 and of course a new request_id.
Function middleware can check if a command event includes thread_level and abort if it exceeds a threshold value (max event depth).
thread_id can be used in any diagnostics or reporting that requires event correlation including long running workflows, scheduled events, concurrent events etc.
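
The steps above can be sketched as a small piece of function middleware. This is only a minimal sketch, assuming events are plain dicts carrying `request_id`, `thread_id` and `thread_level`; the names `MAX_THREAD_LEVEL`, `new_command_event`, `child_event` and `with_depth_check` are hypothetical and not part of any framework.

```python
import uuid
from functools import wraps

MAX_THREAD_LEVEL = 10  # hypothetical threshold (max event depth)


class EventDepthExceeded(Exception):
    """Raised when an event chain is nested deeper than the allowed threshold."""


def new_command_event(payload):
    """Create a root command event: thread_id = request_id, thread_level = 0."""
    request_id = str(uuid.uuid4())
    return {"request_id": request_id, "thread_id": request_id,
            "thread_level": 0, "payload": payload}


def child_event(parent, payload):
    """Create a follow-up event on the parent's thread, one level deeper."""
    return {"request_id": str(uuid.uuid4()),        # always a new request_id
            "thread_id": parent["thread_id"],       # keep the original thread
            "thread_level": parent["thread_level"] + 1,
            "payload": payload}


def with_depth_check(handler, max_level=MAX_THREAD_LEVEL):
    """Middleware: abort before running the handler if the event is too deep."""
    @wraps(handler)
    def wrapped(event):
        level = event.get("thread_level")
        # Events without thread_level are passed through unchecked.
        if level is not None and level > max_level:
            raise EventDepthExceeded(
                f"thread {event.get('thread_id')} reached level {level}")
        return handler(event)
    return wrapped
```

A business logic function would be wrapped once with `with_depth_check` and would build any events it emits with `child_event(event, ...)`, so a runaway DB -> Function -> DB loop stops once `thread_level` passes the threshold.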

tohagan commented 3 years ago

Event threading is an old trick but you might consider it worth documenting as part of 3 factor examples or perhaps even supporting in the framework.

Here's the backstory that you've probably already read.

https://blog.tomilkieway.com/72k-1/
https://blog.tomilkieway.com/72k-2/ - The event loop in Part 2 smells awfully like the 3 factor pattern

tirumaraiselvan commented 3 years ago

Great question!

I can tell you how you can address this in Hasura GraphQL Engine.

With Hasura Event Triggers, what you can do is choose the type of operation (INSERT, UPDATE, DELETE) and also choose the "LISTEN" columns for updates: https://hasura.io/docs/1.0/graphql/core/event-triggers/create-trigger.html#listen-columns-for-update

We are also planning conditional triggers: https://github.com/hasura/graphql-engine/issues/1241 which will invoke a function only if some boolean expression (on the old or new row) is satisfied.
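
Until conditional triggers are available, a similar filter can be approximated inside the function itself. The following is a rough sketch, not Hasura's implementation: it assumes the webhook payload exposes the operation as `event.op` and the row images as `event.data.old` and `event.data.new`, and the helper name `should_handle` is hypothetical.

```python
def should_handle(payload, listen_columns):
    """Return True if the function should act on this event.

    Non-UPDATE operations are always handled; an UPDATE is handled only
    when at least one of the listened columns actually changed.
    """
    event = payload["event"]
    if event["op"] != "UPDATE":
        return True
    old = event["data"]["old"] or {}
    new = event["data"]["new"] or {}
    return any(old.get(col) != new.get(col) for col in listen_columns)
```

A function that only reacts to changes in, say, a `status` column would return early when `should_handle(payload, ["status"])` is false, so its own writes to other columns do not re-trigger its logic.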

tohagan commented 3 years ago

Having conditional logic as you've described to filter triggers does not prevent this kind of bug. Adding a condition may help you fix it, but it won't detect or prevent it. The problem is that, just by looking at the Event Triggers (and their conditions), it's not obvious that the loop even exists. That's because the programmer is only seeing the 1st part of the loop code. The 2nd part is buried in the logic of the serverless function, which performs a database update that then fires a new Event Trigger. This kind of distributed execution path is inherently non-obvious: the programmer can easily miss the complete call graph because the "code" is split between the two systems. The programmer is also likely to unit test only the function logic (possibly even stubbing database updates) and thus may fail to detect the distributed event loop.

A safe solution needs to behave similarly to the way that a compiler/interpreter runtime or CPU hardware detects "stack overflow", except in our case the "calls" are distributed. To detect and prevent this you need a generic method (ideally supported by the framework) that computes the nested call depth between distributed execution threads. That's the solution I've proposed above. Identifying distributed threads has many other diagnostic benefits, especially for long running (workflow) processes.

I can think of one case where an infinite distributed loop is ok. That's where we set up a chain of calls (events) with a computed delay performed by the serverless function between each event. Of course, commonly we'd use a scheduled trigger (cron) for this, but sometimes the delay intervals between these scheduled calls need to be computed each time with hand-crafted code. So to cater for this scenario, you'd need to ensure that the call depth check is optional.
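
One way to make the check optional is to let each function opt out explicitly. This is a minimal sketch, assuming events are dicts with `thread_id` and `thread_level`; `depth_guard`, `poll_with_backoff` and the backoff formula are hypothetical illustrations, not framework features.

```python
from functools import wraps
from typing import Optional


class EventDepthExceeded(Exception):
    """Raised when an event chain exceeds the handler's depth limit."""


def depth_guard(max_level: Optional[int]):
    """Decorator factory: max_level=None disables the depth check."""
    def decorator(handler):
        @wraps(handler)
        def wrapped(event):
            level = event.get("thread_level", 0)
            if max_level is not None and level > max_level:
                raise EventDepthExceeded(f"level {level} > {max_level}")
            return handler(event)
        return wrapped
    return decorator


@depth_guard(max_level=None)  # intentional, unbounded chain of delayed events
def poll_with_backoff(event):
    level = event.get("thread_level", 0)
    delay = min(2 ** level, 300)  # hand-crafted delay, capped at 300 seconds
    # Describe the next event in the chain; emitting it is left to the caller.
    return {"thread_id": event["thread_id"],
            "thread_level": level + 1,
            "delay_seconds": delay}


@depth_guard(max_level=5)  # ordinary handlers keep the safety net
def ordinary_handler(event):
    return "handled"
```

Opting out per function keeps the check on by default, so only the handlers that are meant to chain indefinitely bypass it.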

hasura / 3factor-example

Preventing cloud event recursion #29