marain-dotnet / Marain.Workflow

Highly scalable workflow service. Sponsored by endjin.
GNU Affero General Public License v3.0

Determine the best approach to faulting the workflow engine. #139

Open mwadams opened 4 years ago

mwadams commented 4 years ago

We already handle a particular workflow instance entering a Faulted state robustly: the workflow is prevented from "soldiering on" and potentially corrupting the system state. However, if the WorkflowEngine itself fails (typically when persisting a WorkflowInstance or the corresponding change log entry), our current approach is not necessarily the most robust.

We need to consider recovery scenarios, and distributed faulting where we have scaled the engine out.

mwadams commented 4 years ago

Initial thoughts include acquiring the lease for the particular instance outside of the engine, and not releasing it until we explicitly leave the engine. That is, if the engine faults, the lease remains held, which would prevent that particular instance from being processed again until it is fixed. However, logging and alerting would need to be improved, as would forensic tools to actively seek out faults that went unreported (e.g. because the Azure infrastructure went down at a critical point during processing).
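
To make the lease idea concrete, here is a rough sketch in Python rather than the project's own language. Everything in it is hypothetical and for illustration only: `InMemoryLeaseStore`, `EngineFault` and `process_instance` are not Marain.Workflow types, and the in-memory set merely stands in for a distributed lease provider. The sketch assumes a lease stays held until it is explicitly released. If the real provider expires leases, renewal or a durable fault marker would be needed as well.

```python
import logging

logger = logging.getLogger("workflow.host")


class EngineFault(Exception):
    """Hypothetical error raised when the engine itself fails, e.g. while persisting."""


class InMemoryLeaseStore:
    """Stand-in for a distributed lease provider (an assumption for this sketch)."""

    def __init__(self):
        self._held = set()

    def try_acquire(self, instance_id):
        # Refuse the lease if any host already holds it.
        if instance_id in self._held:
            return False
        self._held.add(instance_id)
        return True

    def release(self, instance_id):
        self._held.discard(instance_id)

    def held(self):
        # Leases still held; a forensic scan could start from this list.
        return frozenset(self._held)


def process_instance(leases, instance_id, run_engine):
    """Acquire the lease outside the engine; release it only on a clean exit.

    Returns "processed", "skipped" or "faulted".
    """
    if not leases.try_acquire(instance_id):
        return "skipped"
    try:
        run_engine(instance_id)
    except EngineFault:
        # Deliberately keep the lease so the instance cannot be picked up
        # again until someone has investigated and released it.
        logger.exception("Engine faulted for instance %s; lease retained", instance_id)
        return "faulted"
    # Only reached when the engine returned normally. Any other exception
    # propagates to the caller with the lease still held.
    leases.release(instance_id)
    return "processed"


def failing_engine(instance_id):
    raise EngineFault("could not persist change log entry")


if __name__ == "__main__":
    store = InMemoryLeaseStore()
    print(process_instance(store, "instance-1", failing_engine))   # faulted
    print(process_instance(store, "instance-1", lambda _id: None)) # skipped
```

The key design choice is that the release is not in a `finally` block: a fault leaves the lease in place, so a second host that picks up the same instance is turned away rather than processing potentially inconsistent state. The retained leases returned by `held()` hint at where a tool hunting for unreported faults might begin, although that would still need to be paired with the improved logging and alerting mentioned above.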