Open mwadams opened 4 years ago
Initial thoughts include acquiring the lease for the particular instance outside of the engine, and not releasing it until we explicitly leave the engine. That is, if the engine faults, the lease is still held, which would prevent that particular instance from being processed again until it is fixed. However, logging and alerting would need to be improved, as would forensic tools to actively seek out faults that went unreported (e.g. because the Azure infrastructure went down at a critical point during processing).
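To make the lease idea concrete, here is a minimal sketch in Python (not the engine's own code). Everything in it is hypothetical: `InMemoryLeaseStore`, `LeaseUnavailable`, `process_instance`, and the `run_engine` callback are illustrative names only, and the in-memory dictionary stands in for whatever distributed lease mechanism would really be used. It also assumes a lease with no expiry, which is a simplification.

```python
import logging
from dataclasses import dataclass, field
from typing import Callable, Dict

logger = logging.getLogger("workflow.lease")


class LeaseUnavailable(Exception):
    """Raised when an instance is already leased (possibly by a faulted run)."""


@dataclass
class InMemoryLeaseStore:
    """Hypothetical stand-in for a distributed lease provider."""

    # Maps instance id -> holder id. No expiry: an assumption of this sketch.
    _holders: Dict[str, str] = field(default_factory=dict)

    def acquire(self, instance_id: str, holder: str) -> None:
        if instance_id in self._holders:
            raise LeaseUnavailable(
                f"{instance_id} is leased by {self._holders[instance_id]}"
            )
        self._holders[instance_id] = holder

    def release(self, instance_id: str, holder: str) -> None:
        # Only the current holder may release the lease.
        if self._holders.get(instance_id) == holder:
            del self._holders[instance_id]

    def held(self) -> Dict[str, str]:
        """Snapshot of outstanding leases, e.g. for forensic tooling."""
        return dict(self._holders)


def process_instance(
    store: InMemoryLeaseStore,
    run_engine: Callable[[str], None],
    instance_id: str,
    holder: str,
) -> None:
    # The lease is taken outside the engine call.
    store.acquire(instance_id, holder)
    try:
        run_engine(instance_id)
    except Exception:
        # Deliberately no release here: the lease stays in place so the
        # instance cannot be picked up again until someone intervenes.
        logger.exception("Engine faulted on %s; lease left in place", instance_id)
        raise
    # Reached only when the engine returns normally.
    store.release(instance_id, holder)
```

The deliberate absence of a `finally` block is the point of the sketch: a clean run releases the lease, while a fault leaves it outstanding, so a later attempt on the same instance raises `LeaseUnavailable`. A tool that scans `held()` for long-lived leases is one way the "actively seek faults" idea might be approached.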
While we deal with a particular workflow instance entering a Faulted state in a robust manner that prevents the workflow "soldiering on" and potentially corrupting the system state, if the WorkflowEngine itself fails (typically when persisting a `WorkflowInstance` or the corresponding change log entry), our current approach is not necessarily the most robust. We need to consider recovery scenarios, and distributed faulting where we have scaled the engine out.