Context
When a WfRun reaches the ERROR state, some Very Bad Thing has happened in the technological sense (e.g. a server or network failure, a WfSpec bug, etc.) that the developers didn't want to happen.
Currently, there is no push-based mechanism for alerting LH Users that this has happened. They can see it in the metrics on the Dashboard or via WfRun search, but those are pull-based approaches.
Description
We need a proactive way of notifying LH Users (i.e. developers) that a Very Bad Thing has happened.
Acceptance Criteria
The implementation must:
work in a multi-tenant LittleHorse installation
work when users cannot access the Prometheus port (e.g. in LH Cloud)
work for users who use alerting mechanisms which are not Prometheus-compatible
As such, simply adding a Prometheus metric for the number of WfRuns in the ERROR state will not be sufficient.
Out of Scope
No response
Technical Context
There are two ways this can work:
1. Publish output events to a Kafka topic.
2. Allow users to register a special kind of TaskDef to be executed when a WfRun of a specific WfSpec gets into the ERROR state.
Problems with 1:
We would need to dynamically create a new Kafka Topic for each Tenant
All events for all WfSpecs would likely end up on the same Kafka Topic; otherwise we would end up creating way too many topics
It would require users to be familiar with Kafka in order to take advantage of this feature. Currently, Kafka is transparent to users (they never have to interact with it directly)
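To make option 2 more concrete, here is a minimal sketch of the idea in plain Python. It does not use the LittleHorse SDK, and every name in it (`ErrorHandlerRegistry`, `WfRunError`, `on_status_change`, the tenant and WfSpec names) is hypothetical and only for illustration. It assumes handlers are keyed by tenant and WfSpec name, and that the server would invoke the registered handler whenever a WfRun moves into the ERROR state.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class WfRunError:
    """Hypothetical payload passed to a handler when a WfRun enters ERROR."""
    tenant_id: str
    wf_spec_name: str
    wf_run_id: str
    message: str


# A plain callable stands in for the "special kind of TaskDef" from option 2.
ErrorHandler = Callable[[WfRunError], None]


class ErrorHandlerRegistry:
    """Maps (tenant, WfSpec name) to the handler registered for that WfSpec."""

    def __init__(self) -> None:
        self._handlers: Dict[Tuple[str, str], ErrorHandler] = {}

    def register(self, tenant_id: str, wf_spec_name: str, handler: ErrorHandler) -> None:
        self._handlers[(tenant_id, wf_spec_name)] = handler

    def on_status_change(self, tenant_id: str, wf_spec_name: str,
                         wf_run_id: str, new_status: str, message: str = "") -> bool:
        """Invoke the handler if the WfRun just entered ERROR; return True if one ran."""
        if new_status != "ERROR":
            return False
        handler = self._handlers.get((tenant_id, wf_spec_name))
        if handler is None:
            return False  # nothing registered for this tenant/WfSpec pair
        handler(WfRunError(tenant_id, wf_spec_name, wf_run_id, message))
        return True


# Usage: one tenant registers a handler for a single WfSpec.
alerts: List[str] = []
registry = ErrorHandlerRegistry()
registry.register("tenant-a", "checkout",
                  lambda e: alerts.append(f"{e.tenant_id}/{e.wf_run_id}: {e.message}"))

registry.on_status_change("tenant-a", "checkout", "run-1", "ERROR", "task worker unreachable")
registry.on_status_change("tenant-b", "checkout", "run-2", "ERROR", "boom")  # other tenant: no handler
registry.on_status_change("tenant-a", "checkout", "run-3", "COMPLETED")      # not an ERROR transition
```

Because the lookup key includes the tenant, a handler registered by one tenant never fires for another tenant's WfRuns, and users need no Kafka knowledge to receive the notification.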
Benefits of 2:
... WfSpec basis
Components