Notifications for `ERROR` state `WfRun`s

Context

When a WfRun reaches the ERROR state, some Very Bad Thing has happened in the technological sense (e.g. a server or network failure, or a WfSpec bug) that the developers did not intend.

Currently, there is no push-based mechanism for alerting LH Users that this has happened. They can see this in the metrics on the Dashboard or via WfRun search, but those are pull-based approaches.

Description

We need a proactive way of notifying LH Users (i.e. developers) that a Very Bad Thing has happened.

Acceptance Criteria

The implementation must:

work in a multi-tenant LittleHorse installation
work when users cannot access the Prometheus port (e.g. in LH Cloud)
work for users whose alerting mechanisms are not Prometheus-compatible

As such, simply adding a Prometheus metric for the number of WfRuns in the ERROR state will not be sufficient.

Out of Scope

No response

Technical Context

There are two ways this can work:

1. Kafka topic with output events
2. Allow users to register a special kind of TaskDef to be executed when a WfRun of a specific WfSpec gets into the ERROR state.

Problems with 1:

We would need to dynamically create a new Kafka Topic for each Tenant
All events for all WfSpecs would likely end up on the same Kafka Topic, otherwise we would end up creating way too many topics
It would require users to be familiar with Kafka in order to take advantage of this feature. Currently, Kafka is transparent to users
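
To make the problems with option 1 concrete, here is a minimal in-memory sketch. It is not LittleHorse or Kafka code: `ErrorEvent`, `topic_for`, `InMemoryBroker`, and the topic naming scheme are all hypothetical stand-ins. It assumes one topic per Tenant, which means every consumer has to read all of a Tenant's error events and filter by WfSpec on its own side.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorEvent:
    tenant_id: str
    wf_spec_name: str
    wf_run_id: str


def topic_for(tenant_id: str) -> str:
    # Hypothetical naming scheme: one topic per Tenant, created on demand.
    return f"{tenant_id}-wfrun-errors"


class InMemoryBroker:
    """Stand-in for Kafka: maps a topic name to the list of events on it."""

    def __init__(self) -> None:
        self.topics = defaultdict(list)

    def publish(self, event: ErrorEvent) -> None:
        # Events for every WfSpec of a Tenant land on the same topic.
        self.topics[topic_for(event.tenant_id)].append(event)

    def consume(self, tenant_id: str, wf_spec_name: str) -> list:
        # The consumer reads the whole Tenant topic and filters client-side.
        events = self.topics[topic_for(tenant_id)]
        return [e for e in events if e.wf_spec_name == wf_spec_name]


broker = InMemoryBroker()
broker.publish(ErrorEvent("tenant-a", "orders", "run-1"))
broker.publish(ErrorEvent("tenant-a", "billing", "run-2"))
broker.publish(ErrorEvent("tenant-b", "orders", "run-3"))
```

In this sketch a new Tenant implies a new topic, and a user who only cares about one WfSpec still has to understand topics and consumers, which is the familiarity with Kafka that the list above warns about.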

Benefits of 2:

Fits into multi-tenant LH seamlessly
Easily configurable on a per-WfSpec basis
Easy to extend
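
A rough sketch of how option 2 could behave, assuming a registry keyed by Tenant and WfSpec name. Everything here (`ErrorHandlerRegistry`, `ScheduledTask`, `on_status_change`, the variable names passed to the task) is hypothetical and only models the idea; it is not the LittleHorse server or SDK API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ScheduledTask:
    tenant_id: str
    task_def_name: str
    variables: dict


@dataclass
class ErrorHandlerRegistry:
    # (tenant_id, wf_spec_name) -> name of the TaskDef to run on ERROR
    handlers: dict = field(default_factory=dict)
    # Tasks that would be handed to Task Workers in a real system
    scheduled: list = field(default_factory=list)

    def register(self, tenant_id: str, wf_spec_name: str, task_def_name: str) -> None:
        self.handlers[(tenant_id, wf_spec_name)] = task_def_name

    def on_status_change(
        self, tenant_id: str, wf_spec_name: str, wf_run_id: str, new_status: str
    ) -> Optional[ScheduledTask]:
        # Only the ERROR state triggers a notification task.
        if new_status != "ERROR":
            return None
        task_def_name = self.handlers.get((tenant_id, wf_spec_name))
        if task_def_name is None:
            return None  # no handler registered for this Tenant and WfSpec
        task = ScheduledTask(
            tenant_id=tenant_id,
            task_def_name=task_def_name,
            variables={"wfRunId": wf_run_id, "wfSpecName": wf_spec_name},
        )
        self.scheduled.append(task)
        return task


registry = ErrorHandlerRegistry()
registry.register("tenant-a", "orders", "notify-on-error")
task = registry.on_status_change("tenant-a", "orders", "run-1", "ERROR")
```

Because the lookup key includes the Tenant, a handler registered by one Tenant never fires for another, and the notification channel (email, chat, paging) is whatever the user's Task Worker for that TaskDef chooses to do.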

Components

[ ] Dashboard
[ ] Server
[ ] Python SDK
[ ] Go SDK
[ ] Java SDK
[ ] C# SDK
[ ] LH Control
[ ] LH Tests Utils
