littlehorse-enterprises / littlehorse

This repository contains the code for the LittleHorse Server, Dashboard, CLI, and Java/Go/Python SDKs. Brought to you by LittleHorse Enterprises LLC.
https://littlehorse.dev/

Notifications for `ERROR` state `WfRun`s #584

Open coltmcnealy-lh opened 9 months ago

coltmcnealy-lh commented 9 months ago

Context

When a WfRun reaches the ERROR state, some Very Bad Thing has happened in the technological sense (e.g. a server or network failure, or a WfSpec bug) that the developers didn't want to happen.

Currently, there is no push-based mechanism for alerting LH Users that this has happened. They can see it in the metrics on the Dashboard or via WfRun search, but those are pull-based approaches.

Description

We need a proactive way of notifying LH Users (i.e. developers) that a Very Bad Thing has happened.

Acceptance Criteria

The implementation:

As such, simply adding a Prometheus metric for the number of WfRuns in the ERROR state will not be sufficient.

Out of Scope

No response

Technical Context

There are two ways this can work:

  1. Kafka topic with output events
  2. Allow users to register a special kind of TaskDef to be executed when a WfRun of a specific WfSpec gets into the ERROR state.
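To make option 2 more concrete, here is a rough sketch of the dispatch logic it implies: handlers are registered per WfSpec and invoked only when a WfRun of that WfSpec moves to ERROR. Everything in it (`WfRunStatusChange`, `ErrorHandlerRegistry`, the field names) is hypothetical and made up for illustration; it is not the LittleHorse API, and a real implementation would schedule a TaskRun rather than call a Python function.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class WfRunStatusChange:
    """Hypothetical status-change record; not a LittleHorse type."""
    wf_run_id: str
    wf_spec_name: str
    status: str


ErrorHandler = Callable[[WfRunStatusChange], None]


@dataclass
class ErrorHandlerRegistry:
    """Hypothetical in-memory stand-in for 'a special kind of TaskDef' per WfSpec."""
    handlers: Dict[str, List[ErrorHandler]] = field(default_factory=dict)

    def register(self, wf_spec_name: str, handler: ErrorHandler) -> None:
        # Several handlers may be attached to the same WfSpec.
        self.handlers.setdefault(wf_spec_name, []).append(handler)

    def on_status_change(self, change: WfRunStatusChange) -> int:
        # Only the ERROR state triggers handlers; other states are ignored.
        if change.status != "ERROR":
            return 0
        matched = self.handlers.get(change.wf_spec_name, [])
        for handler in matched:
            handler(change)
        return len(matched)
```

A caller would register something like `registry.register("my-wf", page_on_call)` once, and the server side would call `on_status_change` on every WfRun transition.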

Problems with 1:

Benefits of 2:

Components

coltmcnealy-lh commented 1 month ago

We will go with option 1, the Output Topic.
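For illustration, a minimal sketch of what a consumer of such an output topic might do on the user's side: read events, keep only those reporting the ERROR state, and push a notification. The event format is an assumption (JSON with hypothetical `wfRunId`, `wfSpecName`, and `status` fields); the actual output topic schema is not specified in this issue. The Kafka client is left out on purpose, so `records` is any iterable of raw message values.

```python
import json
from typing import Callable, Iterable


def notify_on_error(records: Iterable[bytes], notify: Callable[[str], None]) -> int:
    """Send one notification per ERROR event and return how many were sent.

    Assumes each record is a JSON object with hypothetical fields
    'wfRunId', 'wfSpecName', and 'status'.
    """
    sent = 0
    for raw in records:
        event = json.loads(raw)
        # Skip every event that does not report the ERROR state.
        if event.get("status") != "ERROR":
            continue
        notify(
            f"WfRun {event['wfRunId']} of WfSpec {event['wfSpecName']} "
            "entered the ERROR state"
        )
        sent += 1
    return sent
```

In practice `records` would come from a Kafka consumer subscribed to the output topic, and `notify` would wrap whatever channel the team uses (email, chat webhook, pager).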