kestra-io / kestra

Infinitely scalable, event-driven, language-agnostic orchestration and scheduling platform to manage millions of workflows declaratively in code.
https://kestra.io
Apache License 2.0
7.15k stars 429 forks source link

Add an execution reference field (correlationId) #2059

Open tchiotludo opened 10 months ago

tchiotludo commented 10 months ago

see discussion here

There is a common notion of a key bound to execution instance. Existing flow processing tools refer to this key as business key or reference key. Such key is set at execution invocation and eventually gets copied to spawned child-processes/sub-flows. The key can be used for effective search based on the data being processed.

Such reference can be set via a combination of input & label, although having a dedicated property of execution might bring some benefits.

You have a data set which is characterized by a single ID - let's say a batch number. After you process that data, you receive a question regarding processing of that particular piece of data. It's like: "Show me the logs from processing of batch XYZ" or "When did we process the batch XYZ?" So you need to query the processing history and obtain all related executions.

We could add a simple reference on the execution, each subflow, trigger that start from that flow will copy the reference, and we need to have this field visible on the list and to allow search on that one.

loicmathieu commented 10 months ago

In the integration world (EAI/ESB/SOA) we refere at this by the name of correlation identifier (correlationId) as it allows to correlate multiple messages (here executions) together.

anna-geller commented 10 months ago

Problem

The user's main problem is the heavy reliance on labels. To analyze the state of workflows and task run metrics, some users filter those metrics based on Execution labels, which can be complex.

Use cases

"Our use case is to build parametrized flows orchestrating different processes with steps like input data validation via provided schema, modifying the data, and sending the data to an external system. Such flow is parametrized by inputs to use different validation schemas, perform different payload modifications, and use different connection parameters. We then use labels to differentiate those sets of parameters."

The reference key can be described as "a label identifying particular processed input data". Such key is identified prior feeding the data to Kestra. It is often a "business-relevant ID" contained in the data or a concatenated value from multiple sources.

It could be configured manually by the user in the flow definition.

Possible Solutions

The idea of this reference field is to create a surrogate key comprised of multiple fields for easier filtering of relevant operational data to answer questions such as "What are top 10 failing executions in the past 1h containing label env:prod grouped by label x and flowId => Which PROD flows of a "type" specified by the given label x are most troublesome?"

damienkilgannon commented 1 month ago

Since, this is been pushed back in priorities ... anybody have an example work around they use to achieve similar for the time been?

anna-geller commented 1 month ago

@damienkilgannon you can leverage the label task and filter executions by that label, e.g.:

  - id: label
    type: io.kestra.plugin.core.execution.Labels
    labels:
      url: "{{ outputs.some_task.value }}"