opensearch-project / observability

Visualize and explore your logs, traces and metrics data in OpenSearch Dashboards
https://opensearch.org/docs/latest/observability-plugin/index/
Apache License 2.0

[RFC] Observability Workflow API #1805

Open YANG-DB opened 4 months ago

YANG-DB commented 4 months ago

Is your feature request related to a problem?

This RFC proposes the introduction of a new feature within the OpenSearch Observability plugin: a workflow-based API named "Workflow".

This API is designed to orchestrate multiple steps, each corresponding to a separate API call to a core API within OpenSearch. The primary goal of the Workflow API is to facilitate complex data preparation processes involving enrichment, aggregation, reindexing, and other transformations.

By enabling users to define multi-step workflows, this feature aims to simplify the preparation of raw data for visualization, thereby providing deeper insights into system status and health.

Proposal

Workflow Definition

A workflow is defined as a sequence of steps, each of which corresponds to an API call within OpenSearch. Users can specify the details of each step, including the API endpoint to be called, the HTTP method to use, and the body of the request. Workflows are defined in JSON format, allowing for easy creation, modification, and sharing.
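A minimal sketch of how such a definition could be checked before it is accepted. The field names (`name`, `method`, `endpoint`, `steps`) follow the JSON examples in this RFC; the validation rules, the `validate_workflow` helper, and the list of allowed methods are assumptions for illustration, not part of the proposal.

```python
import json

# Field names follow the JSON examples in this RFC.
# The rules themselves are assumptions, not part of the proposal.
REQUIRED_STEP_FIELDS = ("name", "method", "endpoint")
ALLOWED_METHODS = {"GET", "POST", "PUT", "DELETE", "HEAD", "PATCH"}


def validate_workflow(definition):
    """Return a list of problems; an empty list means the definition looks usable."""
    problems = []
    if not definition.get("name"):
        problems.append("workflow is missing 'name'")
    steps = definition.get("steps")
    if not isinstance(steps, list) or not steps:
        problems.append("workflow needs a non-empty 'steps' list")
        return problems
    seen = set()
    for position, step in enumerate(steps):
        label = step.get("name", f"#{position}")
        for field in REQUIRED_STEP_FIELDS:
            if field not in step:
                problems.append(f"step {label}: missing '{field}'")
        method = step.get("method")
        if method is not None and method not in ALLOWED_METHODS:
            problems.append(f"step {label}: unsupported method '{method}'")
        if label in seen:
            problems.append(f"step {label}: duplicate step name")
        seen.add(label)
    return problems


workflow = json.loads("""
{
  "name": "Reds-Services-Workflow",
  "version": "1.0.0",
  "steps": [
    {"name": "reindex_old_otel_spans_index_to_new",
     "method": "POST",
     "endpoint": "_reindex",
     "body": {}}
  ]
}
""")
print(validate_workflow(workflow))
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a definition at once.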

Workflow Structure

A typical workflow definition includes two groups of components, the workflow metadata parameters and the workflow steps, as the following example shows:

Example

The workflow metadata parameters:

{
  "name": "Reds-Services-Workflow",
  "description": "This workflow creates RED transformation metrics to measure service performance",
  "version": "1.0.0",
  "parameters": [
    "index_name",
    "rollup_name",
    "current_time",
    "start_time"
  ],

The workflow steps parameters:

  ....
  "steps": [
    {
      "name": "add_timestamp_field_to_otel_spans_index",
      "method": "PUT",
      "endpoint": "/${index_name}",
      "body": { ... }
    },
    ....
  ]
}
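The RFC does not say how the `${index_name}` style placeholders are resolved. One possible reading is plain string substitution over every string in a step. The sketch below assumes that reading and uses Python's `string.Template`, whose `${name}` syntax happens to match. The `render` helper and the index name `otel-spans-fixed` are hypothetical.

```python
from string import Template


def render(value, params):
    """Recursively substitute ${name} placeholders in strings, dicts, and lists."""
    if isinstance(value, str):
        # substitute() raises KeyError when a placeholder has no supplied value.
        return Template(value).substitute(params)
    if isinstance(value, dict):
        return {key: render(item, params) for key, item in value.items()}
    if isinstance(value, list):
        return [render(item, params) for item in value]
    return value  # numbers, booleans, and None pass through unchanged


# Step taken from the RFC's example workflow.
step = {
    "name": "reindex_old_otel_spans_index_to_new",
    "method": "POST",
    "endpoint": "_reindex",
    "body": {
        "source": {"index": "otel-v1-apm-span-*"},
        "dest": {"index": "${index_name}"},
    },
}

# "otel-spans-fixed" is a made-up value for illustration.
rendered = render(step, {"index_name": "otel-spans-fixed"})
print(rendered["body"]["dest"]["index"])
print(render("/${index_name}", {"index_name": "otel-spans-fixed"}))
```

One caveat of this approach: `string.Template` treats every `$` as special, so a request body that contains a literal `$` would need it escaped as `$$`.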

Workflow Backend Support

The workflow JSON request will be stored as a separate, versioned document. This document will be enriched with the HTTP response returned by each step's endpoint.

Each workflow request will be assigned an ID, which can later be queried for status.
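To make the backend behavior concrete, here is a rough sketch of a sequential runner that assigns an ID and enriches each step with its response. It is not the plugin's implementation. The `send` callback, the `response` and `error` field names, the `failed` status, the UUID-based ID, and the stop-on-first-failure policy are all assumptions that the RFC does not specify.

```python
import uuid
from datetime import datetime, timezone


def now_iso():
    return datetime.now(timezone.utc).isoformat()


def run_workflow(definition, send):
    """Run steps in order. `send(method, endpoint, body)` is a caller-supplied transport."""
    record = {
        "id": str(uuid.uuid4()),  # assumption: the RFC does not define the ID format
        "name": definition["name"],
        "version": definition.get("version"),
        "@timestamp": now_iso(),
        "steps": [],
    }
    for step in definition["steps"]:
        entry = {
            "name": step["name"],
            "method": step["method"],
            "endpoint": step["endpoint"],
            "startTime": now_iso(),
            "status": "running",
        }
        record["steps"].append(entry)
        try:
            # Enrich the stored step with whatever the endpoint returned.
            entry["response"] = send(step["method"], step["endpoint"], step.get("body"))
            entry["status"] = "complete"
        except Exception as error:
            entry["status"] = "failed"  # assumption: not a status named in the RFC
            entry["error"] = str(error)
        entry["endTime"] = now_iso()
        if entry["status"] != "complete":
            break  # assumption: stop at the first failing step
    return record


def fake_send(method, endpoint, body):
    # Stand-in for a real HTTP call to OpenSearch.
    return {"acknowledged": True}


definition = {
    "name": "Reds-Services-Workflow",
    "version": "1.0.0",
    "steps": [
        {"name": "add_timestamp_field_to_otel_spans_index", "method": "PUT", "endpoint": "/my-index"},
        {"name": "reindex_old_otel_spans_index_to_new", "method": "POST", "endpoint": "_reindex"},
    ],
}
record = run_workflow(definition, fake_send)
print(record["id"], [step["status"] for step in record["steps"]])
```

Passing the transport in as a function keeps the sketch runnable without a cluster; a real backend would issue the request against the OpenSearch core API instead.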

A status API will be available for querying and will report the workflow state according to the status of each step:

.../workflow/${id}/status - returns each step's status according to the current state.

Example

{
  "name": "Reds-Services-Workflow",
  "description": "This workflow creates RED transformation metrics to measure service performance",
  "version": "1.0.0",
  "@timestamp": "2024-01-28T02:31:51.379677283Z",
  "steps": [
    {
      "name": "add_timestamp_field_to_otel_spans_index",
      "method": "PUT",
      "endpoint": "/${index_name}",
      "startTime": "2024-01-28T02:31:51.379677283Z",
      "endTime": "2024-01-28T02:31:51.379677283Z",
      "status": "complete"
    },
    {
      "name": "reindex_old_otel_spans_index_to_new",
      "method": "POST",
      "endpoint": "_reindex",
      "startTime": "2024-01-28T02:31:51.379677283Z",
      "endTime": "2024-01-28T02:31:51.379677283Z",
      "status": "complete"
    },
    {
      "name": "rollup_REDS_services_metrics_index",
      "method": "PUT",
      "endpoint": "_plugins/_rollup/jobs/${rollup_name}",
      "startTime": "2024-01-28T02:31:51.379677283Z",
      "status": "running",
      "task": {
        "nodes": {
          "Mgqdm0r9SEGClWxp_RbnaQ": {
            "name": "opensearch-node1",
            "transport_address": "172.18.0.3:9300",
            "host": "172.18.0.3",
            "ip": "172.18.0.3:9300",
            "roles": [
              "data",
              "ingest",
              "master",
              "remote_cluster_client"
            ],
            "tasks": {
              "Mgqdm0r9SEGClWxp_RbnaQ:17416": {...},
              "Mgqdm0r9SEGClWxp_RbnaQ:17413": {...},
              "Mgqdm0r9SEGClWxp_RbnaQ:17366": {...}
            }
          }
        }
      }
    }
  ]
}
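The RFC says the workflow state follows from the per-step statuses but does not give the rule. A possible rule, sketched under assumptions: any failed step fails the workflow, any running step keeps it running, and it is complete only when every listed step is complete. The `failed` and `pending` values and the `workflow_state` helper are hypothetical; only `complete` and `running` appear in the example above. The sketch also assumes the status document lists every step of the workflow.

```python
def workflow_state(record):
    """Derive one overall state from the per-step statuses of a status document."""
    statuses = [step.get("status") for step in record.get("steps", [])]
    if any(status == "failed" for status in statuses):  # "failed" is an assumed status
        return "failed"
    if any(status == "running" for status in statuses):
        return "running"
    if statuses and all(status == "complete" for status in statuses):
        return "complete"
    return "pending"  # assumed state for a workflow with no started steps


# Statuses taken from the example status document.
status_document = {
    "name": "Reds-Services-Workflow",
    "steps": [
        {"name": "add_timestamp_field_to_otel_spans_index", "status": "complete"},
        {"name": "reindex_old_otel_spans_index_to_new", "status": "complete"},
        {"name": "rollup_REDS_services_metrics_index", "status": "running"},
    ],
}
print(workflow_state(status_document))
```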

Example Workflow: RED Metrics for Service Performance

Below is an example workflow definition request designed to create RED (Rate, Errors, Duration) transformation metrics to measure service performance.

The following workflow prepares an aggregated transformation index that contains the RED (requests, errors, duration) summaries for the Observability services, as they are defined by the OTEL schema.

{
  "name": "Reds-Services-Workflow",
  "description": "This workflow creates RED transformation metrics to measure service performance",
  "version": "1.0.0",
  "parameters": [
    "index_name",
    "rollup_name",
    "current_time",
    "start_time"
  ],
  "steps": [
    {
      "name": "add_timestamp_field_to_otel_spans_index",
      "method": "PUT",
      "endpoint": "/${index_name}",
      "body": {
        "mappings": { ... }
      }
    },
    {
      "name": "reindex_old_otel_spans_index_to_new",
      "method": "POST",
      "endpoint": "_reindex",
      "body": {
        "source": {
          "index": "otel-v1-apm-span-*"
        },
        "dest": {
          "index": "${index_name}"
        },
        "script": {
          "source": "ctx._source['@timestamp'] = ctx._source.startTime;"
        }
      }
    },
    {
      "name": "rollup_REDS_services_metrics_index",
      "method": "PUT",
      "endpoint": "_plugins/_rollup/jobs/${rollup_name}",
      "body": {
        "rollup": {
          "rollup_id": "rollup_REDS_services_metrics_index",
          "enabled": true,
          "schedule": {
            "interval": {
              "start_time": 1706146764865,
              "period": 1,
              "unit": "Minutes",
              "schedule_delay": 0
            }
          },
          "description": "",
          "schema_version": 19,
          "source_index": "otel-v1-apm-span-000072-fixed",
          "target_index": "rollup_services_metrics_index",
          "metadata_id": "58NGPo0B1DvKaqdigtmw",
          "page_size": 1000,
          "delay": 0,
          "continuous": true,
          "dimensions": [
            {
              "date_histogram": {
                "fixed_interval": "1h",
                "source_field": "@timestamp",
                "target_field": "@timestamp",
                "timezone": "UTC",
                "format": null
              }
            },
            {
              "terms": {
                "source_field": "serviceName",
                "target_field": "serviceName"
              }
            },
            {
              "terms": {
                "source_field": "@timestamp",
                "target_field": "@timestamp"
              }
            },
            {
              "histogram": {
                "source_field": "status.code",
                "target_field": "status.code",
                "interval": 1
              }
            },
            {
              "terms": {
                "source_field": "span.attributes.http@status_code",
                "target_field": "span.attributes.http@status_code"
              }
            }
          ],
          "metrics": [
            {
              "source_field": "durationInNanos",
              "metrics": [
                {
                  "max": {}
                },
                {
                  "avg": {}
                }
              ]
            }
          ]
        }
      }
    }
  ]
}
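One check that falls out of the definition above is comparing the declared `parameters` with the `${...}` placeholders that the steps actually use. The sketch below is a hypothetical lint, not something the RFC proposes; the `placeholders` and `check_parameters` helpers and the placeholder grammar in the regular expression are assumptions.

```python
import re

# Assumed placeholder grammar: ${identifier}
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def placeholders(value):
    """Collect every ${name} found in strings nested inside dicts and lists."""
    if isinstance(value, str):
        return set(PLACEHOLDER.findall(value))
    if isinstance(value, dict):
        value = list(value.values())
    if isinstance(value, list):
        found = set()
        for item in value:
            found |= placeholders(item)
        return found
    return set()


def check_parameters(definition):
    """Report placeholders that are used but not declared, and the reverse."""
    declared = set(definition.get("parameters", []))
    used = placeholders(definition.get("steps", []))
    return {
        "undeclared": sorted(used - declared),
        "unused": sorted(declared - used),
    }


# Trimmed copy of the example workflow: only the fields that carry placeholders.
definition = {
    "parameters": ["index_name", "rollup_name", "current_time", "start_time"],
    "steps": [
        {"endpoint": "/${index_name}"},
        {"endpoint": "_reindex", "body": {"dest": {"index": "${index_name}"}}},
        {"endpoint": "_plugins/_rollup/jobs/${rollup_name}"},
    ],
}
print(check_parameters(definition))
```

Run against the example, this would flag `current_time` and `start_time` as declared but unused, since the rollup schedule hard-codes its `start_time` value.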


dbwiddis commented 4 months ago

This looks like a fantastic use case for https://github.com/opensearch-project/flow-framework

minalsha commented 4 months ago

@dbwiddis is working on this. See https://github.com/opensearch-project/flow-framework/issues/522