opensearch-project / data-prepper

Data Prepper is a component of the OpenSearch project that accepts, filters, transforms, enriches, and routes data at scale.
https://opensearch.org/docs/latest/clients/data-prepper/index/
Apache License 2.0

Pipeline Component Ids #1025

Open dlvenable opened 2 years ago

dlvenable commented 2 years ago

Introduction

A Data Prepper pipeline can contain multiple sources, processors, and sinks of the same type. Presently, Data Prepper has no way to distinguish between them.

Proposed Solution

Data Prepper should assign a unique identifier for each pipeline component. The scope of the Id is within the current pipeline. There will be a fully-qualified Id, which is discussed at the end of this issue description. For most of this discussion, the Id is unique only within a single pipeline.

Additionally, pipeline authors may wish to configure some component Ids. This can help them debug their pipelines and make them more readable.

The following example shows how a pipeline author can configure the Id using an id property:

log-pipeline:
  source:
    http:
  prepper:
    - grok:
        id: extract-apache-logs
        match:
          log: [ "%{COMMONAPACHELOG}" ]
  sink:
    - opensearch:
        id: opensearch-a
        hosts: [ "https://opensearch-host-a" ]
    - opensearch:
        id: opensearch-b
        hosts: [ "https://opensearch-host-b" ]

Pipeline authors do not need to configure the id. Data Prepper will produce a default value.

Id Generation

The default Id generation should be deterministic. This will allow the peer-forwarder to use the id of a component and consistently supply Events to the correct component in a peer node.

The default Id generation can be:

${pluginType}${incrementedCount > 1 ? incrementedCount : ''}

The incrementedCount will be a number which is incremented for each component type individually. It can thus be stored in a map: Map<String, Integer> typeToIncrementedCount. The count will be incremented before applying the function above. So the first of any given type has incrementedCount == 1. This approach allows the pipelines without duplicates to continue to use the pluginType without having a trailing 1.
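
The rule above can be sketched in a few lines. This is only an illustrative Python sketch, not Data Prepper's implementation: the function name default_ids and the use of a plain list of plugin types as input are assumptions made for the example. It handles only components without a configured id.

```python
from collections import defaultdict


def default_ids(plugin_types):
    """Return one default Id per plugin type, in declaration order.

    Hypothetical helper: mirrors the proposed
    ${pluginType}${incrementedCount > 1 ? incrementedCount : ''} rule.
    """
    # Per-type counter, playing the role of typeToIncrementedCount.
    type_to_incremented_count = defaultdict(int)
    ids = []
    for plugin_type in plugin_types:
        # Increment before formatting, so the first of each type sees 1.
        type_to_incremented_count[plugin_type] += 1
        count = type_to_incremented_count[plugin_type]
        # Only the second and later components of a type get a suffix.
        ids.append(plugin_type if count == 1 else f"{plugin_type}{count}")
    return ids


print(default_ids(["http", "grok", "opensearch", "opensearch"]))
# prints ['http', 'grok', 'opensearch', 'opensearch2']
```

Because the result depends only on the declaration order of the components, every node that loads the same pipeline configuration derives the same Ids.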

Examples

No Configured Ids

log-pipeline:
  source:
    http:
  prepper:
    - grok:
        match:
          log: [ "%{COMMONAPACHELOG}" ]
  sink:
    - opensearch:
        hosts: [ "https://opensearch-host-a" ]
    - opensearch:
        hosts: [ "https://opensearch-host-b" ]

The Ids are: http, grok, opensearch, and opensearch2.

Some Configured Ids

log-pipeline:
  source:
    http:
  prepper:
    - grok:
        id: extract-apache-logs
        match:
          log: [ "%{COMMONAPACHELOG}" ]
  sink:
    - opensearch:
        id: opensearch-a
        hosts: [ "https://opensearch-host-a" ]
    - opensearch:
        hosts: [ "https://opensearch-host-b" ]

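The Ids are: http, extract-apache-logs, opensearch-a, and a generated default Id for the second opensearch sink.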

Alternatives

Duplicates Always Have Count Suffix

Another approach is to identify any plugin type that has more than one plugin in the pipeline. Only those that have more than one will have a suffix. This can be nice because each plugin of the same type has a more consistent name.

The disadvantage is that it may be more complicated to support. Is the improvement to the name really worthwhile here? Pipeline authors who want better names can control the id already.
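
To make the extra complexity concrete, here is a rough Python sketch of this alternative. The function name and the list-of-types input are assumptions for illustration only. Unlike a single incrementing pass, it must first count every type before it can decide whether a suffix is needed.

```python
from collections import Counter


def ids_with_suffix_on_duplicates(plugin_types):
    """Suffix every component of a type that appears more than once.

    Hypothetical helper: illustrates the alternative, not an existing API.
    """
    # First pass: total number of components of each type.
    totals = Counter(plugin_types)
    # Second pass: running position of each component within its type.
    seen = Counter()
    ids = []
    for plugin_type in plugin_types:
        seen[plugin_type] += 1
        if totals[plugin_type] > 1:
            ids.append(f"{plugin_type}{seen[plugin_type]}")
        else:
            ids.append(plugin_type)
    return ids


print(ids_with_suffix_on_duplicates(["http", "grok", "opensearch", "opensearch"]))
# prints ['http', 'grok', 'opensearch1', 'opensearch2']
```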

log-pipeline:
  source:
    http:
  prepper:
    - grok:
        match:
          log: [ "%{COMMONAPACHELOG}" ]
  sink:
    - opensearch:
        hosts: [ "https://opensearch-host-a" ]
    - opensearch:
        hosts: [ "https://opensearch-host-b" ]

The Ids are: http, grok, opensearch1, and opensearch2.

Count Across Components

Data Prepper could increment a single count shared by all components in the pipeline. The disadvantage is that a component still gets a numeric suffix even when it is the only one of its type.
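
A minimal Python sketch of this alternative follows. It assumes the count starts at 1, follows declaration order, and is appended to every component; none of these details are fixed by this proposal, and the function name is hypothetical.

```python
def ids_with_universal_count(plugin_types, start=1):
    """Append one pipeline-wide running count to every component.

    Hypothetical helper; `start` is an assumption, not part of the proposal.
    """
    return [
        f"{plugin_type}{count}"
        for count, plugin_type in enumerate(plugin_types, start=start)
    ]


print(ids_with_universal_count(["http", "grok", "opensearch", "opensearch"]))
# prints ['http1', 'grok2', 'opensearch3', 'opensearch4']
```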

log-pipeline:
  source:
    http:
  prepper:
    - grok:
        match:
          log: [ "%{COMMONAPACHELOG}" ]
  sink:
    - opensearch:
        hosts: [ "https://opensearch-host-a" ]
    - opensearch:
        hosts: [ "https://opensearch-host-b" ]

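The Ids are: http1, grok2, opensearch3, and opensearch4, assuming the count starts at 1 and follows declaration order.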

Fully Scoped Ids

Data Prepper will generate and validate plugin Ids only within a single pipeline. Additionally, Data Prepper will support fully qualified component Ids. A fully qualified plugin Id will be unique across all pipelines. The format will be:

{pipelineName}.{pluginId}

This format is based on the current convention for plugin metrics. Data Prepper currently defines metrics by:

{pipelineName}.{pluginType}.{metricName}
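
As an illustration of the {pipelineName}.{pluginId} format, the following Python sketch qualifies Ids per pipeline and rejects duplicates inside one pipeline. The function name, the dict-of-lists input, and the choice to raise ValueError are all assumptions for the example, not Data Prepper behavior.

```python
def fully_qualified_ids(pipelines):
    """Build {pipelineName}.{pluginId} for every component.

    `pipelines` maps a pipeline name to its list of component Ids.
    Hypothetical helper: uniqueness is checked per pipeline only.
    """
    qualified = []
    for pipeline_name, plugin_ids in pipelines.items():
        # An Id may repeat across pipelines, but not within one.
        duplicates = sorted({i for i in plugin_ids if plugin_ids.count(i) > 1})
        if duplicates:
            raise ValueError(
                f"Duplicate Ids in pipeline {pipeline_name}: {duplicates}"
            )
        qualified.extend(f"{pipeline_name}.{plugin_id}" for plugin_id in plugin_ids)
    return qualified


print(fully_qualified_ids({
    "pipelineA": ["http", "opensearch"],
    "pipelineB": ["opensearch"],
}))
# prints ['pipelineA.http', 'pipelineA.opensearch', 'pipelineB.opensearch']
```

The same plugin Id, opensearch, appears in both pipelines without conflict because the pipeline name disambiguates it.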

Tasks

graytaylor0 commented 2 years ago

This looks good. I have a couple of comments:

  1. You forgot to fill in the ids in the configuration for Some Configured Ids

  2. Could you provide an example where there is a configuration file that splits into multiple pipelines? I am assuming the intention is to restart the count for a pipeline? If so is there an issue with two components getting the same id even if in separate pipelines? If not how is the incrementing decided? It would be nice to see if a pipeline sink or source gets an id as well.

  3. As for the approach outlined vs the alternatives, I agree that having only duplicates with numbers is nice, but not really providing any value that makes extra effort worth it. I do like the idea of having a count across components, but I’m not sure if it’s really any better than the original solution proposed either.

  4. Not entirely related to the id itself, but do you think there is value in adding each id an Event goes through to a list in the EventMetadata, so that conditionals could check where an Event has been (and also have the complete ordered path of an Event recorded)?

dlvenable commented 2 years ago

@graytaylor0 ,

You forgot to fill in the ids in the configuration for Some Configured Ids

Thanks, I updated this.

Could you provide an example where there is a configuration file that splits into multiple pipelines? I am assuming the intention is to restart the count for a pipeline? If so is there an issue with two components getting the same id even if in separate pipelines? If not how is the incrementing decided? It would be nice to see if a pipeline sink or source gets an id as well.

I added some clarification to the beginning of this issue description. And I added a brief section on Fully Qualified Ids.

I envision that the Ids within each pipeline are processed independently of each other. So Data Prepper may have two opensearch plugins, one in each pipeline, and the fully qualified Ids would be pipelineA.opensearch and pipelineB.opensearch.

As for the approach outlined vs the alternatives, I agree that having only duplicates with numbers is nice, but not really providing any value that makes extra effort worth it. I do like the idea of having a count across components, but I’m not sure if it’s really any better than the original solution proposed either.

My thinking is that when pipeline authors don't define plugin Ids, the Ids aren't too meaningful to them. So it seems Data Prepper should take the simplest solution. But, if other pipeline authors want to weigh in here, I think we can take different approaches.

Not entirely related to the id itself, but do you think there is value in adding each id an Event goes through to a list in the EventMetadata, so that conditionals could check where an Event has been (and also have the complete ordered path of an Event recorded)?

This is an attribution feature that I've discussed with a few users and colleagues. I do think this would be valuable, but it needs to be a distinct issue.

sharraj commented 1 year ago

Wondering if we can achieve this idea with conditional routing? The id: property could work like route:, which would help keep it consistent with the Conditional Routing feature.