Open dlvenable opened 2 years ago
This looks good. I have a couple of comments:
You forgot to fill in the ids in the configuration for Some Configured Ids
Could you provide an example where there is a configuration file that splits into multiple pipelines? I am assuming the intention is to restart the count for a pipeline? If so is there an issue with two components getting the same id even if in separate pipelines? If not how is the incrementing decided? It would be nice to see if a pipeline sink or source gets an id as well.
As for the approach outlined vs the alternatives, I agree that having only duplicates with numbers is nice, but not really providing any value that makes extra effort worth it. I do like the idea of having a count across components, but I’m not sure if it’s really any better than the original solution proposed either.
Not entirely related to the id itself, but do you think there is value in adding each id an Event goes through to a list in the EventMetadata so that conditionals could check where an Event has been (and also have the complete ordered path of an Event recorded)
@graytaylor0 ,
You forgot to fill in the ids in the configuration for Some Configured Ids
Thanks, I updated this.
Could you provide an example where there is a configuration file that splits into multiple pipelines? I am assuming the intention is to restart the count for a pipeline? If so is there an issue with two components getting the same id even if in separate pipelines? If not how is the incrementing decided? It would be nice to see if a pipeline sink or source gets an id as well.
I added some clarification to the beginning of this issue description. And I added a brief section on Fully Qualified Ids.
I envision that the Ids within each pipeline are processed independently of each other. So Data Prepper may have two opensearch plugins, but the fully qualified Ids yield pipelineA.opensearch and pipelineB.opensearch.
As for the approach outlined vs the alternatives, I agree that having only duplicates with numbers is nice, but not really providing any value that makes extra effort worth it. I do like the idea of having a count across components, but I’m not sure if it’s really any better than the original solution proposed either.
My thinking is that when pipeline authors don't define plugin Ids, the Ids aren't too meaningful to them. So it seems Data Prepper should take the simplest solution. But, if other pipeline authors want to weigh in here, I think we can take different approaches.
Not entirely related to the id itself, but do you think there is value in adding each id an Event goes through to a list in the EventMetadata so that conditionals could check where an Event has been (and also have the complete ordered path of an Event recorded)
This is an attribution feature that I've discussed with a few users and colleagues. I do think this would be valuable, but needs to be a distinct issue.
I'm wondering if we can achieve this idea with conditional routing. The id: property could work like route:, which would help keep it consistent with the Conditional Routing feature.
Introduction
A Data Prepper pipeline can contain multiple sources, processors, and sinks of the same type. Presently, these cannot be distinguished.
Proposed Solution
Data Prepper should assign a unique identifier for each pipeline component. The scope of the Id is within the current pipeline. There will be a fully-qualified Id, which is discussed at the end of this issue description. For most of this discussion, the Id is unique only within a single pipeline.
Additionally, pipeline authors may wish to configure some component Ids. This can help them debug their pipelines and make them more readable.
The following example shows how a pipeline author can configure the Id using an id property:
Pipeline authors do not need to configure the id. Data Prepper will produce a default value.
Id Generation
The default Id generation should be deterministic. This will allow the peer-forwarder to use the id of a component and consistently supply Events to the correct component in a peer node.
The default Id generation can be:
The incrementedCount will be a number which is incremented for each component type individually. It can thus be stored in a map: Map<String, Integer> typeToIncrementedCount. The count will be incremented before applying the function above. So the first of any given type has incrementedCount == 1. This approach allows the pipelines without duplicates to continue to use the pluginType without having a trailing 1.
Examples
No Configured Ids
The Ids are:
http
grok
opensearch
opensearch2
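The default rule described under Id Generation, applied to this example, can be sketched in Java roughly as follows. This is a minimal sketch, not Data Prepper's implementation: the class DefaultIdGenerator and its methods are hypothetical names, and the rule (bare pluginType for the first component of a type, pluginType plus the count afterwards) is inferred from the Ids listed above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of Data Prepper's API.
class DefaultIdGenerator {
    // Per-type counter, as suggested in the issue description.
    private final Map<String, Integer> typeToIncrementedCount = new HashMap<>();

    // Increments the count for the type first, then derives the Id from it.
    String nextId(String pluginType) {
        int incrementedCount = typeToIncrementedCount.merge(pluginType, 1, Integer::sum);
        // Assumed rule: the first component of a type keeps the bare type name.
        return incrementedCount == 1 ? pluginType : pluginType + incrementedCount;
    }

    // Generates Ids for the components of one pipeline, in configuration order.
    static List<String> generate(List<String> pluginTypes) {
        DefaultIdGenerator generator = new DefaultIdGenerator();
        List<String> ids = new ArrayList<>();
        for (String pluginType : pluginTypes) {
            ids.add(generator.nextId(pluginType));
        }
        return ids;
    }

    public static void main(String[] args) {
        // Prints [http, grok, opensearch, opensearch2]
        System.out.println(generate(List.of("http", "grok", "opensearch", "opensearch")));
    }
}
```

Because the result depends only on the order of components in the pipeline configuration, every node that loads the same configuration derives the same Ids, which is the determinism the peer-forwarder needs.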
Some Configured Ids
The Ids are:
http
extract-apache-logs
opensearch-a
opensearch2
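The Ids above suggest that the per-type count still advances for a component whose id is configured, since the second opensearch sink becomes opensearch2 even though the first was named opensearch-a. A minimal Java sketch under that assumption follows. ConfiguredIdResolver and Component are hypothetical names, and collisions between configured and generated Ids are not handled here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of Data Prepper's API.
class ConfiguredIdResolver {
    // A pipeline component: its plugin type and an optional configured id (null when absent).
    static final class Component {
        final String pluginType;
        final String configuredId;

        Component(String pluginType, String configuredId) {
            this.pluginType = pluginType;
            this.configuredId = configuredId;
        }
    }

    static List<String> resolve(List<Component> components) {
        Map<String, Integer> typeToIncrementedCount = new HashMap<>();
        List<String> ids = new ArrayList<>();
        for (Component component : components) {
            // Assumption: the count advances for every component of the type,
            // including those that carry a configured id.
            int incrementedCount = typeToIncrementedCount.merge(component.pluginType, 1, Integer::sum);
            if (component.configuredId != null) {
                ids.add(component.configuredId);
            } else {
                ids.add(incrementedCount == 1 ? component.pluginType : component.pluginType + incrementedCount);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Component> pipeline = List.of(
                new Component("http", null),
                new Component("grok", "extract-apache-logs"),
                new Component("opensearch", "opensearch-a"),
                new Component("opensearch", null));
        // Prints [http, extract-apache-logs, opensearch-a, opensearch2]
        System.out.println(resolve(pipeline));
    }
}
```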
Alternatives
Duplicates Always Have Count Suffix
Another approach is to identify any plugin type that has more than one plugin in the pipeline. Only those that have more than one will have a suffix. This can be nice because each plugin of the same type has a more consistent name.
The disadvantage is that it may be more complicated to support. Is the improvement to the name really worthwhile here? Pipeline authors who want better names can control the id already.
The Ids are:
http
grok
opensearch1
opensearch2
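The extra complexity of this alternative comes from needing to know the total per type before any Id can be assigned. A minimal two-pass Java sketch, assuming a hypothetical SuffixDuplicatesIdGenerator class, could look like this:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of Data Prepper's API.
class SuffixDuplicatesIdGenerator {
    static List<String> generate(List<String> pluginTypes) {
        // First pass: count how many components of each type the pipeline has.
        Map<String, Integer> totalByType = new HashMap<>();
        for (String pluginType : pluginTypes) {
            totalByType.merge(pluginType, 1, Integer::sum);
        }

        // Second pass: only types that occur more than once receive a suffix.
        Map<String, Integer> seenByType = new HashMap<>();
        List<String> ids = new ArrayList<>();
        for (String pluginType : pluginTypes) {
            if (totalByType.get(pluginType) == 1) {
                ids.add(pluginType);
            } else {
                int position = seenByType.merge(pluginType, 1, Integer::sum);
                ids.add(pluginType + position);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Prints [http, grok, opensearch1, opensearch2]
        System.out.println(generate(List.of("http", "grok", "opensearch", "opensearch")));
    }
}
```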
Count Across Components
Data Prepper could increment a universal count. The disadvantage is that a component gets a numeric suffix even when it is the only one of its type.
The Ids are:
http1
grok2
opensearch3
opensearch4
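For comparison, a universal count needs no per-type map at all. The following Java sketch uses a hypothetical UniversalCountIdGenerator class and assumes the count starts at 1 and advances once per component, as the Ids above imply:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Data Prepper's API.
class UniversalCountIdGenerator {
    static List<String> generate(List<String> pluginTypes) {
        List<String> ids = new ArrayList<>();
        int count = 0;
        for (String pluginType : pluginTypes) {
            // One counter shared by all components, so every Id carries a suffix.
            count++;
            ids.add(pluginType + count);
        }
        return ids;
    }

    public static void main(String[] args) {
        // Prints [http1, grok2, opensearch3, opensearch4]
        System.out.println(generate(List.of("http", "grok", "opensearch", "opensearch")));
    }
}
```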
Fully Scoped Ids
Data Prepper will generate and validate plugin Ids only within a single pipeline. Additionally, Data Prepper will support fully qualified component Ids. A fully-qualified plugin Id will be unique across all pipelines. The format will be:
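Going only by the pipelineA.opensearch and pipelineB.opensearch examples given in the discussion, qualification can be sketched in Java as the pipeline name joined to the component Id with a dot. The dot separator and the names FullyQualifiedIds and fullyQualifiedId are assumptions made for illustration, not a confirmed format or API.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper for illustration; not part of Data Prepper's API.
class FullyQualifiedIds {
    // Assumed format: <pipelineName>.<componentId>, inferred from "pipelineA.opensearch".
    static String fullyQualifiedId(String pipelineName, String componentId) {
        return pipelineName + "." + componentId;
    }

    public static void main(String[] args) {
        // Two pipelines may each contain a component whose pipeline-scoped Id is "opensearch".
        Set<String> ids = new LinkedHashSet<>();
        ids.add(fullyQualifiedId("pipelineA", "opensearch"));
        ids.add(fullyQualifiedId("pipelineB", "opensearch"));
        // Prints [pipelineA.opensearch, pipelineB.opensearch]
        System.out.println(ids);
    }
}
```

Since pipeline names are distinct and component Ids are unique within each pipeline, the combined Id stays unique across all pipelines even when two pipelines use the same plugin type.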
This format is based on the current convention for plugin metrics. Data Prepper currently defines metrics by:
Tasks