opensearch-project / data-prepper

Data Prepper is a component of the OpenSearch project that accepts, filters, transforms, enriches, and routes data at scale.
https://opensearch.org/docs/latest/clients/data-prepper/index/
Apache License 2.0

Support multiple sources for a pipeline #406

Open laneholloway opened 2 years ago

laneholloway commented 2 years ago

Is your feature request related to a problem? Please describe.
Currently, Data Prepper pipelines only accept data from one input. I'd like to be able to bring data from multiple inputs into a single pipeline.

Describe the solution you'd like
Data Prepper pipelines support multiple inputs into a single pipeline.

Describe alternatives you've considered (Optional)
N/A

Additional context
N/A

dlvenable commented 1 year ago

Supporting multiple sources could be done by continuing to share a single buffer. The following diagram outlines how this would work conceptually.

[Diagram: MultipleSources]
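To make the shared-buffer idea concrete, here is a rough Python sketch. It is not Data Prepper code: `run_source` and `run_pipeline` are hypothetical names, and a bounded `queue.Queue` stands in for the `bounded_blocking` buffer. Each source runs on its own thread and writes into the same buffer, while a single consumer stands in for the processor and sink side.

```python
import queue
import threading


def run_source(name, records, buffer):
    """Hypothetical source: write each record into the shared buffer."""
    for record in records:
        # put() blocks while the bounded buffer is full, which gives
        # every source the same back-pressure.
        buffer.put((name, record))


def run_pipeline(sources, capacity=4):
    """Start every source against one bounded buffer and drain it."""
    buffer = queue.Queue(maxsize=capacity)
    threads = [
        threading.Thread(target=run_source, args=(name, records, buffer))
        for name, records in sources.items()
    ]
    for thread in threads:
        thread.start()

    total = sum(len(records) for records in sources.values())
    # A single consumer stands in for the processors and sinks.
    drained = [buffer.get() for _ in range(total)]
    for thread in threads:
        thread.join()
    return drained


if __name__ == "__main__":
    events = run_pipeline({
        "http-2021": ["a", "b", "c"],
        "http-2022": ["x", "y"],
    })
    print(len(events))
```

Records from different sources can interleave in any order, but each source's own records keep their relative order because every source has a single writer.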

The pipeline syntax for this approach would be to change the source key to have a list of sources, rather than a single source.

The current YAML might look like the following.

my-pipeline:
  source:
    http:
      port: 2021
      ssl: true
  buffer:
    bounded_blocking:

With this change, it would instead look like the following.

my-pipeline:
  source:
    - http:
        port: 2021
        ssl: true
    - http:
        port: 2022
        ssl: false
  buffer:
    bounded_blocking:
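
If the list form were adopted, existing single-source configurations would presumably need to keep working. A minimal sketch of how a loader might accept both shapes, assuming the YAML has already been parsed into Python dicts and lists (`normalize_sources` is a hypothetical helper, not an existing Data Prepper function):

```python
def normalize_sources(pipeline_config):
    """Return the pipeline's sources as a list, whichever syntax was used."""
    source = pipeline_config.get("source")
    if isinstance(source, dict):
        # Current syntax: a single mapping such as {"http": {...}}.
        return [source]
    if isinstance(source, list):
        # Proposed syntax: a list of such mappings.
        return list(source)
    raise ValueError("pipeline must define a source mapping or a list of sources")


if __name__ == "__main__":
    single = {"source": {"http": {"port": 2021, "ssl": True}}}
    multiple = {"source": [
        {"http": {"port": 2021, "ssl": True}},
        {"http": {"port": 2022, "ssl": False}},
    ]}
    print(len(normalize_sources(single)), len(normalize_sources(multiple)))
```

Normalizing at load time would let the rest of the pipeline treat every configuration as having a list of sources.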

dlvenable commented 1 year ago

An alternative design is to continue to support a single source per pipeline, but expand the ability to connect pipelines so that a pipeline source can read from multiple pipelines.

The following diagram shows how this would work in a simple use-case where three sources share all their processing. This matches the example configuration in my previous comment.

[Diagram: MultipleSourcesViaPipelineConnector]

The YAML for this approach would be more involved:

my-pipeline-port-2021:
  source:
    http:
      port: 2021
      ssl: true
  buffer:
    bounded_blocking:
  sink:
    - pipeline:
        name: my-pipeline-combined

my-pipeline-port-2022:
  source:
    http:
      port: 2022
      ssl: true
  buffer:
    bounded_blocking:
  sink:
    - pipeline:
        name: my-pipeline-combined

my-pipeline-combined:
  source:
    pipeline:
      names:
        - my-pipeline-port-2021
        - my-pipeline-port-2022
  buffer:
    bounded_blocking:
  processor:
    # As usual
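
Because this approach names each connection twice (once in an upstream sink and once in the combined source), the two lists can drift apart. A small sketch of a consistency check, assuming the configuration has been parsed into nested dicts and that the `names` key follows the syntax proposed in this comment (`unknown_upstreams` is a hypothetical helper):

```python
def unknown_upstreams(pipelines, combined_name):
    """Return names listed in the combined source that no pipeline sinks from."""
    listed = pipelines[combined_name]["source"]["pipeline"]["names"]
    # Collect every pipeline that has a pipeline sink pointing at combined_name.
    feeders = {
        name
        for name, config in pipelines.items()
        for sink in config.get("sink") or []
        if (sink.get("pipeline") or {}).get("name") == combined_name
    }
    return [name for name in listed if name not in feeders]


if __name__ == "__main__":
    to_combined = [{"pipeline": {"name": "my-pipeline-combined"}}]
    pipelines = {
        "my-pipeline-port-2021": {"sink": to_combined},
        "my-pipeline-port-2022": {"sink": to_combined},
        "my-pipeline-combined": {
            "source": {"pipeline": {"names": [
                "my-pipeline-port-2021",
                "my-pipeline-port-2022",
            ]}},
        },
    }
    print(unknown_upstreams(pipelines, "my-pipeline-combined"))
```

An empty result means every listed name has a matching upstream sink. Any name it returns refers to a pipeline that is either missing or does not send to the combined pipeline.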

There are a few advantages to this alternative approach.

This approach has some downsides:

n.b. There might be other ways to configure the pipeline source to receive from multiple pipelines. My goal here is to convey a possible approach to supporting multiple sources, not to fully design how the pipeline source would work.