Support multiple sources for a pipeline

laneholloway commented 3 years ago

Is your feature request related to a problem? Please describe. Currently, Data Prepper pipelines accept data from only one input. I'd like to be able to bring data from multiple inputs into a single pipeline.

Describe the solution you'd like. Data Prepper pipelines should support multiple inputs into a single pipeline.

Describe alternatives you've considered (Optional) N/A

Additional context N/A

dlvenable commented 2 years ago

Supporting multiple sources could be done by continuing to share a single buffer. The following diagram outlines how this would work conceptually.

[Diagram: MultipleSources]

The pipeline syntax for this approach would change the source key to take a list of sources, rather than a single source.

The current YAML might look like the following.

my-pipeline:
  source:
    http:
      port: 2021
      ssl: true
  buffer:
    bounded_blocking:

With this change, it would instead look like the following.

my-pipeline:
  source:
    - http:
        port: 2021
        ssl: true
    - http:
        port: 2022
        ssl: false
  buffer:
    bounded_blocking:
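
To make the shared-buffer idea concrete, here is a minimal conceptual sketch in Python. It is not Data Prepper code: `run_source` and `run_pipeline` are hypothetical names, and a standard-library bounded queue stands in for the `bounded_blocking` buffer. It only illustrates that several sources can write into one buffer while a single consumer drains it.

```python
import queue
import threading


def run_source(name, records, buffer):
    """Hypothetical stand-in for a source plugin writing into the shared buffer."""
    for record in records:
        # put() blocks while the buffer is full, which applies back-pressure
        # to whichever source is writing.
        buffer.put((name, record))


def run_pipeline(sources, buffer_size=4):
    """Run every source against ONE bounded buffer and collect what arrives."""
    buffer = queue.Queue(maxsize=buffer_size)  # the single shared buffer
    threads = [
        threading.Thread(target=run_source, args=(name, records, buffer))
        for name, records in sources.items()
    ]
    for thread in threads:
        thread.start()

    # A single consumer stands in for the processor and sink workers.
    total = sum(len(records) for records in sources.values())
    received = [buffer.get() for _ in range(total)]

    for thread in threads:
        thread.join()
    return received


events = run_pipeline({
    "http-2021": ["a", "b", "c"],
    "http-2022": ["x", "y"],
})
```

Events from the two sources may interleave in any order, but each source's own events stay in the order that source wrote them. All sources compete for the same buffer capacity, so none of them can be given a dedicated share of it.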

dlvenable commented 2 years ago

An alternative design is to continue to support a single source per pipeline, but to expand the ability to connect pipelines so that a pipeline source can read from multiple pipelines.

The following diagram shows how this would work in a simple use-case where three sources share all their processing. This matches the example configuration in my previous comment.

[Diagram: MultipleSourcesViaPipelineConnector]

The YAML for this approach would be more involved:

my-pipeline-port-2021:
  source:
    http:
      port: 2021
      ssl: true
  buffer:
    bounded_blocking:
  sink:
    - pipeline:
        name: my-pipeline-combined

my-pipeline-port-2022:
  source:
    http:
      port: 2022
      ssl: false
  buffer:
    bounded_blocking:
  sink:
    - pipeline:
        name: my-pipeline-combined

my-pipeline-combined:
  source:
    pipeline:
      names:
        - my-pipeline-port-2021
        - my-pipeline-port-2022
  buffer:
    bounded_blocking:
  processor:
    # As usual
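
Because this approach spreads one logical pipeline across several named pipelines, the names have to line up on both sides of each connection. The following Python sketch shows one way such a configuration could be checked. It is an illustration only: `find_wiring_errors` is a hypothetical helper, the dictionaries are a trimmed hand-written mirror of the YAML, and the `names` key is the syntax proposed in this comment rather than an existing option.

```python
# Trimmed mirror of the YAML above as plain dictionaries (assumed shape).
pipelines = {
    "my-pipeline-port-2021": {
        "source": {"http": {"port": 2021}},
        "sink": [{"pipeline": {"name": "my-pipeline-combined"}}],
    },
    "my-pipeline-port-2022": {
        "source": {"http": {"port": 2022}},
        "sink": [{"pipeline": {"name": "my-pipeline-combined"}}],
    },
    "my-pipeline-combined": {
        "source": {
            "pipeline": {
                "names": ["my-pipeline-port-2021", "my-pipeline-port-2022"],
            },
        },
    },
}


def find_wiring_errors(pipelines):
    """Return a message for every upstream reference that does not line up."""
    errors = []
    for name, config in pipelines.items():
        source = config.get("source", {})
        upstream_names = source.get("pipeline", {}).get("names", [])
        for upstream in upstream_names:
            if upstream not in pipelines:
                errors.append(f"{name}: unknown upstream pipeline '{upstream}'")
                continue
            # The upstream pipeline must also have a pipeline sink pointing back.
            sinks = pipelines[upstream].get("sink", [])
            targets = [s["pipeline"]["name"] for s in sinks if "pipeline" in s]
            if name not in targets:
                errors.append(f"{upstream}: no pipeline sink pointing at '{name}'")
    return errors


errors = find_wiring_errors(pipelines)
```

With the names as written here, `errors` is empty. A misspelled entry in `names` would be reported as an unknown upstream pipeline.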

There are a few advantages to this alternative approach.

It likely fits most use-cases. The main request I've heard for multiple sources is to combine them downstream. It is quite likely that each source will need some initial processing specific to that source.
This can discourage bad combinations of processors. Some sources probably should not be combined directly. For example, putting a trace source and an HTTP source together would not make sense and would probably lead to a bad pipeline. Connecting multiple pipelines gives each pipeline the opportunity to do custom processing for its type. For example, the trace pipeline can do its processing before sending to a pipeline that can handle both logs and traces.
Multiple buffer configurations may help with prioritization. A pipeline author might want to dedicate more buffer space to an HTTP source than to an S3 source, since the S3 source is a pull-based source.
Putting our effort into supporting connected pipelines is probably a better use of our development time.

This approach has some downsides:

Pipeline authors who truly can combine multiple sources without any per-source processors will find this more complicated to configure.
Buffer management becomes more complicated for pipeline authors. They now have to think about how much space they want for each source and then how much for the combined pipelines.
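
The buffer trade-off can be sketched conceptually in Python. Again, this is not Data Prepper code: `produce`, `forward`, `run_connected`, and the capacities are hypothetical, and standard-library queues stand in for the buffers. Each upstream pipeline gets its own bounded buffer and a forwarder thread plays the role of the pipeline sink feeding the combined pipeline.

```python
import queue
import threading


def produce(events, buffer):
    """Stand-in for a source filling its own pipeline's buffer."""
    for event in events:
        buffer.put(event)  # blocks while this pipeline's buffer is full


def forward(upstream, combined, count):
    """Stand-in for a pipeline sink feeding the downstream pipeline source."""
    for _ in range(count):
        combined.put(upstream.get())


def run_connected(inputs, buffer_sizes, combined_size):
    """Fan several independently buffered pipelines into one combined buffer."""
    combined = queue.Queue(maxsize=combined_size)
    workers = []
    for name, events in inputs.items():
        # Each upstream pipeline has its own capacity, chosen by the author.
        upstream = queue.Queue(maxsize=buffer_sizes[name])
        workers.append(threading.Thread(target=produce, args=(events, upstream)))
        workers.append(
            threading.Thread(target=forward, args=(upstream, combined, len(events)))
        )
    for worker in workers:
        worker.start()

    total = sum(len(events) for events in inputs.values())
    received = [combined.get() for _ in range(total)]

    for worker in workers:
        worker.join()
    return received


# Hypothetical sizing: more space for the push-based HTTP source than for S3.
combined_events = run_connected(
    inputs={
        "http": [f"http-{i}" for i in range(6)],
        "s3": [f"s3-{i}" for i in range(3)],
    },
    buffer_sizes={"http": 8, "s3": 2},
    combined_size=4,
)
```

The sketch has three capacities to choose (`http`, `s3`, and the combined buffer) where the shared-buffer design has one, which is the extra decision pipeline authors would take on in exchange for per-source control.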

n.b. There might be other ways to configure the pipeline source to receive from multiple sources. My goal here is more to convey a possible solution to multiple sources, not to fully design how the pipeline source could work.

opensearch-project / data-prepper

Support multiple sources for a pipeline #406