opensearch-project / data-prepper

Data Prepper is a component of the OpenSearch project that accepts, filters, transforms, enriches, and routes data at scale.
https://opensearch.org/docs/latest/clients/data-prepper/index/
Apache License 2.0

S3 buffer using pipeline transformations #4809

Open dlvenable opened 1 month ago

dlvenable commented 1 month ago

Is your feature request related to a problem? Please describe.

For smaller workloads that need durability, using S3 as a buffer can be a good solution.

Describe the solution you'd like

Data Prepper already has a few things that we can combine to create an S3 buffer.

  1. An S3 source
  2. An S3 sink
  3. Pipeline transformations

I propose a new buffer, pipeline_s3, which is implemented only as a pipeline transformation. A pipeline that declares it would look like this:

my-pipeline:
  source:
    http:
  buffer:
    pipeline_s3:
      bucket: mybucket
  sink:
    - opensearch:

This would transform into:

my-pipeline-source:
  source:
    http:
  buffer:
    bounded_blocking:
  sink:
    - s3:
        bucket: mybucket

my-pipeline-sink:
  source:
    s3:
      scan:
        buckets:
          - bucket:
              name: mybucket
  buffer:
    bounded_blocking:
  sink:
    - opensearch:
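
To make the rewrite above concrete, here is a minimal Python sketch of the same transformation applied to an already-parsed pipeline configuration. It is only an illustration of the idea, not how Data Prepper implements pipeline transformations. The function name split_pipeline_s3 is hypothetical, and placing any processors in the second pipeline is an assumption, since the proposal does not say where they should go.

```python
import copy


def split_pipeline_s3(pipelines: dict) -> dict:
    """Expand each pipeline that uses the proposed pipeline_s3 buffer into
    a "<name>-source" pipeline and a "<name>-sink" pipeline."""
    result = {}
    for name, pipeline in pipelines.items():
        s3_buffer = (pipeline.get("buffer") or {}).get("pipeline_s3")
        if s3_buffer is None:
            # Not using the proposed buffer: pass the pipeline through unchanged.
            result[name] = copy.deepcopy(pipeline)
            continue

        bucket = s3_buffer["bucket"]

        # First half: the original source, writing to the S3 bucket.
        result[f"{name}-source"] = {
            "source": copy.deepcopy(pipeline["source"]),
            "buffer": {"bounded_blocking": None},
            "sink": [{"s3": {"bucket": bucket}}],
        }

        # Second half: scan the same bucket and feed the original sinks.
        second = {
            "source": {
                "s3": {"scan": {"buckets": [{"bucket": {"name": bucket}}]}}
            },
            "buffer": {"bounded_blocking": None},
        }
        if "processor" in pipeline:
            # Assumption: processors run after the S3 hop, not before it.
            second["processor"] = copy.deepcopy(pipeline["processor"])
        second["sink"] = copy.deepcopy(pipeline["sink"])
        result[f"{name}-sink"] = second
    return result


# The example pipeline from this issue, as it would look after YAML parsing.
original = {
    "my-pipeline": {
        "source": {"http": None},
        "buffer": {"pipeline_s3": {"bucket": "mybucket"}},
        "sink": [{"opensearch": None}],
    }
}
expanded = split_pipeline_s3(original)
```

Pipelines that use any other buffer are copied through untouched, so the transformation only affects configurations that opt in.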

Describe alternatives you've considered (Optional)

We could implement an S3 buffer similar to the Kafka buffer that does not require splitting the pipeline. However, building the transformation-based version would be quite a bit faster.

Also, I think we should leave room for a possible S3 buffer that is implemented as an actual buffer. My proposal is to give this buffer a name that keeps it distinct from such an S3 buffer and avoids confusion with other buffers such as Kafka. Thus, I called it pipeline_s3.

One alternative to changing the name is to use a flag instead, such as split_pipeline: true or asynchronous_buffer: true.

Additional context

N/A

kkondaka commented 1 month ago

David, we probably need some kind of partitioning mechanism (using folders), and we need to make sure items in a partition are processed in order.
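
One possible reading of this, sketched in Python under stated assumptions: each event's partition key is hashed to a fixed folder prefix, objects within a folder get zero-padded sequence numbers, and a reader processes each folder's objects in sorted key order. Every name here (NUM_PARTITIONS, partition_prefix, object_key, read_order) is hypothetical and not part of Data Prepper, and the sketch assumes a single writer assigns the sequence numbers for each partition.

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical partition count


def partition_prefix(partition_key: str, num_partitions: int = NUM_PARTITIONS) -> str:
    """Map a partition key to a stable folder prefix such as 'partition-2/'."""
    # Use a cryptographic digest so the mapping is stable across processes
    # (Python's built-in hash() is randomized per process for strings).
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % num_partitions
    return f"partition-{index}/"


def object_key(partition_key: str, sequence: int) -> str:
    """Build an object key whose lexicographic order matches write order."""
    # Zero-padding keeps "000000000010" sorted after "000000000002".
    return f"{partition_prefix(partition_key)}{sequence:012d}.ndjson"


def read_order(keys):
    """Group object keys by folder and sort each group, giving the order
    in which a reader would process the objects of each partition."""
    by_partition = defaultdict(list)
    for key in keys:
        folder, _, _ = key.partition("/")
        by_partition[folder].append(key)
    return {folder: sorted(objs) for folder, objs in by_partition.items()}


# Objects listed out of order still come back in write order per partition.
listed = [object_key("host-a", n) for n in (10, 2, 0)]
ordered = read_order(listed)
```

Ordering is only guaranteed within a folder; different partitions can be read in parallel as long as each folder has one reader at a time.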