opensearch-project / data-prepper

OpenSearch Data Prepper is a component of the OpenSearch project that accepts, filters, transforms, enriches, and routes data at scale.
https://opensearch.org/docs/latest/clients/data-prepper/index/
Apache License 2.0

Add sinks under extensions #4799

Open kkondaka opened 3 months ago

kkondaka commented 3 months ago

Is your feature request related to a problem? Please describe. Add sinks under extensions so that they can be used as a sink for the pipeline DLQ, as a sink for Live Debug Event capture, or to avoid duplicated configuration when the same sink is used multiple times (customers find it unnecessary to repeat the same sink configuration when the configuration contains multiple sub-pipelines).

Describe the solution you'd like Allow sink configuration like the following under the extensions section:

extensions:
  sinks:
    opensearch-sink1:
      hosts: [ <OPENSEARCH-ENDPOINT> ]
      aws:
        region: "<region>"
        sts_role_arn: "<ARN>"
      index: <index1>

pipeline:
  source:
  processor:
  sink:
    - opensearch:
        use: opensearch-sink1
        route:
          - route1

    - pipeline:
        name: pipeline2
        route:
          - route2
pipeline2:
  source:
    ...
  processor:
  sink:
    - opensearch:
        use: opensearch-sink1
        index: <index2>  # override some config as needed
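
How `use` plus an override such as `index: <index2>` might be resolved is not specified in the proposal. The following is only a minimal sketch, assuming a recursive merge in which keys set next to `use` win over the shared definition. The function names `deep_merge` and `resolve_sink` are hypothetical and are not part of Data Prepper.

```python
import copy


def deep_merge(base, override):
    """Return a new dict where `override` wins and nested dicts merge recursively."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def resolve_sink(shared_sinks, sink_config):
    """Expand a sink entry that references a shared definition via `use`.

    Assumption: every key other than `use` is treated as an override.
    """
    sink_config = sink_config or {}
    name = sink_config.get("use")
    if name is None:
        # No reference: the entry is already a complete sink configuration.
        return copy.deepcopy(sink_config)
    if name not in shared_sinks:
        raise KeyError(f"sink '{name}' is not defined under extensions.sinks")
    overrides = {k: v for k, v in sink_config.items() if k != "use"}
    return deep_merge(shared_sinks[name], overrides)


# Shared definition, mirroring extensions.sinks in the proposal.
shared = {
    "opensearch-sink1": {
        "hosts": ["<OPENSEARCH-ENDPOINT>"],
        "aws": {"region": "<region>", "sts_role_arn": "<ARN>"},
        "index": "<index1>",
    }
}

# pipeline2 reuses the shared sink but overrides the index.
resolved = resolve_sink(shared, {"use": "opensearch-sink1", "index": "<index2>"})
print(resolved)
```

Whether nested sections such as `aws` should merge key by key or be replaced wholesale is a design choice the proposal leaves open; the sketch merges them.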

Using a dynamic sink as the pipeline DLQ or live debug capture sink:

extensions:
  sinks:
    s3-sink1:
      bucket: <bucket>
      object_key:
        path_prefix: pfx
      threshold:
        event_collect_timeout: 5s
        event_count: 10
      aws:
        region: "<region>"
        sts_role_arn: "<arn>"
  dlq:   # This is a RESERVED name
    s3:
      use: s3-sink1

  live_capture:   # This is a RESERVED name
    s3:
      use: s3-sink1
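
Because `dlq` and `live_capture` refer to a sink by name, a configuration loader would presumably need to reject references to sinks that were never defined. The following is a rough sketch of such a check, assuming the extensions section has already been parsed into nested dictionaries. The function name `find_dangling_references` and the error message format are hypothetical.

```python
# Reserved names from the proposal that may reference a shared sink.
RESERVED_NAMES = ("dlq", "live_capture")


def find_dangling_references(extensions):
    """Return messages for `use` references that name no entry in extensions.sinks."""
    sinks = extensions.get("sinks") or {}
    problems = []
    for reserved in RESERVED_NAMES:
        for sink_type, config in (extensions.get(reserved) or {}).items():
            name = (config or {}).get("use")
            if name is not None and name not in sinks:
                problems.append(f"{reserved}.{sink_type}: unknown sink '{name}'")
    return problems


# Parsed form of the extensions section, with one deliberately bad reference.
extensions = {
    "sinks": {"s3-sink1": {"bucket": "<bucket>"}},
    "dlq": {"s3": {"use": "s3-sink1"}},
    "live_capture": {"s3": {"use": "s3-sink2"}},  # s3-sink2 is not defined
}

problems = find_dangling_references(extensions)
print(problems)
```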

Describe alternatives you've considered (Optional) The keywords used in the above proposal, such as "sinks" and "use", are placeholders; different or more appropriate names may be used.


dlvenable commented 3 months ago

@kkondaka, I do believe that we have some pain points in our current approach that we can try to improve.

For the problems with shared sinks:

  1. We have an existing proposal to re-use the "connection" part of OpenSearch sinks in #2590. I think we should be looking along these lines because if you are sending data to multiple sinks now, there is some variation in the sinks. As noted in the other issue, a common difference will be the target indexes and mappings.
  2. We have another proposal for OpenSearch coordination in #2589. The idea here is that multiple sinks can share _bulk requests to the OpenSearch domain to reduce load. And we could also improve backpressure this way.

For pipeline DLQs, I think we should continue with the proposals in #3857 rather than try to make this an extension only. One major advantage to that approach is that the DLQ is a pipeline itself, allowing for mutations before writing to the final DLQ sink.

kkondaka commented 3 months ago

@dlvenable I agree with the #2590 proposal, but instead of having a new connection in the pipeline configuration, I am suggesting that we put it in extensions. I think #2589 is a slightly different issue and, in my opinion, really a configuration issue.

#3857: I remember this. We can move dlq and live_capture from extensions into the pipeline but still use something like use: <sink-from-extension>

I think the basic idea is to define a "sink" in a common place and re-use it instead of specifying it in multiple places in the config.