opensearch-project / data-prepper

Data Prepper is a component of the OpenSearch project that accepts, filters, transforms, enriches, and routes data at scale.
https://opensearch.org/docs/latest/clients/data-prepper/index/
Apache License 2.0

S3 source - allow cross region access #4470

Open brianmaresca opened 4 months ago

brianmaresca commented 4 months ago

Is your feature request related to a problem? Please describe.

Currently, I do not see any way for a single pipeline to consume an S3 source (with SQS) for S3 buckets that are in different regions. It would be nice to have this ability.

Example scenario:

With the above configuration, everything goes smoothly for us-east-1. However, the pipeline fails to get objects from the us-west-2 bucket because the S3 client is configured for us-east-1. The (not very informative) error log is:

    [s3-source-sqs-1] ERROR org.opensearch.dataprepper.plugins.source.s3.SqsWorker - Error processing from S3: null (Service: S3, Status Code: 400, Request ID: xxxx, Extended Request ID: xxxx)

Describe the solution you'd like

Enable cross-region access on the S3 client (or provide an option to enable it) so that it can download objects from buckets in regions other than the one defined in the YAML config. See https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/s3-cross-region.html.

Potential solution: add .crossRegionAccessEnabled(true) to createS3Client in S3ClientBuilderFactory:

    public S3Client createS3Client() {
        LOG.info("Creating S3 client");
        return S3Client.builder()
                // Proposed addition: let the client reach buckets outside the configured region.
                .crossRegionAccessEnabled(true)
                .region(s3SourceConfig.getAwsAuthenticationOptions().getAwsRegion())
                .credentialsProvider(credentialsProvider)
                .overrideConfiguration(ClientOverrideConfiguration.builder()
                        .retryPolicy(retryPolicy -> retryPolicy.numRetries(5).build())
                        .build())
                .build();
    }

Describe alternatives you've considered (Optional)

Using a separate pipeline and SQS queue for each bucket that is in a different region. But this feels silly: it means an extra SQS queue, an extra pipeline, and duplicated configuration.
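Another alternative that would keep a single pipeline and queue is to hold one client per region and pick the client by the region of each incoming object. The sketch below is only an illustration of that idea, not something Data Prepper does today. The class name RegionalClientCache and its methods are hypothetical, and the client type is left generic so the example runs without the AWS SDK.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical helper: lazily creates and reuses one client per region name.
final class RegionalClientCache<C> {
    private final Map<String, C> clients = new ConcurrentHashMap<>();
    private final Function<String, C> factory;

    // factory builds a client for a region name such as "us-west-2".
    RegionalClientCache(Function<String, C> factory) {
        this.factory = factory;
    }

    // Returns the cached client for the region, creating it on first use.
    C forRegion(String region) {
        return clients.computeIfAbsent(region, factory);
    }

    // Number of distinct regions for which a client has been created.
    int size() {
        return clients.size();
    }
}
```

With the AWS SDK, the factory could be something like region -> S3Client.builder().region(Region.of(region)).build(), and the region could be taken from the awsRegion field of the S3 event notification record. Compared with crossRegionAccessEnabled(true), this needs more code in the source, so the one-line change above is probably the simpler fix.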

dlvenable commented 4 months ago

@brianmaresca, thank you for creating this detailed issue. It seems you are familiar with the solution. Would you be interested in contributing a PR for it?