opensearch-project / data-prepper

Data Prepper is a component of the OpenSearch project that accepts, filters, transforms, enriches, and routes data at scale.
https://opensearch.org/docs/latest/clients/data-prepper/index/
Apache License 2.0

S3 Scan Source partition supplier creates partitions in memory and a failure causes no partitions to be created #4608

Open graytaylor0 opened 3 months ago

graytaylor0 commented 3 months ago

Is your feature request related to a problem? Please describe.

As a user of S3 scan, I have a bucket with 100 million objects. The current S3 scan source cannot handle this many objects because the supplier returns all objects as a single in-memory list of partitions, which can lead to out-of-memory errors. Additionally, if the S3 scan supplier fails at any point, no partitions get created, because the supplier must return every partition before any of them is created in the coordination store.

Describe the solution you'd like

I would like the PartitionSupplier functions to be able to pass partitions back to the source coordinator for creation. As objects are found during a scan, the call to create each partition would be made right after the object is found, instead of holding all of them in memory.
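To illustrate the difference between the two shapes, here is a minimal sketch. Every name in it (`StreamingPartitionSketch`, `supplyAll`, `supplyStreaming`, the `createPartition` callback) is hypothetical and is not Data Prepper's actual API. Partitions are represented as plain object keys, and the scan is represented by an `Iterator<String>`.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch only: these names are illustrative assumptions,
// not the real PartitionSupplier or source coordinator interfaces.
final class StreamingPartitionSketch {

    private StreamingPartitionSketch() {}

    // Current shape: every partition key is collected in memory first.
    // If the scan throws partway through, the caller receives nothing,
    // so no partitions are created.
    static List<String> supplyAll(Iterator<String> scannedObjectKeys) {
        List<String> partitionKeys = new ArrayList<>();
        while (scannedObjectKeys.hasNext()) {
            partitionKeys.add(scannedObjectKeys.next());
        }
        return partitionKeys;
    }

    // Requested shape: each partition key is handed to a creation callback
    // as soon as the object is found. Memory use stays constant, and any
    // partitions created before a failure remain created.
    // Returns the number of partitions passed to the callback.
    static long supplyStreaming(Iterator<String> scannedObjectKeys,
                                Consumer<String> createPartition) {
        long created = 0;
        while (scannedObjectKeys.hasNext()) {
            createPartition.accept(scannedObjectKeys.next());
            created++;
        }
        return created;
    }
}
```

With the streaming shape, a scan that fails after finding some objects would still leave those objects' partitions in the store, whereas the list-returning shape would leave none. How retries and duplicate partition keys should be handled after such a failure is not covered by this sketch.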

dlvenable commented 3 months ago

@graytaylor0, are you planning on working on this?

graytaylor0 commented 3 months ago

@dlvenable I am not planning on working on this right now.

dayandersen commented 1 month ago

I encountered what I think is this issue. Would there be anything in the CloudWatch logs that I could use to verify whether I'm hitting this situation?