ksanderer opened 1 month ago
If we are on AWS, one potential interim solution is using S3 Event triggers (S3 + Lambda) to automatically send data to the Python service whenever a new file is uploaded. This would reduce the manual step of specifying file keys and streamline the ingestion process. However, this might not help in non-AWS environments.
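As a rough sketch of that trigger, a Lambda handler could parse the S3 event notification and forward each new object's bucket/key to the existing Python service. The `INGEST_URL` endpoint below is hypothetical; adjust it for however the service is actually exposed:

```python
import json
import urllib.request

# Hypothetical HTTP endpoint of the existing Python ingestion service.
INGEST_URL = "http://ingest-service.internal:8080/ingest"

def extract_uploads(event):
    """Pull (bucket, key) pairs out of an S3 event notification."""
    return [
        (r["s3"]["bucket"]["name"], r["s3"]["object"]["key"])
        for r in event.get("Records", [])
    ]

def handler(event, context):
    """Lambda entry point: forward each newly uploaded object to the service."""
    for bucket, key in extract_uploads(event):
        payload = json.dumps({"bucket": bucket, "key": key}).encode()
        req = urllib.request.Request(
            INGEST_URL,
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)
```

This removes the manual "specify a file key" step: S3 invokes the Lambda on every upload, and the service receives the key automatically.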
For environments like DigitalOcean, exploring existing third-party services for file ingestion that can be hooked into S3-compatible storage solutions might save you from reinventing the wheel.
Instead of downloading files from S3-compatible storage to a local service and then uploading them to OpenSearch, stream the file directly to OpenSearch. This would eliminate the need to hold the entire file in memory.
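One way to sketch that streaming idea: base64 concatenates cleanly when each encoded piece covers a multiple of 3 input bytes, so the file can be encoded incrementally and the resulting chunks fed to a chunked HTTP request body, with only a small buffer ever held in memory. (OpenSearch itself still receives the full base64 payload; this only removes the buffering on our side.)

```python
import base64
from typing import Iterable, Iterator

def b64_stream(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Base64-encode a byte stream incrementally.

    Each yielded piece covers a multiple of 3 input bytes, so the
    concatenated output is identical to base64-encoding the whole file,
    without ever holding the whole file in memory.
    """
    buf = b""
    for chunk in chunks:
        buf += chunk
        cut = len(buf) - len(buf) % 3
        if cut:
            yield base64.b64encode(buf[:cut])
            buf = buf[cut:]
    if buf:  # final piece may carry base64 padding
        yield base64.b64encode(buf)
```

The iterator returned by `b64_stream(s3_object_body)` can then be passed as a request body to an HTTP client that supports chunked transfer encoding.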
I would like to give it a try!
Since we’re already using S3 and OpenSearch, adding a new technology for file ingestion could make things more complicated than necessary. OpenSearch has a file ingestion API, and since S3 is widely used as a modern filesystem, it makes sense to take advantage of both.
By ingesting files directly from S3 URLs, we simplify the process, reduce the need for extra services, and make better use of what we already have in place. This approach is both efficient and scalable, without adding unnecessary complexity to our stack.
Hi @ksanderer,
Thank you for the insights! I agree that minimizing complexity is crucial. While I understand the advantages of using the OpenSearch file ingestion API and ingesting files directly from S3 URLs, I still believe that exploring automation options, such as S3 Event triggers in AWS or integrating third-party services in non-AWS environments, could enhance our workflow. These approaches might streamline the process even further and help reduce manual intervention.
I’m particularly interested in how we can implement streaming directly to OpenSearch, as it could optimize our memory usage and overall efficiency.
Hi @dblock! What does this comment signify? Can you please explain?
Sorry for the cryptic comment :) Check out https://github.com/opensearch-project/.github/pull/233, does this help?
Is your feature request related to a problem? Please describe
It's frustrating that we can't use S3 directly in ingestion pipelines. We must first load a file from S3-compatible storage, base64-encode it, and then push it to the OpenSearch API.
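To make the pain concrete, the current round trip looks roughly like this (a sketch: the `attachment` pipeline name and index layout are illustrative, and in the real setup the raw bytes come from an S3 `GetObject` call in the helper service):

```python
import base64
import json
import urllib.request

def build_attachment_doc(file_bytes: bytes) -> dict:
    # The entire file must be base64-encoded in memory (~33% overhead
    # on top of the raw bytes) before OpenSearch will accept it.
    return {"data": base64.b64encode(file_bytes).decode()}

def push_to_opensearch(file_bytes: bytes, os_url: str, index: str, doc_id: str) -> None:
    # file_bytes were already downloaded in full from S3-compatible storage.
    body = json.dumps(build_attachment_doc(file_bytes)).encode()
    req = urllib.request.Request(
        f"{os_url}/{index}/_doc/{doc_id}?pipeline=attachment",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```

So every file is held in memory twice (raw plus base64) before it ever reaches the cluster.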
It should be possible to use direct S3 links (e.g., s3://{bucket}/path_to_file.pdf) or provide an S3 key to ingest the file directly.
Describe the solution you'd like
Instead of fetching files from S3 and pushing them to OpenSearch using separate tools (e.g., a Python service):
We can push files directly to OpenSearch:
The idea is to use a predefined S3 bucket, similar to snapshot repositories that can be configured to use S3 storage.
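As a purely illustrative request shape (nothing here exists in OpenSearch today — the `s3_source` field, the `ingest-files` repository name, and the index are all made up for this sketch), the document would reference a preregistered bucket the way snapshot repositories do, instead of embedding base64 data:

```python
import json

# Hypothetical proposed API: the cluster resolves "repository" + "key"
# against a preconfigured S3 bucket and fetches the file server-side.
proposed_request = {
    "method": "PUT",
    "path": "/documents/_doc/report-1?pipeline=attachment",
    "body": {
        "s3_source": {
            "repository": "ingest-files",
            "key": "path_to_file.pdf",
        }
    },
}
print(json.dumps(proposed_request["body"], indent=2))
```

The client then never touches the file bytes at all; only the key crosses the wire.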
Related component
Other
Describe alternatives you've considered
Amazon offers an SQS-powered solution, but it's not available on other platforms like DigitalOcean OpenSearch.
We currently use a small Python service for this purpose. It receives an S3 key, fetches the file, and pushes the content to the OpenSearch cluster.
Additional context
No response