opensearch-project / data-prepper

OpenSearch Data Prepper is a component of the OpenSearch project that accepts, filters, transforms, enriches, and routes data at scale.
https://opensearch.org/docs/latest/clients/data-prepper/index/
Apache License 2.0

Implement Post-Processor Hooks for Refreshing Index after Ingestion #4885

Open SavvasSriAnushaVeeramachineni opened 2 months ago

SavvasSriAnushaVeeramachineni commented 2 months ago

Is your feature request related to a problem? Please describe.

Currently, our ETL job runs every 30 minutes and writes a file to S3, which triggers an OpenSearch ingestion pipeline. Because the ETL completion time varies, it is difficult to choose a refresh_interval at the index level that works consistently for all scenarios.

As a result, the data is not available for search until some time after ingestion into OpenSearch has completed.
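For context, the index-level setting mentioned above is changed through the index settings API (`PUT /<index>/_settings`). The following is a minimal sketch of building that request in Python. The base URL, the index name `etl-index`, the `30s` value, and the helper name `build_refresh_interval_request` are assumptions for illustration, and authentication is omitted.

```python
import json
import urllib.request


def build_refresh_interval_request(
    base_url: str, index_name: str, interval: str
) -> urllib.request.Request:
    """Build (but do not send) a PUT request that sets index.refresh_interval."""
    # Body shape used by the index settings API.
    body = json.dumps({"index": {"refresh_interval": interval}}).encode("utf-8")
    url = f"{base_url.rstrip('/')}/{index_name}/_settings"
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical endpoint and index name, for illustration only.
request = build_refresh_interval_request(
    "https://localhost:9200", "etl-index", "30s"
)
# To send it: urllib.request.urlopen(request, timeout=10)
```

A fixed interval like this is what is hard to tune when the ETL completion time varies, which is the motivation for the request below.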

Describe the solution you'd like

We propose adding a new configuration option for HTTP post-processor hooks to the Data Prepper pipeline definition. It would let us specify an HTTP POST endpoint and call the refresh API (/index-name/_refresh) after pipeline ingestion has completed.

Currently, the processors available in the pipeline definition only run before data is ingested into OpenSearch.
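To illustrate the call such a hook would make, here is a minimal sketch of a `POST /<index>/_refresh` request in Python. This is not an existing Data Prepper feature. The function names, base URL, and index name are assumptions for illustration, and authentication is omitted.

```python
import urllib.request
from urllib.parse import quote


def build_refresh_request(base_url: str, index_name: str) -> urllib.request.Request:
    """Build (but do not send) a POST request to the index's _refresh endpoint."""
    # Percent-encode the index name so it is safe as a single path segment.
    url = f"{base_url.rstrip('/')}/{quote(index_name, safe='')}/_refresh"
    return urllib.request.Request(url, method="POST")


def refresh_index(base_url: str, index_name: str, timeout: float = 10.0) -> int:
    """Send the refresh request and return the HTTP status code."""
    request = build_refresh_request(base_url, index_name)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Hypothetical endpoint and index name, for illustration only.
request = build_refresh_request("https://localhost:9200", "etl-index")
```

In the proposed design, the pipeline would issue this call once after the sink has finished writing, rather than the caller polling for completion.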

Describe alternatives you've considered (Optional)

Provide a refresh option in the pipeline's index settings that would internally refresh the index after the pipeline has run.

Additional context

N/A

dlvenable commented 2 months ago

@SavvasSriAnushaVeeramachineni, thank you for opening this issue. I understand that you'd like Data Prepper to automatically call the _refresh API for every updated index.

Can you clarify what will try making that call? Are you using S3-scan? Do you want the completion of the scan to trigger the refresh?

As a result of this behavior - there is a delay in the data being available even though the ingestion to OpenSearch is complete.

What is your delay?

Also, have you tried using the default refresh_interval to let OpenSearch handle it?

SavvasSriAnushaVeeramachineni commented 2 months ago

@dlvenable Thanks for replying! Regarding your questions:

Can you clarify what will try making that call? Are you using S3-scan? Do you want the completion of the scan to trigger the refresh?

What is your delay?

Also, have you tried using the default refresh_interval to let OpenSearch handle it?

SavvasSriAnushaVeeramachineni commented 1 month ago

@dlvenable Do you have any suggestions or a solution for this requirement?

Is there any plan to pick up this enhancement request in the near future?

SavvasSunilBelakeri commented 1 week ago

Hi @dlvenable, good day. Has there been any traction on the above use case?