opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[FEATURE] Eventually Enriched Data - asynchronous ingest pipelines #9591

Open hijakk opened 10 months ago

hijakk commented 10 months ago

Copying from https://github.com/opensearch-project/ml-commons/issues/1162 as it is applicable to opensearch more broadly. Splitting into this feature request and https://github.com/opensearch-project/OpenSearch/issues/9590

Problem definition

Ingest pipelines supporting both locally hosted and remotely deployed models, as discussed here, are relatively new OpenSearch features that provide flexibility in hooking in resource-intensive and/or remotely available processes. From what I can tell, these are all intended to be synchronous: a record that's meant to go through an ingest pipeline must pass through that pipeline before it is exposed.
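For context, this is roughly what such a pipeline looks like today. The sketch below uses the `text_embedding` processor from the neural-search feature; the pipeline name, model ID, and field names are placeholders, not real values:

```json
PUT _ingest/pipeline/nlp-enrichment-pipeline
{
  "description": "Synchronous enrichment: vectorize the text field at ingest time",
  "processors": [
    {
      "text_embedding": {
        "model_id": "my-deployed-model-id",
        "field_map": {
          "text": "text_embedding"
        }
      }
    }
  ]
}
```

Every document indexed through this pipeline blocks on the model call completing, which is the coupling this feature request wants to relax.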

Because these are resource-intensive processes, whether for local compute or remote calls, there's a risk of the data ingest process being significantly disrupted by unexpected spikes in volume, or by remote services being unreachable.

More broadly, and applicable to OpenSearch in general, if a complex ingest pipeline stalls for any reason, it can significantly disrupt cluster operations.

From a data engineering/ETL standpoint, this is a double-edged sword: while it's very nice for OpenSearch to support a larger range of resource-intensive data transformations natively, off-nominal use cases could result in data loss or delays as upstream ETL processes attempt to push more data through particular pipelines than OpenSearch can handle.

Assuming ingest pipeline circuit breakers are implemented, we're left with a new challenge - inconsistently enriched data.

Solution

Asynchronous ingest pipeline application - allow application of ingest pipeline logic to already indexed data

When it comes to, say, text vectorization, the enrichment is certainly valuable, but data velocity can vary significantly. If velocity were to spike and saturate ingest resources, then, per the ingest pipeline circuit breaker feature request in https://github.com/opensearch-project/OpenSearch/issues/9590, we would automatically fall back to a simpler ingest pipeline until load has dropped.

However, now we've got a mixture of records that have been through the "full" ingest pipeline, and those that haven't.

A process that can automatically identify data that missed the "full" ingest pipeline and apply it after the fact would permit either automatic or manually triggered reprocessing, fully enriching records that didn't get it on initial ingest. Spreading out the ingest workload so that data is "eventually enriched" would smoothly and automatically load-balance heavy ingest workloads while still completing all necessary enrichments, without delaying when data is surfaced or causing queues to build up in upstream ETL processes.
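A manual approximation of this is possible today with `_update_by_query`, which accepts a `pipeline` query parameter: find documents that are missing the enriched field and run them back through the full pipeline. The index, pipeline, and field names below are placeholders matching the hypothetical vectorization example:

```json
POST my-index/_update_by_query?pipeline=nlp-enrichment-pipeline
{
  "query": {
    "bool": {
      "must_not": {
        "exists": { "field": "text_embedding" }
      }
    }
  }
}
```

The feature request is essentially for OpenSearch to schedule and throttle this kind of reprocessing automatically, rather than leaving the triggering, batching, and retry logic to the operator.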

Other alternatives

Other approaches would seem to come down to a lot of scripting in the ETL process, edge-case handling, robust queue systems, etc. This can be a non-trivial task depending on data and ETL process complexity.

sandervandegeijn commented 10 months ago

I would like this as well; especially for large/spiky ingest volumes this could decouple things. For a SIEM application it would be fine if data is added at a later time, as long as it's added at all (within a reasonable timeframe).