opensearch-project / data-prepper

Data Prepper is a component of the OpenSearch project that accepts, filters, transforms, enriches, and routes data at scale.
https://opensearch.org/docs/latest/clients/data-prepper/index/
Apache License 2.0

Index Mapping Updates Through OSIS Pipeline Configuration YAML #5038

Open bircpark opened 1 week ago

bircpark commented 1 week ago

Is your feature request related to a problem? Please describe.

Currently, updating the mappings for an index behind an OSIS pipeline seems to require either manual intervention or downtime. This applies both to adding subfields to an already existing field and to adding a brand-new field.

Existing Configuration For Mapping

"field_name": {
    "type": "text",
    "fields": {
        "keyword": {
            "type": "keyword"
        }
    }
}

New Configuration For Mapping

"field_name": {
    "type": "text",
    "fields": {
        "english": {
            "type": "text",
            "analyzer": "english"
        },
        "keyword": {
            "type": "keyword"
        }
    }
}

After the configuration change, the english subfield does not show up in the cluster, and using it requires downtime or manual changes.

Describe the solution you'd like

It would be nice for OSIS pipelines to be able to update index mappings when the mappings are updated in the pipeline configuration. Once the mapping update is made, the pipeline could issue an update_by_query call, or something similar, to populate the new fields.

Describe alternatives you've considered (Optional)

a) Make a manual change to the mapping, then invoke the update_by_query API to backfill records.
b) Take some downtime to stop the pipeline, delete the index, then restart the pipeline to re-sync data.

Additional context

The suggested solution is mainly concerned with updating subfields, because update_by_query will only populate subfields of already existing fields and won't work for brand-new fields introduced to the mapping. For entirely new fields, you would need to run something else (maybe a Glue job) to update the documents reliably.
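To make the subfield versus brand-new-field distinction concrete, here is a minimal Python sketch that compares the two mapping fragments from this issue. The helper names (`added_subfields`, `new_fields`) are hypothetical and are not part of Data Prepper or OSIS; the sketch only inspects the top-level `fields` entry of each field.

```python
# Mapping fragments copied from the issue description.
EXISTING = {
    "field_name": {
        "type": "text",
        "fields": {"keyword": {"type": "keyword"}},
    }
}

DESIRED = {
    "field_name": {
        "type": "text",
        "fields": {
            "english": {"type": "text", "analyzer": "english"},
            "keyword": {"type": "keyword"},
        },
    }
}


def added_subfields(current, desired):
    """Return {field: [new subfield names]} for fields present in both mappings."""
    added = {}
    for name, new_def in desired.items():
        old_def = current.get(name)
        if old_def is None:
            continue  # brand-new field; reported by new_fields() instead
        new_subs = set(new_def.get("fields", {})) - set(old_def.get("fields", {}))
        if new_subs:
            added[name] = sorted(new_subs)
    return added


def new_fields(current, desired):
    """Return the names of fields that exist only in the desired mapping."""
    return sorted(set(desired) - set(current))


print(added_subfields(EXISTING, DESIRED))  # {'field_name': ['english']}
print(new_fields(EXISTING, DESIRED))       # []
```

For the example in this issue, only the subfield case applies: `english` is new under an existing field, and no top-level field is new.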

dlvenable commented 6 days ago

This partially depends on #973, but it would also need the ability to update the OpenSearch index.

dlvenable commented 6 days ago

@bircpark, what source are you using in this case?

bircpark commented 5 days ago

@dlvenable, my source is DynamoDB, using the zero-ETL pipeline integration.

dlvenable commented 12 hours ago

Based on my understanding, the ask here is for Data Prepper to make a call to PUT <index>/_mapping to update the actual mappings file based on the user-defined input. This will allow modifications to an existing index as new fields are added.
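As a rough illustration of what such a call could look like, the sketch below builds (but does not send) a `PUT <index>/_mapping` request and the follow-up `_update_by_query` backfill mentioned in the original request. The endpoint URL, the index name, and both function names are assumptions made for this example, and authentication (for instance request signing on a managed domain) is deliberately left out. This is not how Data Prepper implements anything today.

```python
import json
from urllib.request import Request

# Assumed values for illustration only.
BASE_URL = "https://localhost:9200"
INDEX = "my-index"

# Updated field definition from the issue, including the new "english" subfield.
PROPERTIES = {
    "field_name": {
        "type": "text",
        "fields": {
            "english": {"type": "text", "analyzer": "english"},
            "keyword": {"type": "keyword"},
        },
    }
}


def build_mapping_update(base_url, index, properties):
    """Build a PUT <index>/_mapping request carrying the given properties."""
    body = json.dumps({"properties": properties}).encode("utf-8")
    return Request(
        url=f"{base_url}/{index}/_mapping",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def build_backfill(base_url, index):
    """Build a POST <index>/_update_by_query request that re-indexes all documents in place."""
    body = json.dumps({"query": {"match_all": {}}}).encode("utf-8")
    return Request(
        url=f"{base_url}/{index}/_update_by_query?conflicts=proceed",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


mapping_request = build_mapping_update(BASE_URL, INDEX, PROPERTIES)
backfill_request = build_backfill(BASE_URL, INDEX)
print(mapping_request.get_method(), mapping_request.full_url)
print(backfill_request.get_method(), backfill_request.full_url)
```

Passing either request to `urllib.request.urlopen` would send it. The mapping update has to succeed before the backfill runs, since the backfill only picks up subfields that already exist in the index mapping.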

bircpark commented 11 hours ago

Yes, that is correct.