opensearch-project / data-prepper

Data Prepper is a component of the OpenSearch project that accepts, filters, transforms, enriches, and routes data at scale.
https://opensearch.org/docs/latest/clients/data-prepper/index/
Apache License 2.0
255 stars 188 forks

Data Prepper custom plugin for OpenSearch Ingestion service #4849

Open soghoyanaws opened 3 weeks ago

soghoyanaws commented 3 weeks ago

Is your feature request related to a problem? Please describe.

In our DynamoDB (DDB) table, we have documents with fields like this:

delivered_at|d87e56e8-f52f-474f-ad18-155b2a08f680: 1722622017.993797

Here, the string after the | is a random UUID and the value is a float representing a timestamp. We'd like to extract the value from this field in DDB and index it in OpenSearch as simply delivered_at: 1722622017.993797

Describe the solution you'd like

Is it possible to plug in custom code that will rename 'delivered_at|d87e56e8-f52f-474f-ad18-155b2a08f680: 1722622017.993797' to 'delivered_at: 1722622017.993797'?
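
For illustration only, the requested rename can be sketched outside Data Prepper in plain Python. The helper name `strip_key_suffix` is hypothetical and is not part of any Data Prepper API. The sketch assumes the dynamic part of the key always follows the first | character.

```python
def strip_key_suffix(record: dict, separator: str = "|") -> dict:
    """Return a copy of record where each key is cut at the first separator.

    Keys without the separator are kept unchanged. If two keys share the
    same prefix, the later one overwrites the earlier one.
    """
    return {key.split(separator, 1)[0]: value for key, value in record.items()}


# Example item shaped like the DDB field described above.
item = {"delivered_at|d87e56e8-f52f-474f-ad18-155b2a08f680": 1722622017.993797}
renamed = strip_key_suffix(item)
print(renamed)
```

The value is left untouched; only the key changes, which is the behavior asked for here.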

Describe alternatives you've considered (Optional)

There is a rename_keys processor (https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/rename-keys/), but it currently handles only static key names. With the config below, the pipeline can rename the key, but the rename stops working as soon as the UUID changes:

    processor:
      - rename_keys:
          entries:
            - from_key: "delivered_at|d87e56e8-f52f-474f-ad18-155b2a08f680"
              to_key: "delivered_at"

We also tried a number of different configurations with the grok processor, and none of them worked. The challenge is that some pipeline processors support Pipeline Expressions while others do not, and it is not well documented which ones do.


dlvenable commented 2 weeks ago

Thank you @soghoyanaws for raising this issue. If I understand the problem, the key itself has no well-defined name that we can use. Correct?

If so, we need to support mutating the key name more dynamically. The number of possible key names could be just as varied as the possible values in them. In this case it is a key that starts with a value. In other situations it may be a JSON string.

soghoyanaws commented 2 weeks ago

Hi @dlvenable, correct, the key name is dynamic since it depends on a UUID.

sdhull commented 1 day ago

Thanks for opening this issue for us @soghoyanaws (this is a problem from my company 😬). I personally would have liked to rename these keys but unfortunately that would be prohibitively expensive due to the size of the DynamoDB table.

Using regexp matching for target key names in processors across the board would make all processors much more powerful. This is something I have wished for in many cases.
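
A minimal sketch of what regex-based key matching could look like, written in plain Python rather than as a Data Prepper processor. The function `rename_keys_matching`, its parameters, and the `UUID` pattern are assumptions made for illustration; no existing processor option is implied.

```python
import re

# Assumed shape of the dynamic suffix: a lowercase hyphenated UUID.
UUID = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"


def rename_keys_matching(record: dict, from_key_pattern: str, to_key: str) -> dict:
    """Rename every key that fully matches from_key_pattern.

    to_key is a replacement template and may use backreferences such as
    \\1. Keys that do not match are copied through unchanged.
    """
    pattern = re.compile(from_key_pattern)
    renamed = {}
    for key, value in record.items():
        match = pattern.fullmatch(key)
        renamed[match.expand(to_key) if match else key] = value
    return renamed


event = {
    "delivered_at|d87e56e8-f52f-474f-ad18-155b2a08f680": 1722622017.993797,
    "status": "ok",
}
result = rename_keys_matching(event, r"(delivered_at)\|" + UUID, r"\1")
print(result)
```

Matching the whole key with `fullmatch` keeps unrelated keys that merely contain a similar substring from being renamed by accident.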