opensearch-project / data-prepper

Data Prepper is a component of the OpenSearch project that accepts, filters, transforms, enriches, and routes data at scale.
https://opensearch.org/docs/latest/clients/data-prepper/index/
Apache License 2.0

Support Limiting Array Entries in Hash-Map Values using Data Prepper Plugin #4858

Open venkachw opened 1 month ago

venkachw commented 1 month ago

Is your feature request related to a problem? Please describe.

I have a JSON object in S3 with two fields, shown below, and I want to limit the number of entries in those fields when uploading it to OpenSearch through an ingestion pipeline.

{
  "file": { # file is a hash map; its keys are predefined
    "read": ["value1", "value2", ...... "value1000"],
    "write": ["value1", "value2", ...... "value1000"],
    "delete": ["value1", "value2", ...... "value1000"]
  },
  "scan": { # scan is also a hash map; its keys are dynamic, not fixed
    "8080": ["1.2.3.4:8080", "5.6.7.8:8080", ..... "value1000"],
    "450": ["1.2.34.5:450", .... "value1000"]
  }
}

I want to limit each of these attributes (file.read, file.write, file.delete, scan.key1, scan.key2) to its first 10 elements, because Elasticsearch has latency issues when querying large arrays.

Describe the solution you'd like

After processing with a Data Prepper plugin, the data should look like this:

file.read: ["value1",... "value10"],
file.write: ["value1",... "value10"],
file.delete: ["value1",... "value10"],
scan.8080: ["value1",... "value10"],
scan.450: ["value1",... "value10"]
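
As a rough illustration of the requested behavior (this is not an existing Data Prepper processor), the truncation could be sketched in Python as below. The function name `limit_array_entries` and its parameters are hypothetical. Because it walks whatever keys each map contains, it covers the fixed keys under file and the dynamic keys under scan alike.

```python
from typing import Any, Dict, List


def limit_array_entries(
    event: Dict[str, Any], map_keys: List[str], max_entries: int = 10
) -> Dict[str, Any]:
    """Return a copy of `event` in which every list found directly under
    the named maps is cut down to its first `max_entries` elements."""
    result = dict(event)
    for map_key in map_keys:
        nested = event.get(map_key)
        if not isinstance(nested, dict):
            continue  # field missing or not a map: leave it untouched
        result[map_key] = {
            key: value[:max_entries] if isinstance(value, list) else value
            for key, value in nested.items()
        }
    return result


# Sample event shaped like the one in this issue (values are placeholders).
values = [f"value{i}" for i in range(1, 1001)]
event = {
    "file": {"read": list(values), "write": list(values), "delete": list(values)},
    "scan": {"8080": list(values), "450": list(values)},
}

limited = limit_array_entries(event, ["file", "scan"], max_entries=10)
print(len(limited["file"]["read"]))  # 10
print(len(limited["scan"]["450"]))   # 10
```

The sketch returns a new dictionary and leaves the input event unchanged.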

dlvenable commented 1 month ago

@venkachw, as maps are generally unordered, how would you propose choosing the order for the "first N" values?

venkachw commented 1 month ago

@dlvenable, in our case, order is not a priority.

dlvenable commented 4 weeks ago

@venkachw, so then it would pick an arbitrary N values?

One option would be to update the select_entries processor to support selecting specific keys within an object. Perhaps it would have a new select_from option that determines where to select entries from.

select_entries:
  select_from: file
  include_keys: [read, write, delete]
select_entries:
  select_from: scan
  include_keys: [8080, 450]

Would this meet your use-case? Or do you just want an arbitrary limit on certain maps and arrays?
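
To make the difference between the two options concrete, here is a minimal Python sketch of what the proposed select_from option might do. The semantics are an assumption (select_from does not exist today): only the listed keys inside the named object are kept, and other top-level fields are left alone. The function name `select_entries_from` and the extra "execute" key are hypothetical.

```python
from typing import Any, Dict, List


def select_entries_from(
    event: Dict[str, Any], select_from: str, include_keys: List[str]
) -> Dict[str, Any]:
    """Assumed semantics: keep only `include_keys` inside event[select_from]."""
    nested = event.get(select_from)
    if not isinstance(nested, dict):
        return dict(event)  # nothing to select from
    wanted = set(include_keys)
    kept = {key: value for key, value in nested.items() if key in wanted}
    return {**event, select_from: kept}


values = [f"value{i}" for i in range(1, 1001)]
event = {
    "file": {
        "read": list(values),
        "write": list(values),
        "execute": ["value1"],  # hypothetical key that is not in include_keys
    },
}

selected = select_entries_from(event, "file", ["read", "write", "delete"])
print(sorted(selected["file"]))       # ['read', 'write']
print(len(selected["file"]["read"]))  # 1000
```

Under these assumptions, key selection drops unlisted keys but leaves each array at its full length, and the keys have to be known in advance.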

venkachw commented 4 weeks ago

My use case is to apply an arbitrary limit to the values of certain maps and arrays. The scan field has dynamic keys, so I need to limit the values in that case as well.