opensearch-project / data-prepper

Data Prepper is a component of the OpenSearch project that accepts, filters, transforms, enriches, and routes data at scale.
https://opensearch.org/docs/latest/clients/data-prepper/index/
Apache License 2.0

Ability to one way hash specific attributes in payload #4602

Open mishavay-aws opened 3 months ago

mishavay-aws commented 3 months ago

Is your feature request related to a problem? Please describe.

As a user of the Data Prepper obfuscate processor, I would like the option to hash sensitive data fields, so that the same field value always produces the same predictable hash, which can be searched on without revealing the data in clear text.

Note: This is different from masking, which is available as part of the existing obfuscate processor.

Describe the solution you'd like

Update the existing obfuscate processor (or create a new one) so that it can take an optional seed/salt and produce a one-way hash using common hash functions such as the SHA-* family and others.

Describe alternatives you've considered (Optional)

There is a possibility of invoking a remote function, but it would be expensive and not performant for processing/hashing messages at volume.


kkondaka commented 3 months ago

@mishavay-aws could you provide more details on this? For example, which hash functions would you like to be supported: MD5, SHA-2, SHA-3? All of them, or are just some of them good to start with?

kkondaka commented 3 months ago

@mishavay-aws it would help greatly if you could provide an example too: how the hash key would be provided in the configuration, and so on.

mishavay-aws commented 3 months ago

I suggest starting with SHA-512 and SHA-256, as these are the most frequently used. Others can be added based on feedback from the community and supporting use cases.

The values of certain fields cannot be displayed in clear text when messages are stored in either S3 or OpenSearch, yet those fields still need to be aggregated, grouped, or filtered by the same token value.

The call to the SHA function needs to take two inputs: the string value to be hashed, and a seed/salt value that is either set in the configuration or taken dynamically from an attribute in the document/message (e.g. master record ID, datasource ID).
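
To make the request concrete, here is a minimal sketch of the kind of salted one-way hash described above, using the JDK's `MessageDigest`. The class name `SaltedHashSketch`, the `hash` helper, and the choice to prepend the salt to the value are all assumptions for illustration only; none of this is existing Data Prepper code, and the salt would come from either the configuration or a field of the event.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration; not part of Data Prepper.
public class SaltedHashSketch {

    // Hashes salt + value with the given algorithm (e.g. "SHA-256" or "SHA-512")
    // and returns the digest as lowercase hex.
    public static String hash(String algorithm, String salt, String value) {
        try {
            MessageDigest digest = MessageDigest.getInstance(algorithm);
            // Assumption: the salt is simply prepended to the value before hashing.
            digest.update(salt.getBytes(StandardCharsets.UTF_8));
            byte[] hashed = digest.digest(value.getBytes(StandardCharsets.UTF_8));

            StringBuilder hex = new StringBuilder(hashed.length * 2);
            for (byte b : hashed) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("Unsupported hash algorithm: " + algorithm, e);
        }
    }

    public static void main(String[] args) {
        // The salt could be a static configured value or an event attribute
        // such as a datasource ID (both values here are made up).
        String first = hash("SHA-256", "datasource-42", "alice@example.com");
        String second = hash("SHA-256", "datasource-42", "alice@example.com");

        // Same salt and value give the same token, so it can be grouped/filtered on.
        System.out.println(first.equals(second)); // prints "true"
        // SHA-512 yields 64 bytes, i.e. 128 hex characters.
        System.out.println(hash("SHA-512", "datasource-42", "alice@example.com").length()); // prints "128"
    }
}
```

Because the output depends only on the algorithm, salt, and value, the same field value maps to the same token across S3 and OpenSearch, while a different salt per datasource would keep tokens from being correlated across datasources.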

@kkondaka Let me know if I can provide additional clarifications.

dlvenable commented 3 months ago

We made the obfuscate processor accept different plugins for obfuscation. So we could add a new action to the obfuscate processor to do this.

https://github.com/opensearch-project/data-prepper/blob/ad92aa25b80a4951176f8c45c475dc6d704c2e75/data-prepper-plugins/obfuscate-processor/src/main/java/org/opensearch/dataprepper/plugins/processor/obfuscation/ObfuscationProcessorConfig.java#L30-L31

@mishavay-aws, would you be interested in making a contribution to do this? You would need to implement a new ObfuscationAction. You can follow the pattern shown for mask here:

https://github.com/opensearch-project/data-prepper/blob/main/data-prepper-plugins/obfuscate-processor/src/main/java/org/opensearch/dataprepper/plugins/processor/obfuscation/action/MaskAction.java

mishavay-aws commented 3 months ago

I will look into this @dlvenable.