open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

New component: Obfuscate Processor #16600

Closed atingchen closed 1 year ago

atingchen commented 1 year ago

The purpose and use-cases of the new component

The new processor applies format-preserving encryption to obfuscate telemetry data. The goal is a processor that can obfuscate data and, in some scenarios, restore it; it is not meant to be a secure encryption scheme. Any code used to obfuscate telemetry data for research should be widely reviewed by the community and maintained by the community.

Related Cases

A user has exported trace and log data and wants to analyze it with diagnostic tools. Some attributes may contain user data, so the proposed processor can be used to obfuscate the traces and logs without destroying the format of the data.

Related to https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/13626

Example configuration for the component

obfuscation:
  encrypt_key: "some-32-byte-long-key-to-be-safe"
  encrypt_round: 128
  encrypt_attributes:
    - age
    - name
    - id
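
As a rough illustration of how these three settings could drive obfuscation, here is a minimal Python sketch. It is not the proposed processor's code: the function names `fpe_digits` and `obfuscate_attributes`, the HMAC-SHA256 round function, and the restriction to ASCII digit strings are all assumptions made for the example. It only shows that a keyed Feistel-style construction with a configurable round count can keep a value's length and character class while remaining reversible.

```python
import hashlib
import hmac


def _prf(key: bytes, round_no: int, text: str) -> int:
    # Keyed round function (assumption: HMAC-SHA256 over round number + half).
    msg = round_no.to_bytes(4, "big") + text.encode("ascii")
    return int.from_bytes(hmac.new(key, msg, hashlib.sha256).digest(), "big")


def fpe_digits(value: str, key: bytes, rounds: int, decrypt: bool = False) -> str:
    """Map a digit string to another digit string of the same length."""
    if rounds <= 0 or rounds % 2:
        raise ValueError("rounds must be a positive even number")
    if len(value) < 2 or not (value.isascii() and value.isdigit()):
        raise ValueError("value must be an ASCII digit string of length >= 2")
    u = len(value) // 2
    v = len(value) - u
    a, b = value[:u], value[u:]
    order = range(rounds - 1, -1, -1) if decrypt else range(rounds)
    for i in order:
        m = u if i % 2 == 0 else v  # width of the half rewritten in round i
        if decrypt:
            # Undo round i: subtract the round output, then swap back.
            c = (int(b) - _prf(key, i, a)) % 10**m
            a, b = str(c).zfill(m), a
        else:
            # Round i: add the round output modulo 10^m, then swap halves.
            c = (int(a) + _prf(key, i, b)) % 10**m
            a, b = b, str(c).zfill(m)
    return a + b


def obfuscate_attributes(attrs: dict, config: dict, decrypt: bool = False) -> dict:
    """Obfuscate (or restore) the configured attributes in a copy of attrs."""
    key = config["encrypt_key"].encode()
    out = dict(attrs)
    for name in config["encrypt_attributes"]:
        val = out.get(name)
        # This sketch handles digit strings only; other values are left as-is.
        if isinstance(val, str) and len(val) >= 2 and val.isascii() and val.isdigit():
            out[name] = fpe_digits(val, key, config["encrypt_round"], decrypt)
    return out


config = {
    "encrypt_key": "some-32-byte-long-key-to-be-safe",
    "encrypt_round": 128,
    "encrypt_attributes": ["age", "name", "id"],
}
attrs = {"age": "42", "id": "0012345", "http.method": "GET"}
obfuscated = obfuscate_attributes(attrs, config)
restored = obfuscate_attributes(obfuscated, config, decrypt=True)
```

In this sketch, `obfuscated["id"]` is still a seven-digit string, attributes not listed in `encrypt_attributes` are untouched, and `restored` equals the original map. Handling letters or mixed formats would need a larger alphabet, which the sketch does not attempt.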

Telemetry data types supported

Traces and logs.

Is this a vendor-specific component?

Sponsor (optional)

No response

Additional context

I have written an early implementation that obfuscates trace data using format-preserving encryption.

atoulme commented 1 year ago

Are you applying symmetric encryption? Or do you want to use asymmetric key/pair encryption? Why is encrypt-round used explicitly? Do you know what algorithms/key types you'd use here? Is there a way to deobfuscate the attributes as part of a similar processor later on? I bet this can be applied to metric attributes too, not just traces and logs. Does this apply to log bodies too?

atingchen commented 1 year ago

Are you applying symmetric encryption? Or do you want to use asymmetric key/pair encryption? Why is encrypt-round used explicitly? Do you know what algorithms/key types you'd use here? Is there a way to deobfuscate the attributes as part of a similar processor later on?

First of all, our goal is to provide a processor that can obfuscate data and restore it in some scenarios, rather than a secure encryption scheme. Second, the processor uses a Feistel cipher to implement format-preserving encryption. Wikipedia describes the Feistel cipher as follows:

A Feistel network uses a round function, a function which takes two inputs – a data block and a subkey – and returns one output of the same size as the data block. In each round, the round function is run on half of the data to be encrypted, and its output is XORed with the other half of the data. This is repeated a fixed number of times, and the final output is the encrypted data.

An important advantage of Feistel networks is that the entire operation is guaranteed to be invertible (that is, encrypted data can be decrypted), even if the round function is not itself invertible. So there is a way to deobfuscate the attributes.
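
The invertibility claim can be checked with a small Python sketch. This is only an illustration of a generic balanced Feistel network, not the code from my implementation: the function names, the HMAC-SHA256 round function, and the default of 16 rounds are assumptions chosen for the example. The round function is a truncated hash, so it is not invertible on its own, yet decryption still recovers the input.

```python
import hashlib
import hmac


def round_fn(half: bytes, key: bytes, round_no: int) -> bytes:
    # Non-invertible round function: truncated HMAC-SHA256 (an assumption).
    msg = round_no.to_bytes(4, "big") + half
    return hmac.new(key, msg, hashlib.sha256).digest()[: len(half)]


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def _split(block: bytes):
    # Halves must be equal in size and no longer than one SHA-256 digest.
    if len(block) == 0 or len(block) % 2 or len(block) > 64:
        raise ValueError("block must have an even length between 2 and 64 bytes")
    h = len(block) // 2
    return block[:h], block[h:]


def feistel_encrypt(block: bytes, key: bytes, rounds: int = 16) -> bytes:
    left, right = _split(block)
    for i in range(rounds):
        # Run the round function on one half, XOR into the other, then swap.
        left, right = right, xor(left, round_fn(right, key, i))
    return left + right


def feistel_decrypt(block: bytes, key: bytes, rounds: int = 16) -> bytes:
    left, right = _split(block)
    for i in reversed(range(rounds)):
        # Same round function, rounds applied in reverse order.
        left, right = xor(right, round_fn(left, key, i)), left
    return left + right


key = b"some-32-byte-long-key-to-be-safe"
ciphertext = feistel_encrypt(b"user1234", key)
plaintext = feistel_decrypt(ciphertext, key)
```

Here `ciphertext` has the same length as the input and `plaintext` equals `b"user1234"`. Decryption never inverts `round_fn`; it recomputes the same round output and XORs it away.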

atingchen commented 1 year ago

I bet this can be applied to metric attributes too, not just traces and logs. Does this apply to log bodies too?

From my point of view, this processor is not very useful for metrics. Its original purpose is to obscure sensitive information that may be related to users, and I don't think user-related attributes should appear on metrics at all, because they would cause high-cardinality problems. If the attributes are just general metric attributes, obfuscating them doesn't seem very useful. That said, the processor can be applied to metrics.

I don't currently process log bodies. Doing so would require regular expressions to locate the sensitive values inside the body text, which is a very expensive operation.
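
To make the cost concrete, a body-level version would look roughly like the Python sketch below. The names `obfuscate_body` and `DIGIT_RUN`, the digit-run pattern, and the string-reversal stand-in for the real obfuscation function are all hypothetical. The point is only that every log body has to be scanned by the pattern, and every match passed through the transform.

```python
import re
from typing import Callable

# Hypothetical pattern: runs of two or more ASCII digits.
DIGIT_RUN = re.compile(r"[0-9]{2,}")


def obfuscate_body(
    body: str,
    transform: Callable[[str], str],
    pattern: re.Pattern = DIGIT_RUN,
) -> str:
    # Scan the whole body and rewrite each match with the transform.
    return pattern.sub(lambda m: transform(m.group(0)), body)


# Stand-in transform for the example only; a real processor would call
# its format-preserving obfuscation function here.
masked = obfuscate_body("user 12345 logged in from 10.0.0.1", lambda s: s[::-1])
```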

atingchen commented 1 year ago

We can continue to discuss how to store the encryption key and the number of rounds.

github-actions[bot] commented 1 year ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

moh-osman3 commented 1 year ago

Hi @atingchen I noticed you marked this issue as completed. Was looking to help out by working on this processor but seems you might already have an implementation? Curious to know the state of things. Otherwise if you're no longer working on this issue I'm happy to take a look. Thanks!

atingchen commented 1 year ago

@moh-osman3 Hi Moh, you can give it a try. My issue has not been accepted, so the code has not been merged into the main branch. After researching, I have not found a particularly useful package for format-preserving encryption.