pinojs / pino

🌲 super fast, all natural json logger
http://getpino.io
MIT License
14.21k stars 874 forks source link

Applying regex to redact keys or value #1924

Closed mastersilv3r closed 6 months ago

mastersilv3r commented 7 months ago

The Problem

Given stringent statutory compliance, teams need to ensure that sensitive data is never leaked including via logs. Now consider a scenario where a developer may need to redact all values where any of the key or the value may match a regular expression.

Redact all words/patterns from logs, which match phone numbers (as keys or values) identified by the following formats - (123)-456-7890, 123-456-7890, or 123456-7890

Note: this is different from the nested keys at path feature currently supported by pino-redact

I believe this is not supported in Pino or fast-redact, because I could not find any documentation for the same. This could mean the following case for example

Example 1: by regex on key

Sometimes you may want to apply regex over a key to apply redact. This could mean the key is phone_number or my_phone_number or your_phone_number or any word which ends with string phone_number. To implement this redact (by key) instruction can continue to remain same as today except one minor difference - now an individual key can be be matched via regexes. For example:

redact: ['**.*phone_number'] # Pino supports exactly mentioned key or path names. Can we allow Pino to say all keys (whether leaf or non-leaf keys) at any depth of JSON object need to match this given regex. 

Example 2: by regex on regex on key and/or value

redact_by_value: [{
    'phone_number': '/^\(?(\d{3})\)?[- ]?(\d{3})[- ]?(\d{4})$/',    #apply regex on values for key === `'phone_number'`
     '.*phone_number': '/^\(?(\d{3})\)?[- ]?(\d{3})[- ]?(\d{4})$/', #apply regex on value for all keys in the JSON tree ending with `'phone_number'` 
     '.*': '/^\(?(\d{3})\)?[- ]?(\d{3})[- ]?(\d{4})$/',  #apply regex on all value universally,
}] 

Why is this feature potentially important? Because of strict compliances every app developer needs to ensure that they do not leak sensitive data even by mistake. The current Pino redact implementation only works when developer has intentionally redacted certain keys. But developer can do mistakes. Or a developer may be working with a dynamic set of incoming stream of data not in its control. There the developer would not like to take chance. Given their unique business context they may have a fair idea that which kind of value patterns may hold sensitive information: for example phone number patterns, email patterns, some key name patters etc.

Paths for a developer

I understand there is a performance implication but assuming this is a must feature for a modern app with compliance checks, and that developer does not want to take chance even by mistake to let certain regex matching data to pass through to logs in their telemetry backends hosted on cloud or where ever, what choices do we have to support the developer's journey?

Solution 1: Redact using Pino itself (via fast-redact)

Enhance fast-redact itself and do via that?

Pros: Simplest path for Pino users :-) One places to configure everything about log redaction. One single source of truth on what to redact what not to. And that lies at the source of the log signal emission itself.

Cons: I don't see any yet

Solution 2: Redact outside Pino but within Nodejs runtime

There are two choices thought both the choices may reuse fast-redact or another utility which provides regex based redact.

Then this lies outside of purview of fast-redact I feel. You may know better.

Solution 3: Solve it completely outside Nodejs runtime

Express this transformation (redaction) logic in a log exporter like Fluentbit or FluentD daemon-set. They have good support for log transformation and regex. This approach doesn't require changes in Pino or the application layer.

Pros:

Cons:

Conclusion

I would like to hear from the community and the maintainers

mcollina commented 7 months ago

I think the argument of developers making mistakes is not worth protecting against, because a phone number (or any known other PII) can appear inside any string block. In other terms, filtering on keys is not the correct way to solve this requirement.

Adding it to fast-redact would be very hard, but I would welcome such a PR. However, it should not slow down them current benchmarks.

My recommendation for this feature is to implement this as a transport so that it's outside of the hot path and there are reduced performance considerations (fast-redact predates transports).

I would recommend:

  1. validate if it's implementable in fast-redact
  2. if not, implement a new "filtering" transport/stream that works on the full log line, ideally as a string, to filter out those values.
jsumners commented 6 months ago

Closing due to lack of response.

github-actions[bot] commented 5 months ago

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.