Applying regex to redact keys or value

mastersilv3r commented 7 months ago

The Problem

Given stringent statutory compliance, teams need to ensure that sensitive data is never leaked including via logs. Now consider a scenario where a developer may need to redact all values where any of the key or the value may match a regular expression.

Redact all words/patterns from logs, which match phone numbers (as keys or values) identified by the following formats - (123)-456-7890, 123-456-7890, or 123456-7890

Note: this is different from the nested keys at path feature currently supported by pino-redact

I believe this is not supported in Pino or fast-redact, because I could not find any documentation for the same. This could mean the following case for example

Example 1: by regex on key

Sometimes you may want to apply regex over a key to apply redact. This could mean the key is phone_number or my_phone_number or your_phone_number or any word which ends with string phone_number. To implement this redact (by key) instruction can continue to remain same as today except one minor difference - now an individual key can be be matched via regexes. For example:

redact: ['**.*phone_number'] # Pino supports exactly mentioned key or path names. Can we allow Pino to say all keys (whether leaf or non-leaf keys) at any depth of JSON object need to match this given regex.

Example 2: by regex on regex on key and/or value

redact_by_value: [{
    'phone_number': '/^\(?(\d{3})\)?[- ]?(\d{3})[- ]?(\d{4})$/',    #apply regex on values for key === `'phone_number'`
     '.*phone_number': '/^\(?(\d{3})\)?[- ]?(\d{3})[- ]?(\d{4})$/', #apply regex on value for all keys in the JSON tree ending with `'phone_number'` 
     '.*': '/^\(?(\d{3})\)?[- ]?(\d{3})[- ]?(\d{4})$/',  #apply regex on all value universally,
}]

Why is this feature potentially important? Because of strict compliances every app developer needs to ensure that they do not leak sensitive data even by mistake. The current Pino redact implementation only works when developer has intentionally redacted certain keys. But developer can do mistakes. Or a developer may be working with a dynamic set of incoming stream of data not in its control. There the developer would not like to take chance. Given their unique business context they may have a fair idea that which kind of value patterns may hold sensitive information: for example phone number patterns, email patterns, some key name patters etc.

Paths for a developer

I understand there is a performance implication but assuming this is a must feature for a modern app with compliance checks, and that developer does not want to take chance even by mistake to let certain regex matching data to pass through to logs in their telemetry backends hosted on cloud or where ever, what choices do we have to support the developer's journey?

Solution 1: Redact using Pino itself (via fast-redact)

Enhance fast-redact itself and do via that?

Pros: Simplest path for Pino users :-) One places to configure everything about log redaction. One single source of truth on what to redact what not to. And that lies at the source of the log signal emission itself.

Cons: I don't see any yet

Solution 2: Redact outside Pino but within Nodejs runtime

There are two choices thought both the choices may reuse fast-redact or another utility which provides regex based redact.

Before calling Pino: First do redaction in memory and then call Pino logging API on the redacted data.
After calling Pino: Consume from pino transport. Redact the data stream. Pipe the redacted stream to an exporter like Fluentbit which will forward to the collector.
Pro: Reusability: Can use this utility in any Nodejs project, for any purpose not limited to logging. Con: Would mean maintaining another dependency aside from Pino for logging purposes.

Then this lies outside of purview of fast-redact I feel. You may know better.

Solution 3: Solve it completely outside Nodejs runtime

Express this transformation (redaction) logic in a log exporter like Fluentbit or FluentD daemon-set. They have good support for log transformation and regex. This approach doesn't require changes in Pino or the application layer.

Pros:

Modular handling of redaction at Collector layer. Applications are black boxes for the collector's redaction layer in this approach.
This decoupled approach helps bring compliance to both legacy and new applications of the organisation.
- These applications may or may not be doing redaction before logging.
- These applications might as well be exporting logs in any formats - OTEL format or any format.
- One can apply this approach to process and transform log data from brown field application logs without changing them
This not only works for Nodejs/Pino runtimes but any other language and libraries as well.

Cons:

Not the simplest solution
More moving pieces in overall implementation and hence more complexity of setup and maintenance.
The source of log stream is still sending sensitive data to the transport stream to the collector. Developer has additional responsibility to ensure there is no persistence of sensitive data on the host OS file system.
Organisation has to maintain code and solution at two places. When the case is more of a greenfield project, it is perhaps easiest and also safe to control and redact output at source (application runtime) itself.

Conclusion

I would like to hear from the community and the maintainers

[ ] How you feel about the need of this feature for logging?
[ ] What about the feasibility to solve this via pino/fast-redact?
[ ] The performance overhead?
[ ] Which of the three paths mentioned below would you recommend or definitely not recommend? :-)

mcollina commented 7 months ago

I think the argument of developers making mistakes is not worth protecting against, because a phone number (or any known other PII) can appear inside any string block. In other terms, filtering on keys is not the correct way to solve this requirement.

Adding it to fast-redact would be very hard, but I would welcome such a PR. However, it should not slow down them current benchmarks.

My recommendation for this feature is to implement this as a transport so that it's outside of the hot path and there are reduced performance considerations (fast-redact predates transports).

I would recommend:

validate if it's implementable in fast-redact
if not, implement a new "filtering" transport/stream that works on the full log line, ideally as a string, to filter out those values.

jsumners commented 6 months ago

Closing due to lack of response.

github-actions[bot] commented 5 months ago

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

pinojs / pino