log: sensitive data redaction

bobheadxi commented 2 years ago

Raised by @varungandhi-src :

A couple of days back, I'd flagged some security related concerns with how we aren't very careful around logging api.RepoName values today. Swift's logging is an interesting bit of prior art in this space:
When you include an interpolated string or custom object in your message, the system redacts the value of that string or object by default. This behavior prevents the system from leaking potentially user-sensitive information in the log files, such as the user’s account information. If the data doesn’t contain sensitive information, change the privacy option of that value when logging the information. In the following code example, the system redacts the account information in the first log message, but displays the user’s selection in the second log message:
logger.log("Paid with bank account \(accountNumber)")   // Redacted!
logger.log("Ordered smoothie \(smoothieName, privacy: .public)")  // Visible
If we had such a solution, I think it would significantly reduce the risk of accidentally leaking private information in logs.

Even if having out-of-the-box support for redaction is out-of-scope for now, I think we should have this in mind as something that needs to be tackled in the future. I skimmed through zap's docs and I didn't see any mention of redaction. So if we migrate to zap and that makes implementing redaction support difficult, that would be an undesirable outcome IMO.

For a broad and generalized implementation, looks like it's not super easy but has been done (by none other than Matt Holt):

Elegant way to filter fields? uber-go/zap#750

https://github.com/caddyserver/caddy/blob/master/modules/logging/filterencoder.go

Similar reference: uber-go/zap#547 (comment)

all assuming we go with Zap, to which there has been no objections so far

bobheadxi commented 2 years ago

Alternatively, we now re-export everything via https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/lib/log/fields.go - we could simply, for example, replace log.String with an implementation that does some redaction, and introduce log.PublicString if we want an equivalent to the Swift logging example raised

varungandhi-src commented 2 years ago

we could simply, for example, replace log.String with an implementation that does some redaction, and introduce log.PublicString if we want an equivalent to the Swift logging example raised

From a long-term perspective, something like that would be nice-to-have IMO. From a short-term perspective, I'm a little concerned that such a big change would negatively affect incident response, because too many strings would be redacted.

Here's an alternate idea for a more staged rollout.

Introduce two types: (say in a new privacy package)
- A Privacy enum with 3 cases: Private = 0, Unknown = 1, and Public = 2.
- A type Text struct { data string; privacy Privacy }. The fields are kept (heh) private to avoid direct modification from outside, you need to use helper functions. The data field will be available through a helper function with "unchecked" or similar in its name indicating some level of danger.
Replace all log.String(key, expr) calls with log.Text(key, privacy.NewText(expr, privacy.Unknown)). Initially, privacy.Unknown text values would not be redacted.

Tweak the log.String API in a way that documents why it is not available, and prevents someone from "accidentally" introducing it in the future. Here's a strawman idea:

type UseFuncTextInstead { unused struct{} }

// String is a function that should not be implemented. It's presence would increase
// the risk of accidentally log user-private data, making it visible to a site-admin.
//
// Use Text instead.
func String(_ string, _ UseFuncTextInstead) {
    panic("do not implement this!")
}

Gradually audit the usages of privacy.Unknown in the codebase, and update those to either plumb the proper Privacy value from context, or set the right value (Private or Public) (aside: it'd be fun to use a Code Insights dashboard to track this!). Once we know for sure that there are no privacy.Unknown values (or very few), we can make it so that privacy.Unknown text values are also redacted.

Q: Why introduce a new type instead of only working with functions? A: Because using a type gives better guarantees. Eventually, we want to start using that type in more places, so that the same field of a struct is logged consistently (with or without redaction) in different places.

If you are on-board with this suggestion, I can try to come up with a rough API and maybe even submit a PR as time permits. Happy to discuss/brainstorm ideas over a call too. I'm out for the next 2 days though as I'll be traveling.

varungandhi-src commented 2 years ago

@bobheadxi, I don't think the "close" is quite right in:

(Although, maybe you didn't select "close", it just came up automatically?)

The linked PR (as it stands) won't fully resolve this issue because we still have 350+ uses of log.String from opentracing-go/log. I have some ideas on how we could address that; for example, we could create a shim analogous to how we shim zap (we still need to construct values of type opentracing-go/log.Field). But I don't want to tackle that in the same PR because that PR is already quite big.

bobheadxi commented 2 years ago

It's just the only way to get a concrete "link" between issue and PR that is visible in the GitHub UI, if it doesn't fully resolve we can just re-open this issue after the fact to link multiple PRs :)

bobheadxi commented 2 years ago

Update - @vrto working on collecting more specifics on the business need side of things to address questions raised in https://github.com/sourcegraph/sourcegraph/pull/36300 , e.g. https://github.com/sourcegraph/sourcegraph/pull/36300#issuecomment-1148378630 and https://github.com/sourcegraph/sourcegraph/pull/36300#issuecomment-1145806577

Slack thread

keegancsmith commented 2 years ago

I'll raise something here I mentioned in the linked PR, I think encoding privacy somehow into the key name will give us the most bang for the buck. Simple for a program to do redaction on structured logs if done this way. Depending on the logger, it feels like it is something that should be possible.

bobheadxi commented 2 years ago

Note that sensitive information shows up in errors as well, e.g. https://sourcegraph.slack.com/archives/C1JH2BEHZ/p1655314729793979 - a key-name-based approach would not provide protection against this, unless we choose to redact entire errors (which could be problematic for debugging). Protection against this is one nice property of the privacy.Text approach in https://github.com/sourcegraph/sourcegraph/pull/36300 - we will need to address this as well

sourcegraph / sourcegraph-public-snapshot

log: sensitive data redaction #33473