godaddy / tartufo

Searches through git repositories for high entropy strings and secrets, digging deep into commit history
https://tartufo.readthedocs.io/
GNU General Public License v2.0

Entropy scanning producing extra (incorrect) outputs #315

Open sushantmimani opened 2 years ago

sushantmimani commented 2 years ago

🐛 Bug Report

Values like the ones below are being flagged as high entropy findings when they shouldn't be.

policy/service-role/AWSLambdaVPCAccessExecutionRole
b/Standards-Best-Practices/Speed-of-Delivery/TEMPLATE

Expected Behavior

Values should not be flagged as high entropy findings.

Environment

tartufo v3.0.0

rbailey-godaddy commented 2 years ago

Well, I thought I'd share some analysis of this issue as we search for the best response to it.

Let's start with the underlying assumption that's rarely laid out anywhere. The secrets used by serious systems are produced by high-quality cryptographic algorithms, and the better the algorithm, the more its output looks like random garbage. Because humans don't deal well with strings of random bytes, common practice is to encode the raw data as printable text. There are lots of ways to do this, but two of the most common are hexadecimal representation (two characters per byte) and base64 (more efficient, using four characters to represent three bytes).
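To make that concrete, here's a minimal sketch (freshly generated bytes, nothing from this issue) showing the two encodings side by side:

```python
import base64
import secrets

raw = secrets.token_bytes(30)  # 30 random bytes, standing in for raw key material

hex_form = raw.hex()                       # 2 chars per byte -> 60 characters, alphabet 0-9a-f
b64_form = base64.b64encode(raw).decode()  # 4 chars per 3 bytes -> 40 characters, alphabet A-Za-z0-9+/

print(len(hex_form), hex_form)
print(len(b64_form), b64_form)
```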

Instead of laboriously checking the entire content of a file for entropy, truffleHog and tartufo checked entropy only for strings that were hexadecimal or "probably" base64 encodings. (This could include strings that weren't actually legal encodings but still used only the base64 alphabet.) This is much more efficient, but ignores anything that doesn't happen to be using one of those two encoding methods.

#177 requested support for base64url encoding, a variant that uses - and _ instead of + and /, and which is easier to deal with in contexts (like URLs) where the original base64 alphabet is problematic. Our problem is foreshadowed by the RFC section title: "Base 64 Encoding with URL and Filename Safe Alphabet".
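Python's standard library exposes both alphabets, which makes the substitution easy to see (just a demo, not tartufo code):

```python
import base64

raw = bytes([0xFB, 0xEF, 0xFF])  # bytes chosen to land on the last two alphabet slots

print(base64.b64encode(raw).decode())         # '++//' -- classic base64 uses + and /
print(base64.urlsafe_b64encode(raw).decode()) # '--__' -- same data, - and _ instead
```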

The approach incorporated in 3.0.0 was to relax the base64 scanner slightly so it would recognize both base64 and base64url (as well as things that were illegal combinations of both) using a single regular expression. This satisfied #177 -- base64url encodings would be detected and entropy-checked -- and had negligible performance impact because we did not materially alter the algorithm for scanning input strings. It also didn't turn up any problems in our initial testing.
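Schematically, the relaxation amounts to widening one character class (a hand-written approximation, not the actual pattern shipped in 3.0.0):

```python
import re

# 2.x-style: strict base64 alphabet only
BASE64_RE = re.compile(r"[A-Za-z0-9+/=]{20,}")

# 3.0.0-style: also admit - and _, so a single pattern matches base64,
# base64url, and illegal mixtures of the two
RELAXED_RE = re.compile(r"[A-Za-z0-9+/\-_=]{20,}")

line = "b/Standards-Best-Practices/Speed-of-Delivery/TEMPLATE"
print(BASE64_RE.findall(line))   # [] -- the - characters split it into short fragments
print(RELAXED_RE.findall(line))  # the entire 53-character path matches as one candidate
```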

The gotcha, as has been quickly revealed following 3.0.0's release, is that it is extremely common for long file and path names (among other things) to contain - or _ characters. Previously these names would have been split at each hyphen or underscore (because those characters are not part of the base64 encoding alphabet) and the individual name fragments would either be too short or have too little entropy to generate an issue. Now, much longer strings are evaluated, and they often have entropy just high enough to trigger an alarm.

The question is, "what do we do about it?" Obviously we'd like tartufo to magically ignore all of the "obviously nonrandom" strings while correctly reporting all of the actually random strings (and do it as fast or faster than before). It is not clear this is possible. I wanted to outline the approaches we have considered and why all of them suck.

First off, there is the issue of "obviously nonrandom". Humans are pretty good at looking at something like b/Standards-Best-Practices/Speed-of-Delivery/TEMPLATE and saying "oh, that's a filename". However, by the measure tartufo uses -- basically Shannon entropy -- there are enough different characters used in this string that the information content ("entropy") is suspiciously high and tartufo generates an issue.
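For the curious, the character-frequency form of Shannon entropy in question looks roughly like this (a sketch, not tartufo's exact implementation):

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Bits of entropy per character, estimated from character frequencies."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

path = "b/Standards-Best-Practices/Speed-of-Delivery/TEMPLATE"
print(shannon_entropy(path))  # ~4.5 bits/char -- suspiciously high for an obvious filename
```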

The approach taken by #319 is to observe that this string obviously can't be a real encoding, because it contains / (which is illegal in base64url) and - (which is illegal in base64). If we considered the string as base64 (like tartufo 2.x) we'd examine strings b/Standards, Best, Practices/Speed, of, and Delivery/TEMPLATE -- all of which are shorter than 20 characters and therefore ignored. If we considered the string strictly as base64url, we'd examine strings b, Standards-Best-Practices, Speed-of-Delivery, and TEMPLATE -- of these, Standards-Best-Practices and Speed-of-Delivery are long enough for more careful examination, and both fall safely below the "suspicious entropy level" threshold.
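A rough sketch of that per-alphabet splitting (my own illustration of the logic, not the code from #319):

```python
import re

line = "b/Standards-Best-Practices/Speed-of-Delivery/TEMPLATE"

# Strict base64 alphabet (2.x behaviour): - and _ act as separators
print(re.findall(r"[A-Za-z0-9+/=]+", line))
# ['b/Standards', 'Best', 'Practices/Speed', 'of', 'Delivery/TEMPLATE']
# -- all shorter than 20 characters, so none get entropy-checked

# Strict base64url alphabet: / acts as the separator instead
print(re.findall(r"[A-Za-z0-9\-_=]+", line))
# ['b', 'Standards-Best-Practices', 'Speed-of-Delivery', 'TEMPLATE']
# -- only the longer fragments merit the more careful entropy examination
```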

Of course, it's not quite that easy. Many alphanumeric strings (that don't contain - or _ or + or /) are valid encodings in both alphabets, and we don't want to generate duplicate issues. Therefore the new algorithm takes care to eliminate duplicate strings and not create issues for substrings of strings that already generated issues.
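The bookkeeping can be as simple as reporting longer candidates first and skipping anything already covered (only a sketch of the idea, not the actual implementation):

```python
def dedupe(base64_hits: set[str], base64url_hits: set[str]) -> list[str]:
    """Report each candidate once; skip substrings of already-reported candidates."""
    reported: list[str] = []
    for candidate in sorted(base64_hits | base64url_hits, key=len, reverse=True):
        if not any(candidate in longer for longer in reported):
            reported.append(candidate)
    return reported

# A purely alphanumeric string is valid in BOTH alphabets -> must yield one issue, not two
print(dedupe({"DeadBeefCafe0123456789"}, {"DeadBeefCafe0123456789"}))
```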

The price we pay for this is an extra pass over the data (to look for base64url encodings, distinct from the base64 pass) and the added deduplication recordkeeping -- hopefully largely offset by the elimination of a pass over the data (to split lines on whitespace before checking for encodings) and regex changes to short-circuit production of too-small strings that would be immediately discarded anyway.

However, in real life we have already found examples of filenames that still exceed the entropy threshold and generate issues. Therefore, from a strict compatibility viewpoint, #319 sucks. It just seems to suck less than the alternatives.

What are the alternatives?

We could drop base64url support, but that sucks. We shouldn't ignore possibly sensitive data (using industry-standard encoding, even!) and we know at least one project actually uses these encodings and wants to check them. We could add a switch to enable or disable base64url support, but that sucks, too. Maybe a repository owner doesn't realize that there is base64url-encoded content present. Maybe they get a bunch of "spurious" issues driven by filenames and turn off the check, allowing legitimately problematic content to pass unscanned. It feels like too big a hammer.

Several observers have suggested changing the default --entropy-sensitivity setting. Using a higher number increases the amount of entropy that must be observed before an issue is generated, and evidently works for some people. The problem with changing the default is that a higher number may not be generally appropriate. First off, the origin of the existing default -- 75 -- remains a mystery. It's theoretically possible for a very high entropy hexadecimal string to look like a "high-enough" entropy base64 string and generate duplicate issues if the sensitivity is lower than 67 (a problem #319 also addresses), so maybe 75 is a "round" number safely above that cutoff. But is there a reason it isn't higher? Was it just an educated guess? Nobody seems to know.
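The 67 cutoff itself falls out of the alphabet sizes: hexadecimal tops out at 4 bits of entropy per character and base64 at 6, so a maximally random hex string scores 4/6 of the base64 maximum. Assuming sensitivity is expressed as a percentage of the alphabet's maximum per-character entropy (my reading of the scaling):

```python
import math

hex_max = math.log2(16)     # 4.0 bits/char for the 16-character hex alphabet
base64_max = math.log2(64)  # 6.0 bits/char for the 64-character base64 alphabet

# A perfectly random hex string, scored against the base64 maximum:
print(round(hex_max / base64_max * 100))  # 67 -- sensitivities below this allow duplicates
```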

However, we have examples of real-life filename fragments that have entropy equivalent to real-life AWS_SECRET_ACCESS_KEY strings. That is, if we set the bar high enough to avoid false positives on these filenames, we get false negatives on sensitive strings we definitely want to know about (and which existing tartufo versions will report). From a security perspective, this cure is worse than the disease, and that sucks.

Another option we investigated was to build on the concept of "that filename obviously looks nonrandom". Could we make these obvious to tartufo also? I'd love to hear from somebody with a workable approach, but the things we've considered suck. The crux of the problem is that "that filename" is still a valid encoding with high entropy.

From a statistical viewpoint, we played with looking at sequencing -- ABCDE and EDCBA don't look very random, while ACEDB looks more random. They all have the same (high) entropy, however. A non-mathematically-rigorous test that looks at how many times each character in the string is smaller or larger than its predecessor can spot the difference in these examples, but flopped in real-life testing. We found examples both of known-sensitive encodings we wished to flag and obviously harmless filename components that resulted in basically equal scores, making such a test not very useful. You can find descriptions via Google of randomness tests that examine the distribution of runs, but they typically operate on sequences multiple orders of magnitude larger than what tartufo encounters -- also not useful.
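For concreteness, the up/down test described above looked something like this (a reconstruction of the idea, not the exact experiment):

```python
def ascent_fraction(s: str) -> float:
    """Fraction of adjacent character pairs where the successor is 'larger'."""
    ups = sum(1 for a, b in zip(s, s[1:]) if b > a)
    return ups / (len(s) - 1)

for sample in ("ABCDE", "EDCBA", "ACEDB"):
    print(sample, ascent_fraction(sample))
# ABCDE 1.0 -- monotonically ascending, clearly non-random
# EDCBA 0.0 -- monotonically descending, clearly non-random
# ACEDB 0.5 -- mixed directions, looks more random
```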

Alternatively, we could say "well, it's a filename" -- but realistically we don't really KNOW it's a filename; we just have human intuition. The purported name doesn't necessarily exist in the repository, so it's not like the scanner can go look and say "yay, I found it so I guess it's probably a reference and not some encoding that just looks like a filename!" Maybe it's something that exists in a different repository or is present in the runtime environment where the repository's generated artifacts will be deployed. So, in the absence of any specifics, this approach sucks too.

That brings us, reluctantly, to the point of saying "this is a file, it's okay" and "this is another file, it's okay too" and... wait -- those are exclusions. Tartufo already knows how to do that. But can it do it well enough? We already have efficiency concerns without adding even more exclusions, and that's an ongoing cost piled on top of the one-time cost of adding them. But that's a story for another issue or PR.