trufflesecurity / trufflehog


Improving TruffleHog's Entropy Calculation #168

Closed Seancarpenter closed 2 years ago

Seancarpenter commented 5 years ago

Although the core algorithm used to compute Shannon entropy is correct, the data set TruffleHog uses to determine character probabilities (a string containing every Base64 or hex character) is highly ineffective. For Shannon entropy to be calculated effectively, we need a probability distribution table that more accurately reflects how likely each character is to actually appear in a string. In a previous experiment, I attempted to build that probability distribution table by first ingesting the repository being scanned. While I found that implementation to be more effective than the one TruffleHog currently uses, I ultimately had better success with a different solution: using the string being scored as the single source of probabilistic truth. My findings are detailed below.

For reference, I've provided the original entropy function (labeled truffle_entropy here), along with its two entry points:

import math

def truffle_entropy(data, iterator):
    if not data:
        return 0
    entropy = 0
    for x in iterator:
        # probability of each alphabet character appearing in the string
        p_x = float(data.count(x)) / len(data)
        if p_x > 0:
            entropy += -p_x * math.log(p_x, 2)
    return entropy

truffle_entropy(string, BASE64_CHARS)
truffle_entropy(string, HEX_CHARS)

A more accurate implementation (labeled single_entropy), which uses the string itself as the single source of probabilistic truth, is detailed below:

def single_entropy(s):
    # derive each character's probability from the string itself
    pd_table = {c: s.count(c) / len(s) for c in s}
    entropy = 0
    for x in s:
        entropy += -pd_table[x] * math.log2(pd_table[x])
    return entropy

When comparing the two on a secret key and a decently long ordinary string, single_entropy assigns the secret a much higher entropy, while truffle_entropy can barely differentiate between them:

BASE64_CHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/="

secret = "2b95710rD1e6287e69Z8f2E24373449d879b70c7601B3x9"
long_string = "ThisIsAReallyLongString"

single_entropy(long_string) > 5.620762806516582
single_entropy(secret) > 11.722104499210204

truffle_entropy(long_string, BASE64_CHARS) > 4.088779347361362
truffle_entropy(secret, BASE64_CHARS) > 4.1152539475979655

Although these results are far from conclusive, I would ask that we reevaluate how TruffleHog calculates entropy, as I'm not convinced the current approach is very effective.

As alluded to at the beginning of this post, I believe we can ultimately achieve better results with a probability distribution table generated from real code (i.e. by scanning existing repositories). However, this change seems like an easy win that doesn't introduce any additional overhead, so I wanted to bring it up.
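To make that corpus idea concrete, here is a rough sketch (the helper names and the floor probability are hypothetical, not a worked-out proposal): build a character-frequency table from a body of real code, then score candidate strings by their average surprisal under that distribution.

from collections import Counter
import math

def build_corpus_table(corpus):
    # character frequencies observed in a corpus of real code
    n = len(corpus)
    return {c: count / n for c, count in Counter(corpus).items()}

def avg_surprisal(s, table, floor=1e-6):
    # average bits of "surprise" per character of s under the corpus
    # distribution; characters never seen in the corpus get a small floor
    return sum(-math.log2(table.get(c, floor)) for c in s) / len(s)

Secrets would then stand out as strings whose characters are collectively unlikely under the distribution of ordinary code.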

domanchi commented 4 years ago

Hi @Seancarpenter,

Author of detect-secrets here. Despite this post being pretty old, I was finally able to take some time and digest this great proposal of yours. I think you may be onto something, but if I'm understanding it correctly, your proposed algorithm needs some tweaking.

To demonstrate my case, let me state my three assumptions that should hold true for any form of entropy calculation:

  1. The max entropy limit is obtained when every character in the string is different (e.g. abcd)
  2. The min entropy limit is obtained when every character in the string is the same (e.g. aaaa)
  3. With every unique character that is added to the "min entropy limit" case, we would expect the entropy value to rise. The converse is also true: with every non-unique character we add to the "max entropy limit" case, we would expect the entropy value to drop. (e.g. calculate_entropy("aabc") < calculate_entropy("abcd"))

In your proposed single_entropy function, however, this third point does not hold true.

import string

data = list(string.ascii_lowercase)
for index in range(len(data)):
    data[index] = 'a'    # gradually make all characters homogeneous
    print(''.join(data), single_entropy(''.join(data)))

This produces:

abcdefghijklmnopqrstuvwxyz 4.700439718141092
aacdefghijklmnopqrstuvwxyz 4.908165850305791
aaadefghijklmnopqrstuvwxyz 5.236515710539404
aaaaefghijklmnopqrstuvwxyz 5.639104203436979
aaaaafghijklmnopqrstuvwxyz 6.083539410088697
aaaaaaghijklmnopqrstuvwxyz 6.544845161151518
aaaaaaahijklmnopqrstuvwxyz 7.002673448183524
aaaaaaaaijklmnopqrstuvwxyz 7.439848341829596
aaaaaaaaajklmnopqrstuvwxyz 7.841506433115373
aaaaaaaaaaklmnopqrstuvwxyz 8.19454606983194
aaaaaaaaaaalmnopqrstuvwxyz 8.48725291584906
aaaaaaaaaaaamnopqrstuvwxyz 8.709033667786388
aaaaaaaaaaaaanopqrstuvwxyz 8.850219859070549
aaaaaaaaaaaaaaopqrstuvwxyz 8.901919101925264
aaaaaaaaaaaaaaapqrstuvwxyz 8.8558995949762
aaaaaaaaaaaaaaaaqrstuvwxyz 8.704498654828098
aaaaaaaaaaaaaaaaarstuvwxyz 8.440549034026825
aaaaaaaaaaaaaaaaaastuvwxyz 8.057318690597437
aaaaaaaaaaaaaaaaaaatuvwxyz 7.548460920107216
aaaaaaaaaaaaaaaaaaaauvwxyz 6.907972600397635
aaaaaaaaaaaaaaaaaaaaavwxyz 6.130158878672838
aaaaaaaaaaaaaaaaaaaaaawxyz 5.209603039707737
aaaaaaaaaaaaaaaaaaaaaaaxyz 4.141140588342356
aaaaaaaaaaaaaaaaaaaaaaaayz 2.9198367950063555
aaaaaaaaaaaaaaaaaaaaaaaaaz 1.5409671133507974
aaaaaaaaaaaaaaaaaaaaaaaaaa 0.0

We can see that the values rise and then fall in a parabolic arc, which certainly does not follow the intuition of entropy in strings. As such, I think your algorithm may require some tweaking to be effective.
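One way to see where the hump comes from (a sketch of the algebra, not a fix): because single_entropy iterates over every position of s rather than over its unique characters, each character's -p * log2(p) term ends up multiplied by its count. Assuming I've read the function correctly, this closed form produces identical values:

from collections import Counter
import math

def single_entropy_equivalent(s):
    # each character's -p * log2(p) term is weighted by how many
    # times that character occurs in s
    n = len(s)
    return sum(count * -(count / n) * math.log2(count / n)
               for count in Counter(s).values())

That count weighting means the growing block of a's initially adds entropy faster than the disappearing letters remove it, which produces the rise-then-fall shape above.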

prabhu commented 3 years ago

I noticed that all of these entropy calculators seem to be using an incorrect base of 2. Base 2 is for bytes, whereas for ASCII and hex the base should represent the number of possible combinations, so it should be 256 and 16 respectively.

https://en.wiktionary.org/wiki/Shannon_entropy
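Using base 16 or base 256 amounts to dividing the base-2 value by a constant, since log_b(x) = log2(x) / log2(b). A quick sketch (the helper name is just for illustration):

import math

# log_b(x) = log2(x) / log2(b): entropy in base 16 is the base-2 value
# divided by 4, and in base 256 it is divided by 8
def rescale(entropy_base2, base):
    return entropy_base2 / math.log2(base)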

rbailey-godaddy commented 3 years ago

FWIW, while doing the work on tartufo #274 mentioned above, I boiled the original algorithm down to this optimized version:

import math
from collections import Counter

def new_calculate_entropy(data: str) -> float:
    """Calculate the Shannon entropy for a piece of data.

    This essentially calculates the overall probability for each character
    in `data` to be present. By doing this, we can tell how random a
    string appears to be.

    Borrowed from http://blog.dkbza.org/2007/05/scanning-data-for-entropy-anomalies.html

    :param data: The data to be scanned for its entropy
    :return: The amount of entropy detected in the data
    """
    if not data:
        return 0.0
    frequency = Counter(data)
    entropy = 0.0
    float_size = float(len(data))
    for count in frequency.values():
        probability = float(count) / float_size
        entropy += -probability * math.log2(probability)
    return entropy

Testing this alongside the original algorithm yields identical entropy results. The problem with the proposed algorithm above is that for x in s does not deduplicate s: with an input of "testing", the contribution of "t" gets counted twice, when it should be counted just once.
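For illustration, the minimal fix to the earlier single_entropy sketch is to sum over the probability table's values instead of over every position (the dict comprehension already deduplicates its keys):

import math

def fixed_single_entropy(s):
    if not s:
        return 0
    pd_table = {c: s.count(c) / len(s) for c in s}
    # the dict keys are unique characters, so each term counts once
    return sum(-p * math.log2(p) for p in pd_table.values())

With that change, fixed_single_entropy("testing") comes out to roughly 2.52 bits, matching new_calculate_entropy.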

rbailey-godaddy commented 3 years ago

I noticed that all of these entropy calculators seem to be using an incorrect base of 2. Base 2 is for bytes, whereas for ASCII and hex the base should represent the number of possible combinations, so it should be 256 and 16 respectively.

https://en.wiktionary.org/wiki/Shannon_entropy

As you point out, "base 2 is for bytes" and the strings you're looking at are putatively ASCII bytes. (The thought of extending this to Unicode makes my brain hurt.) In that sense, the computation is correct.

The trick is, after the fact, to scale the entropy based on the actual bit rate (so to speak) of whatever representation you're looking at. For example, a hexadecimal string is composed of characters that each represent a base-16 digit, i.e. a 4-bit number. Base64 representations use 4 (8-bit) characters to represent 24 bits of input, so each character represents 6 bits of data. Full 8-bit ASCII (or raw data) represents 8 bits of data per byte.
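As a sketch of that scaling idea (the names here are hypothetical, not tartufo's API): express the measured bits-per-character entropy as a fraction of the maximum a character can carry in its representation, so that one threshold is comparable across encodings.

# maximum bits of information per character for each representation
BITS_PER_CHAR = {"hex": 4.0, "base64": 6.0, "ascii": 8.0}

def entropy_ratio(entropy_bits_per_char, encoding):
    # 1.0 means the string looks fully random for its encoding
    return entropy_bits_per_char / BITS_PER_CHAR[encoding]

A hex string and a base64 string with the same raw Shannon entropy then get different ratios, which is why a per-encoding trigger level makes sense.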

Tartufo (or TruffleHog, from which it is derived) scales the entropy trigger level for each string based on what it is believed to represent (i.e. base64 or hex encodings) in order to determine whether or not a finding should be generated.

dustin-decker commented 2 years ago

Hey there, we've just released the next major version of TruffleHog!

It is a complete rewrite that scans more data sources and now supports detecting and verifying over 600 credential types. Please check it out when you can.

https://trufflesecurity.com/blog/introducing-trufflehog-v3

We're no longer maintaining v2, so I am closing this issue.