nickdeis / eslint-plugin-no-secrets

An eslint plugin to find strings that might be secrets/credentials
MIT License

real entropy #4

Closed oprogramador closed 3 years ago

oprogramador commented 4 years ago

IMO Shannon entropy isn't a good measure here, because a string repeated 100 times has the same entropy as the string on its own. Of course, repeating the same sequence doesn't add much information, but it does add some.

IMO:

https://www.shannonentropy.netmark.pl/calculate

oprogramador commented 4 years ago

or another example - according to Shannon, the entropy of a is 0 and the entropy of aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa is 0 as well

oprogramador commented 4 years ago

Or:
abcdefabcdef -> 2.58496
abcdefcbafed -> 2.58496

oprogramador commented 4 years ago

Or:
01 -> 1
0100001011110101000000010000100100001100110101001100011101110100110101011110110111110001110111110100 -> 1

oprogramador commented 4 years ago

01 -> 1
00001 -> 0.72193
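The figures quoted in these comments can be reproduced with a few lines of JavaScript. This is a minimal sketch that assumes per-character Shannon entropy in bits, H = -Σ p·log2(p) over character frequencies; the function name `shannonEntropy` is chosen here for illustration and is not taken from the plugin's source.

```javascript
// Per-character Shannon entropy in bits, from character frequencies.
// Illustrative helper; not the plugin's actual implementation.
function shannonEntropy(str) {
  const counts = new Map();
  let total = 0;
  for (const ch of str) {
    counts.set(ch, (counts.get(ch) || 0) + 1);
    total += 1;
  }
  let entropy = 0;
  for (const n of counts.values()) {
    const p = n / total;
    entropy -= p * Math.log2(p);
  }
  return entropy;
}

// Length and ordering do not matter, only the character frequencies.
console.log(shannonEntropy('a').toFixed(5));             // 0.00000
console.log(shannonEntropy('a'.repeat(100)).toFixed(5)); // 0.00000
console.log(shannonEntropy('abcdefabcdef').toFixed(5));  // 2.58496
console.log(shannonEntropy('abcdefcbafed').toFixed(5));  // 2.58496
console.log(shannonEntropy('01').toFixed(5));            // 1.00000
console.log(shannonEntropy('00001').toFixed(5));         // 0.72193
```

Because the measure depends only on the relative frequency of each character, it cannot tell a short string from a long one, or an ordered string from a shuffled one.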

nickdeis commented 4 years ago

Hey @oprogramador, Thank you for the compelling issue. I'm currently looking into this. I have added this plugin to a few of the larger projects I work on. I think the current problem is that the false positives tend to be actual words. This isn't an issue until you have large inline strings containing things like paragraphs (for example, auto-generated docs). I'm still trying to think of a good solution. Let me know what your thoughts are. I'm going to keep brainstorming. Maybe some NLP? Cheers, Nick

oprogramador commented 4 years ago

@nickdeis

Here's my solution: https://github.com/oprogramador/eslint-plugin-no-credentials/blob/master/src/calculateStrongEntropy.js

It multiplies (Shannon entropy + 1) by (zipped data length - 20). The 20 is subtracted because the zipped length is always at least 20.
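A rough sketch of the formula described above might look like the following. It is not the code from the linked calculateStrongEntropy.js. It assumes that "zipped" means gzip via Node's built-in `zlib.gzipSync`, chosen because gzip output for empty input is 20 bytes, which matches the stated minimum. The names `shannonEntropy` and `strongEntropySketch` are hypothetical.

```javascript
const zlib = require('zlib');

// Per-character Shannon entropy in bits (illustrative helper).
function shannonEntropy(str) {
  const counts = new Map();
  let total = 0;
  for (const ch of str) {
    counts.set(ch, (counts.get(ch) || 0) + 1);
    total += 1;
  }
  let entropy = 0;
  for (const n of counts.values()) {
    const p = n / total;
    entropy -= p * Math.log2(p);
  }
  return entropy;
}

// (Shannon entropy + 1) * (zipped length - 20).
// ASSUMPTION: gzip is the compressor; the linked library may use another.
function strongEntropySketch(str) {
  const zippedLength = zlib.gzipSync(Buffer.from(str, 'utf8')).length;
  return (shannonEntropy(str) + 1) * (zippedLength - 20);
}

// Unlike plain Shannon entropy, the score now grows with incompressible length.
console.log(strongEntropySketch('a'.repeat(32)));
console.log(strongEntropySketch('q8Zr2LxV0nT5bK7yW3mJ9cH1dF6gS4pA'));
```

The `+ 1` keeps strings with zero Shannon entropy from always scoring zero, and the compressed-length factor is what makes long, hard-to-compress strings score higher than short or repetitive ones.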

oprogramador commented 4 years ago

you can see the results here https://github.com/oprogramador/eslint-plugin-no-credentials/blob/master/src/tests-mocha/calculateStrongEntropy.js

nickdeis commented 4 years ago

Super interesting. Wouldn't entropy and compression rates be colinear? I suppose this ends up being a weighted measure of entropy and string length. Any reference material used to come up with this?

nickdeis commented 3 years ago

Closing as over a year old

oprogramador commented 3 years ago

@nickdeis

I came up with my own approach in my library to get a reasonably good measure of the amount of information.