shannon entropy lib code

GoogleCodeExporter commented 9 years ago

> What features would the enhancement add?

calculation of shannon entropy (bits/char [float] as well as whole-word total 
entropy [int])

> What parts of the project would this effect?

None.

It is just for sw developers so they can display the total entropy to users 
when they define a new password or use it in a password validator.

It's not a perfect or great solution (as there can be none, see docstring), but 
maybe sucks less and is more well-defined as primitive or made-up password 
quality metrics quick hacks usually invented out there for these purposes.

> Please provide any additional information below.

I have some code to contribute (if you are interested), see the codereview 
there:

https://codereview.appspot.com/8706044

Original issue reported on code.google.com by Thomas.J...@gmail.com on 21 Apr 2013 at 4:00

GoogleCodeExporter commented 9 years ago

[my apologies ahead of time for the length of this response]

That'd fit in wonderfully with a password-generation module I'm working on for 
the next release (I should really get that branch cleaned up and pushed to the 
public passlib repo).  As far as your code goes, it looks good, and I'm 
definitely interested in contributions.  That said, I'd already been thinking 
about adding a password-strength estimator to that module, and my main problem 
is that like you stated, it's hard (if not impossible) to find a good solution, 
particularly because of the context-dependant nature of entropy calculations. 

I've been doing idle reading on the topic on and off for a while, and I feel 
like every algorithm I came across would grossly over- or under- estimate the 
strength of select passwords.  This would in turn give users either a false 
sense of security, or cause them to ignore the warning entirely because they 
thought the measurement was wrong.  If this was compounded by the application 
*forcing* it's assumptions on the user, it just made things worse. 

In particular: a direct Shannon entropy calculation is certainly the most 
transparent way of calculating password strength, implementable without the 
need for case studies and complex language- and charset- dependant metrics.  
But I'm uncomfortable particularly because of that property... I feel like it 
gives a false sense of security, since for example it would report "12345678" 
and "ydc4wZFQ" as being equally strong.

Not that I've had or found any better ideas (and this paper on guessing entropy 
made me much less hopeful of any good prospects -- 
http://reusablesec.blogspot.com/2010/10/new-paper-on-password-security-metrics.h
tml).  But I wanted to lay out why I'm hesitant to add this (or any) strength 
measurement code, until I feel more confident of how good or bad it's 
recommendations will be :) 

Which does finally lead me to one question about your code -- how did you 
arrive at the "weak" and "strong" entropy levels? For my part, the main 
guideline I've been looking at is NIST 800-63 
(http://csrc.nist.gov/publications/nistpubs/800-63/SP800-63V1_0_2.pdf), but 
that paper attaches so many other security assumptions I'm not sure if it's 
entropy recommendations are actually worthwhile for this case.

Original comment by elic@astllc.org on 28 Apr 2013 at 12:49

Changed state: Discussing
Added labels: Milestone-Release1.7, Type-Enhancement

GoogleCodeExporter commented 9 years ago

some notes:

a) a guy who is much more deep into crypto than I am noted that I am NOT 
computing shannon entropy with the original code (because that always depends 
on the generating system, which we do not know enough about). this is why I 
updated the codereview accordingly.

b) the current code can likely only tell when a password is weak or maybe-weak. 
it can't really tell when it is strong. so maybe i should rename strong_limit 
to maybe_weak_limit (that would be also more consistent with the weak_limit). 
due to that, it can not be used as only metrics for deciding about strong 
passwords, but maybe it can be used to reject some very weak passwords and warn 
about maybe-still-too-weak passwords. I think it fits for the goal to be better 
than just len(p) as metrics.

c) those levels defined in the function header were just made up after some 
experimentation. i threw some pwgen passwords at it and adjusted so they aren't 
usually classified as maybe-still-to-weak.

Original comment by Thomas.J...@gmail.com on 28 Apr 2013 at 12:07

GoogleCodeExporter commented 9 years ago

I hope you don't mind, I've gone ahead and merged your code into the 
passlib.pwd module. Happy to accept more patches if you're still revising it 
though. Everything should be in the main passlib hg repo (as well as my 
bitbucket mirror, which might be easier to fork -- 
https://bitbucket.org/ecollins/passlib). 

I'm hoping to get some more time by this weekend to work on the password 
strength estimation some more, I've got a few more algorithms I'm going to try 
to work in, and some related API changes.

Thanks for the patch!

- Eli

re: a) My understanding was that what strength() calculates *is* a form of 
Shannon entropy: the "self information" entropy, measuring the minimum number 
of bits of information that need to be conveyed to transmit the string, without 
contextual knowledge of it's generator. But my knowledge of information theory 
is far far from perfect... I'm going to go study a good information theory 
textbook before the next passlib release, and make sure everything is correct :)

re: b) I like the renaming away from "strong", it's definitely more accurate 
this way. I think the more general problem is that the current strength() test 
(as well as others I've seen) don't report an exact classification, but rather 
a potentially open-ended range with lower and upper bounds... e.g. "2" means 
"maybe-weak or better". I'm hoping that after implementing a few more 
measurement algorithms, a more generalized representation will pop out.

Original comment by elic@astllc.org on 7 May 2013 at 4:45

Changed state: Started

GoogleCodeExporter commented 9 years ago

just a quick note about b): i wanted a discrete value with not too much 
possible values so it can be rendered e.g. on a web ui using colors or stars or 
whatever. if one needs a float, one can call the inner function.

Original comment by Thomas.J...@gmail.com on 12 May 2013 at 12:11

GoogleCodeExporter commented 9 years ago

Some new code - tries to use self-entropy as well as identifying unicode 
character ranges used for generation of the password. Using both seems to be a 
better metric for pw strength, see the doctests.

https://codereview.appspot.com/9142045

In case you review it, leave comments there.

Original comment by Thomas.J...@gmail.com on 12 May 2013 at 9:37

glic3rinu / passlib

shannon entropy lib code #48