yymmsong opened this issue 4 years ago
Hi! Thanks for the comment, it's great to see that this code is useful and used!
Would you be able to do a pull request for this?
I wonder if you could also check whether the same problem happens in LazyBackoff...
Hi, thanks for your reply. Your code is really helpful for newcomers to password security like me.
I just looked into the LazyBackoff code and noticed a similar (but not exactly the same) problem. In LazyBackoff, it seems that the probabilities of transitions out of the same state will not always add up to 1, because the parameter \alpha in https://en.wikipedia.org/wiki/Katz%27s_back-off_model is not set properly (its denominator is missing). I'll re-check whether this is the case and see if I can fix the problem.
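To make the normalization issue concrete, here is a minimal sketch of Katz's back-off weight with the denominator in place. All names (`katz_alpha`, `katz_prob`, the fixed absolute `discount`) are hypothetical and not taken from the repo; the point is only that dividing the left-over mass by the lower-order mass of the *unseen* symbols is what makes the distribution sum to 1.

```python
from collections import Counter

def katz_alpha(counts_h, lower_prob, discount=0.5):
    """Back-off weight alpha(h) for one history h, per Katz's definition.

    counts_h: Counter mapping next symbol -> count observed after history h
    lower_prob: dict mapping symbol -> lower-order probability (sums to 1)
    """
    total = sum(counts_h.values())
    # Probability mass reserved by discounting the observed transitions.
    left_over = 1.0 - sum((c - discount) / total for c in counts_h.values())
    # Katz divides by the lower-order mass of the unseen symbols; dropping
    # this denominator is exactly the bug described above, and without it
    # the per-history probabilities need not sum to 1.
    unseen_mass = sum(p for s, p in lower_prob.items() if s not in counts_h)
    return left_over / unseen_mass

def katz_prob(symbol, counts_h, lower_prob, discount=0.5):
    """Katz back-off probability of `symbol` after history h."""
    total = sum(counts_h.values())
    if symbol in counts_h:
        # Observed transition: discounted higher-order estimate.
        return (counts_h[symbol] - discount) / total
    # Unseen transition: back off, scaled by alpha(h).
    return katz_alpha(counts_h, lower_prob, discount) * lower_prob[symbol]
```

With, say, `counts_h = Counter({'a': 3, 'b': 1})` and a lower-order distribution over `{'a','b','c','d'}`, summing `katz_prob` over all four symbols gives exactly 1, which is what fails when the denominator is omitted.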
Thanks for checking, please let me know what you find out! Honestly, it's code I haven't touched for a long time so it's not trivial to remember what it does :)
BTW, I'm curious about what you're working on--if you can share that. And of course, please let me know if/when you have something that's publicly available!
I'm working on so-called "honey objects" like honeywords, etc., but recently, as a side project, I'm investigating how different n-gram smoothing algorithms affect the performance of password cracking models. I'll make my work public as soon as I've finished it.
Hi, I just modified BackoffModel and LazyBackoff to match Katz's and Ma's definitions. However, it's unclear whether the modification improves or degrades performance (it's quite possible that the implementation here is better than Ma's, since interpolation generally outperforms pure backoff), so it may be better to leave the code as it is until further evaluation. You can check or merge the change in my forked repo.
Also, results given by BackoffModel and LazyBackoff in the original implementation seem to differ, maybe because of the subtle differences in smoothing.
The backoff implementation here seems a bit inconsistent with Katz's and Ma et al.'s backoff models. Line 114 in backoff.py:
This statement adds some lower-order probability to every transition probability, but Katz's backoff model only backs off to lower-order grams when the current gram has not been observed after its history. Something like
would be more reasonable here, I guess.
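To illustrate the distinction being drawn here, below is a hedged sketch of the two smoothing styles side by side. The function names (`interpolated_prob`, `backoff_prob`) and the fixed absolute discount are hypothetical, not the repo's actual code: the first always mixes lower-order mass into every transition (the behavior being questioned), while the second falls back to the lower order only for transitions that were never observed, as Katz's model prescribes.

```python
from collections import Counter

def interpolated_prob(symbol, counts_h, lower_prob, discount=0.5):
    """Interpolated smoothing: every transition, seen or unseen, mixes in
    some lower-order probability mass (the style the issue describes)."""
    total = sum(counts_h.values())
    # Mass reserved by discounting each of the observed transitions.
    reserved = discount * len(counts_h) / total
    higher = max(counts_h.get(symbol, 0) - discount, 0) / total
    return higher + reserved * lower_prob[symbol]

def backoff_prob(symbol, counts_h, lower_prob, discount=0.5):
    """Katz-style pure backoff: use the lower order only when the
    transition was never observed after this history."""
    total = sum(counts_h.values())
    if symbol in counts_h:
        return (counts_h[symbol] - discount) / total
    # Redistribute the reserved mass over unseen symbols only, normalized
    # by their lower-order mass so the distribution still sums to 1.
    reserved = discount * len(counts_h) / total
    unseen_mass = sum(p for s, p in lower_prob.items() if s not in counts_h)
    return reserved * lower_prob[symbol] / unseen_mass
```

Both variants produce a proper distribution per history; they differ in *where* the lower-order information enters, which is presumably why the original and modified models give different results.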