lakiw / pcfg_cracker

Probabilistic Context Free Grammar (PCFG) password guess generator

l33t, multiwords(DFS), monte_carlo method #18

Closed kofny closed 4 years ago

kofny commented 4 years ago
lakiw commented 4 years ago

I want to start by saying this is amazing and thank you so much for this! I'm really floored by the work you have put into this pull request. I want to incorporate much of this into the main project, so please take all comments I make with the understanding that this project is very personal to me so I'm a bit peculiar about what goes into it, and also apologies up front that I haven't made those preferences well known before this.

Another apology is that I'm not an expert in git, so I don't know the proper way to accept some of your changes, and edit others without accepting the whole pull request. I want to give you credit for the work you did, but there are certain parts I'm hesitant to include into the main repo without modifications. I'm open to suggestions on how to best do that.

At a high level, I like the changes to the context sensitive section and will likely accept all of those changes as-is.

In regards to the multi-word detector and l33t detector, one issue I've had way in the past has been how Python's re library handled different language encodings. Still, you have a working l33t detector and that is a heck of a lot better than no l33t detector which is the current state. I'll admit it may take me a while to accept this part since I really want to read through, test and understand it, but these have the potential to be a big addition so I really appreciate it.

As to the Monte Carlo section, I'm certainly going to look at it, but this is where your changes run against some of my unspecified requirements (my apologies once again). I really want to keep the number of non-core Python libraries in this toolset limited. For example, while the chardet package is highly recommended, I took care that the trainer can be run without it as well. This is to make it easier for others to get this suite operational, and the ability to run it on a plain Python3 install can be really useful in certain cases. On that note, one Python library I've gone out of my way to avoid is numpy: while it is incredibly powerful, it also has a spotty record of installing gracefully in different environments. One option is to do something similar to how I handled chardet and make the Monte Carlo estimation optional, so there is certainly a path forward. But I wanted to explain why I'm hesitating on this section. Another hesitation, which is not fair to you, is that I've seen Monte Carlo approaches used in ways that I don't fully agree with, specifically when it comes to estimating password probability. So I'm very skeptical this approach will lead to a better password score. That being said, I am open to changing my mind, so I'll run some tests and see if this approach helps. What would be beneficial for me is if you could provide some supporting documentation highlighting the value of this approach.

Side note, I'll fully agree that moving to a status indicator like tqdm is a much nicer and cleaner option than what I currently have. That being said, referring back to my comment about not requiring external Python modules, I want to avoid forcing people to install it, but once again it might be a nice optional extra if it is already installed on the system.
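As a minimal sketch of the "optional if already installed" idea described above (not the project's actual code): the import is attempted once, and a do-nothing stand-in is used when tqdm is missing. The `count_lines` function and the `desc` label are hypothetical names chosen for illustration.

```python
# Optional dependency pattern: use tqdm when it is installed,
# otherwise fall back to a plain pass-through so nothing breaks.
try:
    from tqdm import tqdm  # third-party, optional
except ImportError:
    def tqdm(iterable, **kwargs):
        """Stand-in that ignores progress-bar options and returns the iterable."""
        return iterable


def count_lines(lines):
    """Hypothetical training loop that shows progress only if tqdm exists."""
    total = 0
    for _ in tqdm(lines, desc="training"):
        total += 1
    return total


if __name__ == "__main__":
    print(count_lines(["password", "123456", "qwerty"]))
```

The same try/except shape would work for any other optional package, so the tool still runs on a plain Python 3 install.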

To set expectations, I'll admit I've basically taken the last 4 months off of any password security research due to the ongoing pandemic and the craziness around it. I want to get back to working on this, but I don't have the mental bandwidth available right now to give the project, and this pull request you submitted, the care they deserve. That's another way of saying that I want to get your pull request integrated into this project, and I appreciate and respect the work you have put into this. But on my end work on this may be slow and sporadic, and I want to apologize for that up front.

kofny commented 4 years ago

Changes

Sorry for my poor documentation.

Changed functions and corresponding files: changes.zip

I removed numpy and tqdm from monte_carlo.py. I removed pickle and re from multiword_detector.py. I added a return clause to leet_detector.py for the case where the encoding is not ASCII.

Thank you for your prompt reply.

Monte Carlo

A password may appear several times when we generate candidate passwords using pcfg_guesser. For example, 1q2w3e4r may be treated as K8 or K7A1.

Therefore, a password may have several probabilities.

password_scorer.py gives us only one probability, and that one may not be the largest.

That's why I find as many structures of a password as possible (not all of them, since this is a trade-off), calculate the probability of each, and pick the largest one as the final probability.
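The "pick the largest probability" step can be sketched as follows. This is only an illustration under stated assumptions: the probability tables and the `parse_probability` and `best_parse` helpers are hypothetical, and the numbers are made up rather than taken from a trained ruleset.

```python
# Hypothetical probabilities; real values would come from a trained grammar.
STRUCTURE_PROB = {("K8",): 0.002, ("K7", "A1"): 0.0005}
TERMINAL_PROB = {
    ("K8", "1q2w3e4r"): 0.1,
    ("K7", "1q2w3e4"): 0.05,
    ("A1", "r"): 0.04,
}


def parse_probability(structure, pieces):
    """Probability of one parse: structure probability times each terminal's."""
    p = STRUCTURE_PROB[structure]
    for tag, piece in zip(structure, pieces):
        p *= TERMINAL_PROB[(tag, piece)]
    return p


def best_parse(parses):
    """Return the (structure, pieces) pair with the largest probability."""
    return max(parses, key=lambda sp: parse_probability(*sp))


# Two candidate parses of the same password, as in the example above.
PARSES = [
    (("K8",), ("1q2w3e4r",)),
    (("K7", "A1"), ("1q2w3e4", "r")),
]

if __name__ == "__main__":
    structure, pieces = best_parse(PARSES)
    print(structure, parse_probability(structure, pieces))
```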

Multiwords

helloworld can be treated as hello and world, while helloabc can be treated as hello and abc.

Therefore, multiwords like helloabc should also be detected.

Another problem is that theproblem may be treated as thepro and blem (found by my friend). Therefore I rewrote the detector using Depth First Search. By assigning a probability to every possible multiword split, we can pick the split with the largest probability.

Calculation of probability:

theproblem can be treated as [the, problem], [thepro, blem].

$prob(w) = \frac{num(w)}{num(\text{words with the same length as } w)}$

$$prob(multiwords) = \frac{\prod_{i=1}^{n} prob(multiwords[i])}{n}$$

where n is the number of elements in multiwords.

Dividing by n is a trick to prefer multiwords with fewer elements.
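A minimal sketch of the scoring above, assuming a plain dictionary of word counts. The counts, `word_prob_table`, `all_splits`, and `best_split` are hypothetical names and values for illustration, not the code in the attached files.

```python
from collections import defaultdict


def word_prob_table(counts):
    """prob(w) = num(w) / total count of words with the same length as w."""
    totals = defaultdict(int)
    for word, count in counts.items():
        totals[len(word)] += count
    return {word: count / totals[len(word)] for word, count in counts.items()}


def all_splits(text, probs):
    """Depth-first enumeration of every way to cut text into known words."""
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in probs:
            for rest in all_splits(text[end:], probs):
                yield [head] + rest


def best_split(text, counts):
    """Return (words, score) with the largest product-of-probabilities / n."""
    probs = word_prob_table(counts)
    best, best_score = None, 0.0
    for words in all_splits(text, probs):
        score = 1.0
        for word in words:
            score *= probs[word]
        score /= len(words)
        if score > best_score:
            best, best_score = words, score
    return best, best_score


# Made-up training counts, only to exercise the example from the text.
COUNTS = {"the": 50, "cat": 50, "problem": 20, "program": 20,
          "thepro": 1, "people": 99, "blem": 1, "word": 99}

if __name__ == "__main__":
    print(best_split("theproblem", COUNTS))
```

With these made-up counts, [the, problem] scores 0.5 × 0.5 / 2 = 0.125, while [thepro, blem] scores 0.01 × 0.01 / 2, so the first split wins.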

re

The re usage in the multiword detector can be replaced with a manual scan that splits a section into runs of the same character type:

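# `section` is the input string to split.
prev_chr_type = None
acc = ""
parts = []
for c in section:
    # classify the current character
    if c.isalpha():
        cur_chr_type = "alpha"
    elif c.isdigit():
        cur_chr_type = "digit"
    else:
        cur_chr_type = "other"
    if prev_chr_type is None:
        # the first character starts the first run
        acc = c
    elif prev_chr_type == cur_chr_type:
        acc += c
    else:
        # the type changed: close the current run and start a new one
        parts.append(acc)
        acc = c
    prev_chr_type = cur_chr_type
if acc:
    parts.append(acc)
# parts now holds the runs, e.g. "abc123!" -> ["abc", "123", "!"]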
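For comparison, the same split can be sketched with `itertools.groupby` from the standard library, which also avoids re. This is an alternative formulation, not the code in the pull request; `char_class` and `split_by_class` are hypothetical names.

```python
from itertools import groupby


def char_class(c):
    """Classify one character as alpha, digit, or other."""
    if c.isalpha():
        return "alpha"
    if c.isdigit():
        return "digit"
    return "other"


def split_by_class(section):
    """Split a string into runs of consecutive characters of the same class."""
    return ["".join(run) for _, run in groupby(section, key=char_class)]


if __name__ == "__main__":
    print(split_by_class("pass2020!!word"))
```

`groupby` only groups adjacent items, which is exactly the behaviour needed here, and an empty input yields an empty list.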

L33t

The L33t detector is used only for ASCII-encoded passwords.

Many l33t transformations have not been applied in the L33t detector yet. The next step is to apply the remaining l33t transformations.

The problem is that I have no non-ASCII password dataset, so it's really hard for me to fix this issue. Therefore, the L33t detector may be disabled if the encoding is not ASCII.

2020-07-23: We can now use DFS to find possible l33t words. The l33t transformations are here. To speed up the process, some shortcuts are applied. For example, if unleeting a word produces too many candidate transformations (>= 256), we return early. Another shortcut is that, when searching for possible l33t words, we only check whether the whole password is l33t. As a result, a l33t word that only ever appears as part of a longer password will never be found.
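The DFS with an early return can be sketched roughly as below. This is an illustration under assumptions, not the detector in feat-l33t.zip: the `LEET_MAP` table is a small hypothetical subset, the sketch always replaces a mapped character (it never keeps a digit as a digit), and `unleet` is a hypothetical name. Only the 256 cutoff and the ASCII restriction come from the description above.

```python
# Hypothetical subset of l33t substitutions: l33t char -> possible plain letters.
LEET_MAP = {"0": "o", "1": "il", "3": "e", "4": "a", "@": "a",
            "5": "s", "$": "s", "7": "t"}
MAX_CANDIDATES = 256  # cutoff taken from the description above


def unleet(word, limit=MAX_CANDIDATES):
    """Return every plain spelling of word, or [] if there are limit or more.

    Non-ASCII input is skipped, mirroring the ASCII-only restriction.
    str.isascii() requires Python 3.7 or newer.
    """
    if not word.isascii():
        return []
    results = []

    def dfs(pos, prefix):
        if len(results) >= limit:
            return  # too many candidates already: stop exploring
        if pos == len(word):
            results.append(prefix)
            return
        ch = word[pos]
        # Unmapped characters have exactly one option: themselves.
        for plain in LEET_MAP.get(ch, ch):
            dfs(pos + 1, prefix + plain)

    dfs(0, "")
    return [] if len(results) >= limit else results


if __name__ == "__main__":
    print(unleet("p4ssw0rd"))
    print(unleet("1337"))
```

A detector would then check whether any returned candidate is a known dictionary word; that lookup is left out of the sketch.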

One open problem is how to detect l33t words like p1@1s1s1w101r1d1, where users apply l33t transformations and also insert extra characters into the word. I have no idea how to speed up that search.

feat-l33t.zip

lakiw commented 4 years ago

Do you have a twitter handle or some other name you'd prefer me to call you by? I want to make sure I give you credit. For example, I'll be giving a presentation/training session at the passwords village at Defcon next week, and I'd like to mention your work. Thanks again!

kofny commented 4 years ago

Deleted