Question about lambda greedy calculation

Hi @rchatterjee,

I really like your work on typo correction! I read your 2016 paper and I've been digging through the code to try to understand it better.

I am curious about how the security loss lambda q greedy is calculated for the various checkers. After solving the best-q-guesses problem in your experiment, you sum the probabilities of all passwords in the union ball of the best greedy guesses:

https://github.com/rchatterjee/mistypography/blob/f0fb62cdc42bcd2f4e0881cdeaccfa640edd0b20/security/compute_sec_loss.ver1.py#L207

https://github.com/rchatterjee/mistypography/blob/f0fb62cdc42bcd2f4e0881cdeaccfa640edd0b20/security/compute_secloss.py#L30

I understand that the union ball would be the checked ball for the always checker, but this isn't the case for the blacklist and optimal checkers. It seems to me that lambda q greedy should instead be calculated using the checked ball, typofixer.check(password) | set([password]).

Looking forward to hearing back from you!

So, if I understand correctly, you are asking why \lambda_q^greedy takes the union over the guesses? ball(tpw) denotes the set of all real passwords for which tpw is a valid typo. If the attacker guesses tpw, they gain an advantage equal to sum([p(rpw) for rpw in ball(tpw)]). This is exactly what happens for q=1. To extend this to q>1, we need to take the union of the balls, which is done by typofixer.get_ball_union, which you can find in this line.
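To make the q=1 and q>1 cases concrete, here is a minimal sketch. The probability table, the toy ball function, and the helper names are assumptions made up for illustration; this is not the repository's code.

```python
# Toy password distribution (made-up numbers, for illustration only).
P = {"rockyou": 0.5, "rockyou2": 0.25, "Rockyou2": 0.125, "hello": 0.125}


def ball(tpw):
    """Hypothetical ball: real passwords for which tpw is an accepted typo.

    Models three correctors (swc-all, swc-first, rm-last) plus tpw itself,
    and keeps only strings that are real passwords in P.
    """
    candidates = {
        tpw,                            # exact match
        tpw.swapcase(),                 # swc-all
        tpw[:1].swapcase() + tpw[1:],   # swc-first
        tpw[:-1],                       # rm-last
    }
    return {rpw for rpw in candidates if rpw in P}


def ball_union(guesses):
    """Union of the balls of all q guesses (each password counted once)."""
    covered = set()
    for tpw in guesses:
        covered |= ball(tpw)
    return covered


def advantage(guesses):
    """Total probability mass covered by the guesses."""
    return sum(P[rpw] for rpw in ball_union(guesses))
```

In this toy setup, advantage(["rockyou2"]) and advantage(["rockyou2", "rockyou"]) are the same, because the ball of the second guess is already covered by the first; taking the union avoids counting a password twice.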
Does this clarify your doubt?

Also, typofixer.check(password) | set(password) is not correct, as password is a string, and set(password) will create a set with the characters from the password.
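A quick illustration of the difference, using only Python built-ins (the variable names are just for this example):

```python
password = "rockyou2"

# set() iterates over its argument, so a string is split into characters.
as_chars = set(password)

# Wrapping the string in a list (or using a set literal) keeps it whole.
as_item = set([password])
as_literal = {password}
```

Here as_chars holds the individual characters of the password, while as_item and as_literal each hold the single string "rockyou2".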

Thanks for your answer! It didn't quite clarify what I'm confused about, so allow me to rephrase 😄

Why does lambda q greedy take the union ball instead of taking the union of the checked passwords and the password itself?

Say we have q = 1 and we are using the blacklist checker with the blacklist shown below. Let's use the top 3 correctors: swc-all, swc-first, and rm-last. The attacker is an exact knowledge attacker and knows the password distribution. For the sake of this example, let's say that after solving the greedy weighted max heap coverage problem, the attacker guesses "rockyou2".

We have typofixer.get_ball_union(["rockyou2"]) = ["Rockyou2", "ROCKYOU2", "rockyou", "rockyou2"], but if the attacker submits rockyou2, the checked passwords will be typofixer.check(tpw) | set([tpw]) = ["Rockyou2", "ROCKYOU2", "rockyou2"].

Notice how rockyou isn't checked under the blacklist checker because it is in the blacklist. Since it's not being checked, I'm confused about why its probability is included in the calculation of the security loss lambda q greedy.

typofixer.check(tpw) | set([tpw]) is also how the weights are calculated: https://github.com/rchatterjee/mistypography/blob/f0fb62cdc42bcd2f4e0881cdeaccfa640edd0b20/security/compute_sec_loss.ver1.py#L28

Blacklist: the 10 most frequent passwords in rockyou

123456
12345
123456789
password
iloveyou
princess
1234567
rockyou
12345678
abc123
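The example can be reproduced with a small sketch. The functions below are stand-ins written for this comment, assuming the blacklist checker simply drops any corrected candidate that is in the blacklist; they are not the typofixer implementation, and for simplicity they do not filter candidates against a password distribution.

```python
# The 10 most frequent rockyou passwords listed above, used as the blacklist.
BLACKLIST = {
    "123456", "12345", "123456789", "password", "iloveyou",
    "princess", "1234567", "rockyou", "12345678", "abc123",
}


def corrections(tpw):
    """Apply the three correctors to the submitted string."""
    return {
        tpw.swapcase(),                 # swc-all
        tpw[:1].swapcase() + tpw[1:],   # swc-first
        tpw[:-1],                       # rm-last
    }


def ball(tpw):
    """Blacklist-unaware: every correction plus the string itself."""
    return corrections(tpw) | {tpw}


def check(tpw):
    """Blacklist-aware (assumed behavior): skip blacklisted corrections."""
    return {c for c in corrections(tpw) if c not in BLACKLIST}


tpw = "rockyou2"
checked = check(tpw) | {tpw}
# Passwords counted by the ball but never tested by the checker.
unchecked = ball(tpw) - checked
```

With these definitions, unchecked contains only "rockyou", which is the password whose probability I think should not be counted.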

Edit: I corrected the previous question to use set([password]) instead of set(password).

Sorry for the late reply. Let me see if I understand your question this time. If not, I will be happy to jump on a short Zoom call sometime next week. It's been a long time since I last looked closely at the code.

I think you are right: the get_ball_union function should use self.check instead of self.get_ball. The check function is cognizant of the blacklist, etc., but get_ball is not.
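Roughly, the change could look like the sketch below. This is untested against the repository; get_checked_union, sec_loss, toy_check, and the probability table are hypothetical names and made-up numbers, and check is assumed to return an iterable of the corrected passwords the checker would actually test.

```python
def get_checked_union(check, guesses):
    """Union of everything the checker would test across all guesses."""
    checked = set()
    for tpw in guesses:
        # The submitted string itself is always tested, plus its
        # blacklist-aware corrections.
        checked |= set(check(tpw)) | {tpw}
    return checked


def sec_loss(check, guesses, prob):
    """Sum the probabilities of the passwords that are actually checked."""
    return sum(prob.get(rpw, 0.0) for rpw in get_checked_union(check, guesses))


# Toy checker for demonstration: one corrector (rm-last), one blacklisted password.
TOY_BLACKLIST = {"rockyou"}


def toy_check(tpw):
    candidate = tpw[:-1]
    return set() if candidate in TOY_BLACKLIST else {candidate}


# Made-up distribution, for illustration only.
toy_prob = {"rockyou": 0.5, "rockyou2": 0.25, "hello1": 0.125, "hello": 0.125}
loss = sec_loss(toy_check, ["rockyou2", "hello1"], toy_prob)
```

In this toy run the blacklisted "rockyou" contributes nothing to loss, while "hello" is still counted because the checker would test it for the guess "hello1".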

Thanks a lot for pointing that out. I would really appreciate it if you could test this and submit a pull request.

Rahul

(In reply to Philippe Partarrieu's comment of Thu, Feb 18, 2021 at 2:36 AM.)

