Analyze one-time code entropy. is it enough?

callahad commented 8 years ago

Brian Warner (@warner) had some interesting thoughts on this domain in his PyCon talk on Magic Wormhole around the 12 minute mark.

Right now we choose 6 characters from an array of 48 candidates.

48^6 = 12,230,590,464 combinations, or around 33.5 bits of entropy.

If we want to be case insensitive, we'd drop to just 26 possibilities:

26^6 = 308,915,776 combinations, or around 28.5 bits of entropy.

Is this enough?

callahad commented 8 years ago

The Magic Wormhole stuff doesn't directly apply, but could still be useful context.

Since we don't have any out-of-band interaction, an attacker could force us to generate as many one-time codes as they want, and keep guessing perpetually, but with enough entropy and rate limiting (#70), we should be able to tune this to a reasonable rate of expected failure.

callahad commented 8 years ago

Okay, let's do this. I don't have an academic understanding of probability and statistics, so I may get some of this wrong. Confirmation would be appreciated.

Let's presume:

1×10⁵ requests per second (our server would melt well before this).
No rate limits (naïve schemes could likely be circumvented by a determined attacker, sophisticated schemes would not likely be maintainable at our scale).
Only one guess allowed per confirmation code.

There are roughly 3×10⁷ seconds in a year (365.25 × 24 × 60 × 60 = 31,557,600).

That means a determined attacker gets 3×10¹² guesses per year, or 3×10¹⁴ per century.

We want the attacker to have no more than a 1% chance every century, so the space of possible confirmation codes must be 3×10¹⁶, which is 100 times larger than the number of guesses.

Now we just need to solve for n in 2ⁿ = 3×10¹⁶

$ python3 -c 'from math import log; print(log(3e16)/log(2))'
54.735812018918956

So as long as we have 55 bits of entropy, we should be good. Provided I haven't made a significant error.

callahad commented 8 years ago

The thing I'm most uncertain of is the bit where I say "if we want you to have a 1% chance after n independent guesses, then the solution space needs to be 100 times larger than n." Is that how probability works?

callahad commented 8 years ago

If I'm doing this right, we'd need to stay case-sensitive and bump to 10 characters to reach that level of assurance :-/

onli commented 8 years ago

The thing I'm most uncertain of is the bit where I say "if we want you to have a 1% chance after n independent guesses, then the solution space needs to be 100 times larger than n." Is that how probability works?

~~Yes. At least that's how I remember it, and I let that confirm (note neither me nor my discussion partner are experts in stochastic).~~ though, reading through that again, I did not understand you here properly. Thinking again.

Your calculation looks alright, but it is ignoring that the OTP is only good for some time and then changes. This resets the work of the attacker, meaning we can get away with less bits.

Let's look what we currently have: https://duckduckgo.com/?q=((((31%5E8)+*+0.01)+%2F+100000)+%2F+60)+%2F60&t=ffab&ia=answer

31^8 are our combinations
to get the 1% chance of getting the right one, I'm multiplying with 0.01
now, to see how many seconds that would take when assuming 100,000 requests per second, i'm dividing by 100k
result is equal to ~1400 minutes, that's the division by 60. That is around ~23 hours.

That is how long one otp is good with one single attacker and not wanting more than a 1% chance.

I still think we can add rate-limiting here in a reasonable way, it will reduce the 100,000 significantly.

Also, if he attacks 10k email addresses of one single site he is targeting at the same time, he still has only 100,000 requests per seconds, giving him 10 requests per email address per second. This does not increase his chance of getting lucky and finding one time the right token. In this scenario, without rate limiting, he has a better chance of attacking one single address (I admit currently not knowing how to calculate properly what exactly his lowered chance is. Working on it).

onli commented 8 years ago

Only one guess allowed per confirmation code.

Okay. I did not get you there. The thing is: My calculation assumes that the otp, the code, stays the same. In your scenario the attacker changes the confirmation code each time. That moves the attacker into the worst case, as that reduces his chance of finding a code. I did not expect that ;)

The thing I'm most uncertain of is the bit where I say "if we want you to have a 1% chance after n independent guesses, then the solution space needs to be 100 times larger than n." Is that how probability works?

I don't think so. Each time you do the attack, when you reset the code each time, you have the same probability. Imagine this with small numbers: 3 independent guesses. If your solution space is 300, your chance is not 1%, it is lower – each time you had a chance of 1 in 300 to get the number right. So what we want is the probability of having at least one hit. That would be 1-(1-p)^n), which according to https://duckduckgo.com/?q=(1-(1-0.003)%5E3)&t=ffab&ia=answer is 0.009.

onli commented 8 years ago

So what we want is the probability of having at least one hit.

That is called minimal probability, at least when translating from german. With that you can calculate how many tries you need to have a given probability. ln(1-a)/ ln(1-p), with a being the wanted minimal probability and p being the probability of a hit. That gives us http://www.wolframalpha.com/input/?i=(ln(1-0.01)+%2F+ln(1-(1%2F(31%5E8)))), which looks nice, but not when comparing this with the number of tries we have per day: http://www.wolframalpha.com/input/?i=((ln(1-0.01)+%2F+ln(1-(1%2F(31%5E8))))++%2F+(100000+*+60+*+60+*+24)) = 0.99.

What does that mean? If our interpretation is right, it means that to have a 1% chance of a random hit when always changing the code you still need only one day. That's… very surprising to me.

The formula allows to play with it. To reach your 100 years, I need to go to 11 bits, well, it reaches 81 years, close enough. http://www.wolframalpha.com/input/?i=((ln(1-0.01)+%2F+ln(1-(1%2F(31%5E8))))++%2F+(x*+60+*+60+*+24*+365))+%3D+100 shows that we can stay with 31^8 when allowing a rate per second of not more than 2.71 for the 100 year goal.

callahad commented 8 years ago

Another way to check this is with R's dbinom function, which tells you the probability of getting x successes after a series of k trials, where your probability of succeeding at any one trial is p.

                   __ Number of attempts
                  /
P(x) = dbinom(x, k, p)
              /      \
 Successes __/        \__ Independent probability of each attempt

For instance, if you were trying to guess a single coin flip, you'd get it wrong .5 of the time:

> dbinom(0, 1, 1/2)
[1] 0.5

If you were guessing a series of 4 coin flips, and wanted to see how often you'd get all of them wrong:

> dbinom(0, 4, 1/2)
[1] 0.0625

Or, to flip it, what are the chances you'd guess at least one of them right?

> 1 - dbinom(0, 4, 1/2)
[1] 0.9375

To apply this to our confirmation codes:

Each guess has a probability p of 1 / n^m, where n is the number of distinct characters, and m is the length of the code.
An attacker gets as many guesses in a year, k, as r × 60 × 60 × 24 × 365.25, where r is their rate of guesses per second.

So, if we presume a rate of 100,000 guesses per second, and 26 case insensitive characters, then an attacker would have the following chance of guessing an 8-character long code over the course of a year:

> 1 - dbinom(0, 1e5 * 60*60*24*365.25, 1/26^8)
[1] 0.9999997

Doh.

Options:

Lengthen the code
Limit the rate of guessing
Change our fundamental assumptions

...Or any combination of the three. Is a 1% chance in 100 years after a sustained barrage the right metric, or is something lesser fine? Should we come at it from the other angle of "how safe can we make 8 characters?"

callahad commented 8 years ago

Oh, we didn't need to reinvent this wheel. Sandstorm is open source, and also offers an email-token login system. They choose 12 characters from a case-sensitive list of 55 options, tokens are valid for up to 30 minutes.

That means their entire solution space is 55¹², just over 69 bits of entropy, and an attacker at 100,000 requests per second could test 1.8 × 10⁸ possibilities every 30 minutes, before the token expires. They only hit a 1% risk of failure after 100,000 years. This is equivalent to 42 bits of entropy with a 2 guess / hour rate limit.

> 1 - dbinom(0, 30 * 2 * 24 * 365.25 * 1, 1.8e8/55^12)
[1] 1.235586e-07
> 1 - dbinom(0, 30 * 2 * 24 * 365.25 * 10, 1.8e8/55^12)
[1] 1.235585e-06
> 1 - dbinom(0, 30 * 2 * 24 * 365.25 * 100, 1.8e8/55^12)
[1] 1.235578e-05
> 1 - dbinom(0, 30 * 2 * 24 * 365.25 * 1000, 1.8e8/55^12)
[1] 0.000123551
> 1 - dbinom(0, 30 * 2 * 24 * 365.25 * 10000, 1.8e8/55^12)
[1] 0.001234823
> 1 - dbinom(0, 30 * 2 * 24 * 365.25 * 100000, 1.8e8/55^12)
[1] 0.01227984
> # Roughly equivalent to rate-limited 42 bits of entropy:
> 1 - dbinom(0, 30 * 2 * 24 * 365.25 * 100000, 1/2^42)
[1] 0.01188772
> # Of course, disallowing multiple guesses is still better:
> 1 - dbinom(0, 1e5 * 60 * 60 * 24 * 365.25 * 100000, 1/55^12)
[1] 0.0004117772

Do we need a full 69.3 bits of entropy? Probably not. i'd be happy with 55, which would put us in the right ballpark for a century:

> 1 - dbinom(0, 1e5 * 60 * 60 * 24 * 365.25 * 100, 1/2^55)
[1] 0.008720745
> # Though we'd need 56 or 57 bits if we wanted to allow 3 guesses per code:
> 1 - dbinom(0, 1e5 * 60 * 60 * 24 * 365.25 * 100, 3/2^55)
[1] 0.02593474
> 1 - dbinom(0, 1e5 * 60 * 60 * 24 * 365.25 * 100, 3/2^56)
[1] 0.01305256
> 1 - dbinom(0, 1e5 * 60 * 60 * 24 * 365.25 * 100, 3/2^57)
[1] 0.006547715

So, I'm back to the initial conclusion: we need 55 bits of entropy. How do we get there?

callahad commented 8 years ago

There are some word-based systems that are interesting, but I don't think any of them hit our usability needs, with the possible exception of S/Key or Mnemonic:

Seven PGP Words, for 56 bits of entropy.
Six S/Key words, for 64 bits of entropy.
Six Mnemonic words, for 64 bits of entropy.
Five Diceware words, for 64 bits of entropy.

Otherwise, we're stuck with some kind of random string of characters. We want to avoid ambiguous characters. Stack Overflow has some good suggestions, and there's even some academic research on the topic:

In research conducted at Bell Labs, some symbols were more vulnerable than others to being misidentified.1 The letters l (el) and 1, O and 0, Z and 2, and 1 and 7 accounted for more than 50% of the errors.

Other options:

Fourteen hex characters (56 bits)
Eleven characters from a 32-character set (55 bits)

I'm fond of that last option, and there are a few great alphabets to choose from, including Crockford Base32 and z-base-32.

Let's just go with one of those. I really appreciate Crockford's explicit treatment of case-, oO0-, and 1iIlL-insensitivity (oO0/1iIlL), but it still risks a few ambiguities like 5/S and 2/Z. On the other hand, z-base-32 is very well thought-out and explicitly avoids most of ambiguity, only risking confusion between 1 and i. Thus, I favor z-base-32's alphabet. And since eleven is an odd length, let's bump it to twelve.

This also gets us up into the ~3700 year range:

> 1 - dbinom(0, 1e5 * 60 * 60 * 24 * 365.25 * 100, 1/32^12)
[1] 0.0002736811
> 1 - dbinom(0, 1e5 * 60 * 60 * 24 * 365.25 * 3700, 1/32^12)
[1] 0.01007647

We could even drop up to six problematic characters while still being fine on the century timescale:

1 - dbinom(0, 1e5 * 60 * 60 * 24 * 365.25 * 100, 1/24^12)
[1] 0.008603874

Proposed resolution:

Twelve characters, chosen from z-base-32's alphabet.
Rendered as two groups of six characters.
Compared case-insensitively on our backend.

A few examples of what these might look like:

eyuhsz 9dceet, dumpgk cogmyg, scjrxx patkq8, nhcw57 trib38, 3779cm 1wuzci, w6ao4e dbk15s, z55ui9 7ejogp, jofmkr 4zhbwk, 1tienw 78xrja, 9m9btn 9te5c6, xyhbbi em5m51, z6os1a ptkbq4.

All of those look reasonable. Having vowels in there does risk vulgarity (fuckin wanker), but... variety is the spice of life?

onli commented 8 years ago

So, I'm back to the initial conclusion: we need 55 bits of entropy. How do we get there?

I don't agree. In the sandstorm 12 character link, the comment is leaning towards handling brute force attackers. And I think we should do that. We can stretch our code as far as reasonable, we could also make it ^9 or ^10 or even ^12 (formatted as 2x6, like you said). But we should handle the brute force case anyway:

It heightens our security
It can reduce load on the server

We can start with simple stuff. Something like "1 request per email per second". We can combine that with an IP rate limiting later, reasonable big for the nas scenario. Both would get the 100,000 per second in our upper bound down, and that will help a lot in here.

1 The letters l (el) and 1, O and 0, Z and 2, and 1 and 7 accounted for more than 50% of the errors.

The good thing is that the only tokens of that we can have when staying lowercase is the 1, 7 and l, as all our tokens are lowercase so far. We can use o and z (and already did add the o and some others in https://github.com/portier/portier-broker/pull/100/commits/f2029125992d34b2bb8e67b1f3528db16b765e23). Maybe we can add some more? 5 and b. Also symbols like -_+#?=()&%$§! ?

The code is a fallback, which will only be used when the link does not work and there is no OpenID login. Does not have to be the most pretty in the world.

Edit: http://www.crockford.com/wrmg/base32.html looks good. Pretty close to what we have already. The question is whether we want to have letters replacing other letters they could get confused with. Does it really help to not use 0 but O because they could get confused, when in the end the user still has to guess which one we picked?

stephank commented 8 years ago

Random thought: what if we make a full 128 bits secret for the link, and do TOTP on top of it for the code?

We can create an extra route that takes the secret and generates a GIF, which we can include in the email.

An attacker would have to find the secret, or solve the TOTP in 30 seconds.

It has its issues. What does the GIF do once the TOTP expires? Can we make it work?

callahad commented 8 years ago

@onli:

I agree, we should absolutely add some rate limiting, but my point is that 55 bits should be sufficient even without. And 12 characters gets us 60 bits, which is even better than 55.
No symbols, they're pain in the ass to type on a mobile keyboard.
I strongly prefer z-base-32 to Crockford Base32 for legibility, ease of transcription, and ease of communication. Compare some examples from Crockford's set:

8QEVW5 V4MTKD, 08WE39 E3D85X, NAQX8G X1A071, 0QN7EM PRSCHY, 6WSACF 3TGSZD

To ones from z-base-32:

om38yu jqn94j, utb5n8 kp8g5x, 79gd5r qf7nim, affc3g xf35j3, z8hwr8 yzpjpu

I have a much easier time understanding z-base-32 at a glance; it was effectively designed for exactly this purpose.
"The code is a fallback" I disagree; glancing at my phone and copying the code is my primary way of logging into the demo with non-gmail accounts.

@stephank:

I don't think we can make a TOTP work, both due to complexity in communicating it (we'd effectively need to run js in the email), and because it only has about 20 bits of entropy, meaning an attacker would only need a day at 10 requests per second to have a 56% chance of breaching a specific account. And that's presuming that the TOTP code gets invalidated every time there's a wrong guess. If we leave it open for the fully 30 seconds, a rate of 1 request per second gives an attacker a 92% chance of success over the course of a day.

As long as it's possible for entering a manually-transcribed code to be sufficient to authenticate, then the fundamental security of the system relies just on the entropy in that code.

Stepping back, any specific opposition to the 12 characters from z-base-32 proposal?

onli commented 8 years ago

Stepping back, any specific opposition to the 12 characters from z-base-32 proposal?

I got that right that the z-base 32 alphabet is all letters and numbers but 0, l, v, and 2? That would be close to what we already have and I'm perfectly fine with that.

On the 12 characters I'm conflicted. I would prefer to keep the code a bit shorter, like 10. On the other hand, I just wrote that I think the code is more a fallback (objection to that of course registered :) ). And it is nice being able to rely less on rate-limiting, also when going into scenarios you mentioned like having multiple email addresses for one target, and having a distributed attack base. So no, I also don't oppose the key length, or have a good argument against it.

callahad commented 8 years ago

I got that right that the z-base 32 alphabet is all letters and numbers but 0, l, v, and 2?

Yep!

alpha/num: 0123456789abcdefghijklmnopqrstuvwxyz
z-base-32:  1 3456789abcdefghijk mnopqrstu wxyz
(missing): 0 2                  l         v

warner commented 8 years ago

Hey, sorry for being late to the party.

probability: yeah, if you only allow one guess per email/code, then all the trials are independent with probability 1/possible_codes, and your analysis is correct: max_rate / possible_codes = mean_breaks_per_second. The binomial distribution tells you how that splits out into individual "N breaks per X seconds" bins, but it'll be dominated by the "1 break" bin (if you're talking 60 bits per code, then the chances of 2 breaks in a given period is 1/2^60 of the chance of 1 break).

This is an "online" attack, which is a lot easier to deal with than "offline" attacks (single-use short-term codes aren't generally considered interesting to defend against offline attacks, like someone breaking into the server and stealing a copy of the database, since people usually assume such attacks take a long time to use the stolen data, by which point it will have been expired).

Using server performance as a built-in rate limiting is handy, but consider what happens if/when you improve that (spreading across multiple servers, etc). Perhaps add a note that says "don't optimize this code beyond X req/sec without adding explicit throttling of some sort). You should be clear on whether the performance limit is imposed by the server's ability to accept/test guesses, or its ability to send emails with codes (and remember that an overloaded server might have tried to send a billion emails, all of which got dropped or throttled, but the server doesn't know that so it's still willing to accept the codes, so really "emails sent" isn't as much of a performance bound as "emails we tried to send").

Expiring the code after 30 minutes prevents the "leftover/ignored email gets discovered a month later by attacker" threat. I don't think I'd include it in the rate-of-break analysis though. I don't know the details of your system, but I imagine the attacker gets to provoke new codes as much as they like (i.e. there's no additional layer of authentication or rate-limiting sitting in front of this code/email/validate flow).

If you accepted unlimited codes per email, and only paid attention to an individual user, yeah, 30min * ratelimit / codespace gives you the probability of a single attempt being broken. But you care more about the chances of any user getting broken within some large time period, and there can be lots of concurrent 30min attempts taking place. (I think your "1% risk of failure after 100,000 years" might be assuming only one login per 30 minutes, but I haven't read it carefully). So I'd analyze it as a collection of independent single-use codes.

If you accept multiple guesses per email, then increase the codespace to match, to meet the same overall break rate. (that's not 100% accurate, the attacker has trivially more than a 3x chance of breaking by using 3 guesses, because the guesses are distinct and won't ever accidentally be the same, but it's negligible). (also the performance-bound rate limit might still apply, depending upon

When choosing the symbol set for the codes, consider the transcription environment. It sounds like you're doing cut-and-paste from an email (as opposed to verbal transcription like in magic-wormhole), but of course email on mobile is the traditional weak link, worst case requiring someone to briefly memorize the code while they switch apps. On alphabets, my personal feeling is that the receiving side should map all the similar characters down to the one canonical example in that class (e.g. the sender only ever uses "o", but the receiver maps all of "oO0" into "o", same for 1ilIL, maybe 5sS and 2zZ). Similarly, downcase everything before comparison. Obviously the sender should not emit semantically-distinct visually-similar characters, but if the receiver doesn't also merge them, then you're still forcing the user to play a game of "guess what the round circle means", where the receiving application emits a "bad character, please use one of [LETTERS INCLUDING AMBIGUOUS ROUND CIRCLE]" error instead of "WRONG PASSWORD, START OVER".

For transcribed codes, it's nicest if the user doesn't need modifier keys to type it in, which rules out many of the potential symbols (as well as upper-case letters). Most symbols cause word-breaks when doing double-click-to-select-word, which rules out more. Transcribing symbols is a hassle, which rules out the rest. I'd stick with some flavor of base32, or maybe even base30 if you want to avoid the problematic 5sS/2zZ cases (but implementing base(non-power-of-two) is a hassle). 5sZ/2zZ probably matters more for handwritten transcription, given the height differences of the numbers vs lowercase letters, but it also depends upon your font and how many characters are presented next to each other ("2" in isolation doesn't give you a sense of where the baseline is, or how tall the rest of the font would be).

The Signal folks recently added "Safety Numbers", which are all base10 (no letters), and I think their argument was i18n, since roman letters weren't the default for a lot of their intended userbase.

Human brains love grouping (or maybe it's that human brains hate long strings of symbols). magic-wormhole uses hyphens to join the words, but I've watched people get that wrong when transcribing it (one person says "FOUR PURPLE KUMQUAT" and the other person types "four purple kumquat" instead of "4-purple-kumquat"). If possible the "shape" of the code should be fixed, and the thing you're typing it into should know that shape, and add in the right punctuation. Also, if you do that, with separate little boxes for the sections, and the JS auto-focuses the next box after the previous one is filled, then please please also make the JS ignore spaces typed into those boxes. I've hit some really annoying web forms where my typing of the literal code (spaces/hyphens and all) interacted badly with the designers expectation that the user would forget to type the punctuation.

10 characters of base-32 is 50 bits, performance-bound rate limit of 1000 emails/guesses per second gives mean break rate of 1/1e13s = 1/360k years.

I don't see how TOTP would help you here, since you're sending an email instead of using a previously-provisioned app. (I like TOTP, don't get me wrong, but as far as I understand your use case, it doesn't seem like it fits).

callahad commented 8 years ago

Implemented in #103, closing.

portier / portier-broker

Analyze one-time code entropy. is it enough? #69