Does a uniquely decodable word list provide a false sense of security?

sts10 / uniquely-decodable

Some implementations of the Sardinas–Patterson algorithm in Rust

MIT License

2 stars 1 forks source link

Does a uniquely decodable word list provide a false sense of security? #1

Open bwbug opened 2 years ago

bwbug commented 2 years ago

Not to take away from the interesting work that you have been spearheading towards the goal of making uniquely decodable word lists, but I wanted to raise a bit of a counter-point, as food for thought:

The premise of these efforts is to ensure that you guard against entropy loss that can occur if a delimiter-free passphrase of k words from your word list can be decoded as a different concatenation of m words from the list, where m < n. For example, the ability to decode a 5-word passphrase as a 4-word passphrase created from a 7776-word Diceware list would reduce the passphrase entropy by almost 13 bits.

Allow me to throw a wrinkle into the above premise: An adversary is not required to use the exact same word list that you used to create your passphrase! For example, they may use something like the Oxford 3000 list or any other corpus of commonly known words. If you are a high-value target up against a nation-state threat actor, they may use your known word list plus several other lists; likewise, if your passphrase has fewer words than it should, it would be vulnerable to attack using expanded word lists. Therefore, in practice, the problem you face is to guard against the possibility that your passphrase can be decoded into words on the attacker's list, not that it can be decoded into other words on your own list.

For example, say you have a 3-word passphrase spar-kin-puts from a uniquely decodable word list, but you choose not to use delimiters (or capitalization patterns): you get sparkinputs. Now, if your adversary uses a word list that contains the words spark and inputs, now all of a sudden, your entropy has gone down significantly. In fact, if you have a 6-word passphrase constructed from a 7776-word diceware list, then an attacker would still be at an advantage if they use a word list that adds 38,880 additional words to your uniquely decodable list, if by so doing, they can decode your passphrase into a smaller number of words.

Perhaps there's some aspect of this that I haven't thought through properly, so I am open to discussion and counterarguments. For now the bottom line (at least for me) seems to be that word delimiters should always be used. And if you do use word delimiters, then your word list can safely include prefix and suffix words, so fewer words to eliminate!

Thoughts?

sts10 commented 2 years ago

I appreciate the question! I'm not an expert, so I could have fundamental misunderstanding!

I'm not 100% I have an answer to this particular question, but I can share some thoughts.

First of all, there are of course many ways an attacker could guess sparkinputs -- likewise many ways a user could select that password. They could have gotten it as 2 words from a list of 10 words (6.6 bits) or of 1 million (39.9 bits). They could have asked a password generator for 11 random letters, numbers, and symbols (65.9 bits), or from just lowercase letters (~49 bits). They could be a spark-plug specialist! Heck, the attacker could be armed with the exact word list the user used and guess words randomly and get it on the first try. We can't prevent attackers from having dumb luck. Instead, I think the best we can do is deal with probabilities (we're asking: what are the chances than attacker guesses daytimerockslidegrievance?).

One protection against this problem is to ensure that a list is "above" what I call the Brute Force Line -- this at least covers the "just lowercase letters" attack. Could an attacker guess sparkinputs early on in their lowercase letter brute-force attack? Yes. But we can at least know that those odds were less than them guessing spark + inputs using our word list.

What about an attacker using a different word list? How can that affect this kind of luck?

Let's start with the case where the attacker is using a word list that is a super-set of the one the user used (by which I mean it's a list longer than the one the user used, but contains all of the words that the user's list has). I think we describe such a case:

If you are a high-value target up against a nation-state threat actor, they may use your known word list plus several other lists

Now, if your adversary uses a word list that contains the words spark and inputs, now all of a sudden, your entropy has gone down significantly... In fact, if you have a 6-word passphrase constructed from a 7776-word diceware list, then an attacker would still be at an advantage if they use a word list that adds 38,880 additional words to your uniquely decodable list, if by so doing, they can decode your passphrase into a smaller number of words.

I think it's clear that if an attacker is using a longer word list, it's going to take that much longer to brute-force a password. In this case, our passphrase entropy is actually higher than log₂(list length). In more concrete terms, if an attacker is cranking through 40,000⁶ possible passphrases, I'm feeling a lot safer than if they know I used the EFF list and only have a space of 7,776⁶ to get through. [EDIT: My bad math.]

(Now of course, if they're using a list that's our list plus, rather miraculously, "sparkinputs" then we're in trouble. I guess I just call this luck I can't defend against?)

What about if the attacker is using a shorter word list that's a subset of our user's list? Maybe "spark" and "inputs" happen to be on the 2,000-word list they're using in their brute force attack. Or another example: we had "news" and "paper" on our uniquely decodable list and got "newspaper" as 2-word passphrase, but the attacker happens to have "newspaper" on their list. Again, this is luck we can't prevent. Of course if one of your words is not on their list, you're in a good position.

Another way to think about this: If you're an attacker and you knew for sure that I have a 6-word passphrase I randomly generated from the EFF list, how would you go about attacking it? I contend you'd start a brute-force attack using words from the EFF list and no words not on the EFF list.

Would you add words? I guess you're saying, "I bet I can add a 7,777th word that is two words of my passphrase combined." Sure, you might add "newspaper" but I'm just as likely to have "papernews" or "newselephant".
Similarly, it'd be a strange strategy to "gamble" by taking, say, 2,000 words from the EFF list and only trying combinations of those words. In a sense you're just splitting the chance of the guess into two steps (the initial subset selection, then the brute-force guessing among that subset).

How does this relate to uniquely decodable word lists? Maybe the idea is that using a passphrase without a delimiter (even from a uniquely decodable word list) increases the attacker's chances of "getting lucky." I don't know!

bwbug commented 2 years ago

@atoponce Welcome to the discussion. FYI, despite his claims of not being an "expert", @sts10 is quite well versed in the intricacies of prefix codes, suffix codes, and uniquely decodable codes, and has already produced some impressive tools for making word lists that are free of such entropy-reducing problems (see the repo for his Tidy project). You might also find the discussion in Issue #13 of the Tidy repo to be interesting.

sts10 commented 2 years ago

Ha, I tagged @atoponce on Mastodon to get his input on this question. Hoped he could lend a fresh perspective.

As a point of clarification: I've come to see prefix codes as being a subset of "uniquely decodable codes". As an example, a suffix code can have prefix words in it but still be a "uniquely decodable code". (Here's an example.) Another way is to make all code words the same length!

As I wrote in the blog post bwbug links to:

In the past, I conflated two things. I thought saying a word list was “free of prefix words” was the same as saying it’s uniquely decodable. But I now see that removing all prefix words is simply one method of guaranteeing that the resulting code is uniquely decodable.

Understanding this difference is important for making room in our thinking for alternate procedures of creating uniquely decodable lists.

In other words, the neat thing about prefix codes that we like for our word lists -- effectively what we've been chasing -- is that they are uniquely decodable (and can safely -- we think -- be concatenated without a separator). And if you think about it, "uniquely decodable" is a nice term for it: we want combinations of words, without separators, to be "decodable" in one and only one way. Removing prefix words is one procedure for making a word list uniquely decodable, but importantly it's not the only way. For this reason, I've tried to use "uniquely decodable" as the goal, rather than "prefix code".

Now, back to our hypothetical attack!

If an adversary has a much larger list that is able to prefix one word with another...

A hopefully clarifying question: So are we saying that if we randomly select 2 words from a list with "news" and "paper" on it (but not "newspaper"), it's worse "luck" to get "newspaper" as opposed to "papernews" because "newspaper", since it happens to be an English word, is more likely to be on an adversary's list of words they're brute-forcing through? And thus it's always safer to separate words, e.g. "news-paper"?

Also, unless I'm missing something, an adversary "[prefixing] one word with another" sounds the same as what any adversary would do if they had your exact word list and knew you didn't use a separator: slamming all the words together in each possible combination.

Thanks for an informed discussion!

bwbug commented 2 years ago

@sts10

First of all, there are of course many ways an attacker could guess sparkinputs -- likewise many ways a user could select that password.

I think that it is a pretty universal assumption in these types of security analyses that the attacker knows that you are using a passphrase, knows the word list, and knows the number of words in the word list. That's how we're able to quantify the entropy of a 6-word passphrase drawn from a 7776-word word list as 6·log₂7776 = 12.925 (which is just another way of saying that the attacker is guaranteed to find your passphrase by hashing at most 7776⁶ = 2.2×10²³ guesses).

In more concrete terms, if an attacker is cranking through 6⁴⁰⁰⁰⁰ possible passphrases, I'm feeling a lot safer than if they know I used the EFF list and only have a space of 6⁷⁷⁷⁶ to get through.

I think you meant 40000⁶ vs. 7776⁶. My point was that if the attacker's superset word list contains prefixes, suffixes, or other "problematic overlaps" for words in the target passphrase, then they can get away with search only for 5-word combinations, so we're really comparing 40000⁵ vs. 7776⁶. And guess which set is smaller (i.e., faster to check): it is the attackers 40000⁵ set! The attacker's superset word list would have to contain 46,656 words for the number of 5-word passphrases to break even with 6 words from the 7,776-word list.

To know that you're immune against such an attack, you would have to ensure that your delimiter-free passphrase could not be decoded into words that appear in the superset word list.

What about if the attacker is using a shorter word list that's a subset of our user's list? Maybe "spark" and "inputs" happen to be on the 2,000-word list they're using in their brute force attack.

That's a great question, and one that I have actually thought about a bit. I will summarize my thoughts on that topic some other time and place though, since it is not related to uniquely decodable codes or to delimiters.

bwbug commented 2 years ago

@atoponce

If an adversary has a much larger list that is able to prefix one word with another, thus reducing the security by ~13 bits, the passphrase still has 65 bits (5 words) or 78 bits (6 words).

That is why one of the scenarios in which the superset attack is relevant is if you're a high-value target, and/or the attacker is a nation-state threat actor -- in which case it's not impossible that the resources required to crack 65 bits may be leveraged in an attack. However, even for an "average" user, if they have gone through the trouble of selecting their passphrase strength based on a rational analysis of their threat model, then they could reasonably be expected not to be pleased if their passphrase entropy drops by double digits due to an unexpected attack mode!

Another scenario is one in which a garden-variety hacker with a few GPUs has a leaked copy of a password hash database and wants to go after the low-hanging fruit. They may design an attack targeted at users who have a 4-word diceware passphrase, by creating a superset word list, and they could expand their word list size almost 20-fold before they lose their advantage.

So the requirement here would be the passphrase generator not allowing stupidly low security margins

Good advice, but probably not realistic.

@sts10

A hopefully clarifying question: So are we saying that if we randomly select 2 words from a list with "news" and "paper" on it (but not "newspaper"), it's worse "luck" to get "newspaper" as opposed to "papernews" because "newspaper", since it happens to be an English word, is more likely to be on an adversary's list of words they're brute-forcing through? And thus it's always safer to separate words, e.g. "news-paper"?

Yes and yes! And especially because we would have to be able to determine by inspecting our passphrase whether any alternative decoding can produce a word that is likely to exist on some superset word list. In the case of newspaper, this may be easy to catch, but other times we may miss the presence of an alternative decoding. So absolutely, it is always safer to use a delimiter (incidentally, is there any major downside to it?).

Also, unless I'm missing something, an adversary "[prefixing] one word with another" sounds the same as what any adversary would do

I think what Aaron meant by that was just the scenario that we have been discussing: that the adversary's list may contain words for which your own list has prefixes, suffixes or other overlaps.

sts10 commented 2 years ago

I just built a little toy project that might help us shed some light on this issue.

It creates 1 million 3-word passphrases from a short word list of 4 words: "news", "paper", "elephant", "music". These represent the user's possible passphrases.

Then it compares the efficiency of two distinct attacks: One in which the attacker randomly chooses 3 words from the user's exact word list, and another in which the attack randomly chooses from a super-set word list of: "news", "paper", "elephant", "music" and "newspaper".

Here are the results I got:

Using user's exact word list and guessing randomly: Over 1M cracks, mean number of guesses was 63.9991
Using a super-set word list and guessing randomly: Over 1M cracks, mean number of guesses was 124.9907
Using user's word list and a three-word brute-force procedure: Over 1M cracks, mean number of guesses was 32.4941
Using a super-set word list and a three-word brute-force procedure: Over 1M cracks, mean number of guesses was 47.4968
Using user's word list and a two-word-then-three-word brute-force procedure: Over 1M cracks, mean number of guesses was 48.5145. 12.5076% of cracks took fewer than 25 guesses.
Using a super-set word list and a two-word-then-three-word brute-force procedure: Over 1M cracks, mean number of guesses was 68.6642. 12.4708% of cracks took fewer than 25 guesses.

This seems to show that the super-set attack is less efficient than using the exact word list.

I counted the percentage of cracks that took fewer than 25 guesses because I could see an argument that while the mean number of guesses for the super-set attack might be higher, the number of "early" cracks would be higher. At least according to my blunt and totally arbitrary assumption that fewer than 25 guesses is "early", that doesn't seem to be true -- a slightly higher percentage of cracks took fewer than 25 guesses when the code used the user's word list.

bwbug commented 2 years ago

One important point is that the concern that I have is not the general vulnerability of a randomly generated passphrase (your sample of 1 million). For this case, I am not surprised that a supserset attack will be less efficient.

Instead, the scenario I'm thinking about is the vulnerability of one specific passphrase (namely, my passphrase). For sake of argument, assume that I cannot rely on my own vocabulary and decoding abilities to determine whether or not I have a potentially vulnerable passphrase, and I naïvely rely on the expectation that my 3-word passphrase should have 6-bit entropy, and therefore require an average of 32 guesses for an adversary who knows my word list and the number of words in my passphrase (again, common assumptions in passphrase entropy estimation).

Referring back to your "clarifying question" in a previous post -- we are quantifying the vulnerability of the "unlucky" user who unknowingly ends up with a vulnerable passphrase.

Thus, for the above scenario, you should set the user's passphrase to one of newspaperelephant, newspapermusic, elephantnewspaper, musicnewspaper, newspapernews, newspaperpaper, newsnewspaper, or papernewspaper. Then have the attacker guess using the superset word list, and start with two-word combinations. For this attack mode, the average number of guesses to crack the passphrase should be 12.5 (i.e., a 61% reduction in the required cracking time compared to what the user was expecting for their supposedly 6-bit passphrase).

It might be interesting to also have the attacker randomly switch between two- and three-word guesses from their superset.

sts10 commented 2 years ago

Thus, for the above scenario, you should set the user's passphrase to one of newspaperelephant, newspapermusic, elephantnewspaper, musicnewspaper, newspapernews, newspaperpaper, newsnewspaper, or papernewspaper. Then have the attacker guess using the superset word list, and start with two-word combinations.

This is an interesting hypothetical. As expected this user suffers a worse fate against the "super-set" attack:

Attacker against a user who got "unlucky" choosing his or her passphrase

Using user's word list and a two-word-then-three-word brute-force procedure against a very unlucky user: Over 1M cracks, mean number of guesses was 32.2529. 62.4972% of cracks took fewer than 25 guesses.
Using a super-set word list and a two-word-then-three-word brute-force procedure against a very unlucky user: Over 1M cracks, mean number of guesses was 17.5060. 100.0000% of cracks took fewer than 25 guesses.

But I'm still not convinced either way!

phoerious commented 2 years ago

Hi, I got pointed to this issue by @sts10 and these are my two cents.

Overall, this is an interesting discussion, but it hinges on some misconceptions. In summary I would say that unless you have loads of prefix words, it doesn't really matter too much.

The first misconception is about the attacker's knowledge. The attacker doesn't know anything about whether and, even less so, which separators you used. I feel like this is a common misapplication of Kerckhoff's principle. The idea behind the principle is that the algorithm is known and only a specific secret is not, but separators are not part of your algorithm, they are part of your secret. My whole answer could end here, but let's assume the attacker magically got wind of the fact that you like separator-less passphrases.

So assume you have a three-word passphrase, which (without separators) could also stem from two other words on the list. If the attacker knows exactly that your passphrase is three words, then the optimal strategy would be simply to try all three-word phrases, no matter if they have prefixes or not.

In reality, the attacker wouldn't know how many words your phrase has (and again: I think assuming an attacker possesses that kind of knowledge is a misapplication of Kerckhoff's principle). In this scenario, they could gain an advantage by compressing the list of combinations when trying all one-, two-, three-, four-word phrases etc. by eliminating collisions between them (note: they would not gain an advantage by building some sort of superset of words, but by building a subset of combinations up to a certain length). This advantage, however, is very small if the word list is long and the individual word lengths don't differ massively. The chances of more than one collision in a seven-word phrase are rather low to begin with, since seven is a small fraction of your total words and with similar word lengths, it's not likely that you would find many valid composites anyway. Yet it's large enough that you benefit from the exponential growth of the search space. In the worst case, you could halve the number of words (assuming word lengths don't differ by more than a factor of two), but that would require a stupidly long word list or an extremely long phrase, in which case it wouldn't matter too much anyway. In theory, you could also have any combination of single- and double-length (or even fractional) composites in between n and n/2 words (a b c d; ab cd; a bc d; ab c d; a b cd), but I think a passphrase that allows arbitrary splits like that which can also all be found in the original word list is a pretty rare unicorn.

The other misconception is more general about how entropy is used to assess password strength. Entropy is a measure of information content of a random source, not of a password. Hence any discussion about the "entropy" of a password is completely nonsensical. The entropy of any single password is always 0. Entropy tells you how much information you are missing on average to predict the outcome of a seemingly random process accurately (or in Shannon's terms: how many binary questions you need to ask in order to explain the mechanism behind a random variable). For passwords, entropy can be misused as a proxy for password strength, but it's a very bad estimate and a common source of confusion. If you say a password has 120bits of entropy, you are actually talking about the source that generated it. If the source is a 7000-word list, you may say it has -log2(1/7000) bits of entropy multiplied by the number of words produced. In reality, however, the number of words is unknown to an attacker, so from their perspective, the actual entropy could be anything from zero to infinity. Of course it's sensible to assume you use something between 5 and 10 words, but it goes to show that calculating exact entropy values from a given set of words is impossible and that entropy itself tells you very little to nothing about how difficult a password or phrase is to crack in practice.

bwbug commented 2 years ago

Hi @phoerious -- Thank you for your thoughtful feedback on our discussion. You have given me some things to think about. But, as an initial reaction, I have the following comments/questions on your post (which I'm quoting out of sequence below):

Hence any discussion about the "entropy" of a password is completely nonsensical. The entropy of any single password is always 0.

Makes sense; no argument from me (although I'm evidently guilty of sloppy terminology).

For passwords, entropy can be misused as a proxy for password strength, but it's a very bad estimate and a common source of confusion.

I'm thinking of entropy as a lower bound on the number or guesses required to guarantee a match. Is that a valid interpretation?

If so, the more knowledge we allow the adversary to have, the more conservative our estimate of passphrase strength, right?

In reality, the attacker wouldn't know how many words your phrase has (and again: I think assuming an attacker possesses that kind of knowledge is a misapplication of Kerckhoff's principle).

As a counterpoint, don't most estimates of passphrase strength (outside the academic literature, at least) make the assumption that the number of words (and the size of the word list) is known? That's how we get all those estimates like "-log2(1/7000) bits of entropy multiplied by the number of words".

If the number of words is not known in advance, then if at least a lower and upper bound can be estimated (e.g., 5-10, as you suggested), then the number of required guesses would be given by a geometric series and not actually be that much larger than the number of guesses when the number of words is known (if equal to the upper bound).

The way I'm (currently) thinking about these kinds of issues is assuming that the attacker has possession of a leaked database containing a large number of hashed passwords. They could plausibly decide to attack the subset of passwords that consists of k words (e.g., k=6, which seems like a popular choice among diceware users right now), by running each 6-word candidate against every hash in the table.

The attacker doesn't know anything about whether and, even less so, which separators you used. I feel like this is a common misapplication of Kerckhoff's principle.

Again, you are in principle correct, of course. But in practice, the number of additional guesses doesn't rise exponentially by not having this knowledge (especially since there is a pretty small number of separator options that are popular). Or, using the paradigm described above, the attacker can make a choice to target their attack to that subset of users who choose to not use separators.

So assume you have a three-word passphrase, which (without separators) could also stem from two other words on the list. If the attacker knows exactly that your passphrase is three words, then the optimal strategy would be simply to try all three-word phrases, no matter if they have prefixes or not.

What if the attacker does not know the exact number of words, but they have a sense that the number of words is small (2-4, say), based on the source of the data or other cracks/leaks that can be linked back to this user base. Wouldn't the optimal strategy then be to start with 2-word guesses (before moving on to 3, then 4 words)? And if so, wouldn't the attacker also catch several 3-word passphrases "for free" due to the overlaps than may be present among the words in the word list?

In this scenario, they could gain an advantage by compressing the list of combinations when trying all one-, two-, three-, four-word phrases etc. by eliminating collisions between them (note: they would not gain an advantage by building some sort of superset of words, but by building a subset of combinations up to a certain length).

This sounds interesting and seems like an opportunity for me to learn something new -- but I will have to try to parse what you are saying when I have a little more brain power. Feel free to drop me some relevant keywords that I could search on to learn more about the scenario you have described here.

phoerious commented 2 years ago

I'm thinking of entropy as a lower bound on the number or guesses required to guarantee a match. Is that a valid interpretation?

In fact, it's the average. There are other entropy definitions besides the classical Shannon entropy (min entropy being the most conservative), but for a uniform distribution, they are all the same. As a proxy measure for password quality, you may treat it as some sort of rough lower bound, though.

If so, the more knowledge we allow the adversary to have, the more conservative our estimate of passphrase strength, right?

Naturally.

As a counterpoint, don't most estimates of passphrase strength (outside the academic literature, at least) make the assumption that the number of words (and the size of the word list) is known? That's how we get all those estimates like "-log2(1/7000) bits of entropy multiplied by the number of words".

Yes, but it's all for lack of a better alternative. Real-world password cracking and the entropy of the source are two entirely separate things. More random sources tend to generate better passwords, but they don't have to. Imagine you get the password "qwerty" from a very high-entropy source. It's astronomically unlikely, but possible, yet as a password it's a very bad choice, regardless of how random the source is. That's why we measure password strength based on the observed data, i.e., the generated password, even though it heavily skews our prior probabilities. Take the KeePassXC password generator, for example. It lets you select different character ranges to include in the password. In theory, the "entropy" would need to be calculated solely based on the range of possible characters, though for a medium-length password, you will never get all of them. Sometimes you don't even get a single one from a certain category and sometimes a password will contain exploitable sequences. That's why our estimate changes based on the password, even though all the generation parameters stay the same. We try to more accurately estimate the priors of a clever attacker than the actual entropy of our generation system. We do a "static" calculation only for wordlist-based passphrases, for which we have no better lower-bound estimate.

If the number of words is not known in advance, then if at least a lower and upper bound can be estimated (e.g., 5-10, as you suggested), then the number of required guesses would be given by a geometric series and not actually be that much larger than the number of guesses when the number of words is known (if equal to the upper bound).

It's still a discrete exponential, I wouldn't call that small. The dominating term is for sure the large one, but in absolute numbers, the rest is no small feat.

The way I'm (currently) thinking about these kinds of issues is assuming that the attacker has possession of a leaked database containing a large number of hashed passwords. They could plausibly decide to attack the subset of passwords that consists of k words (e.g., k=6, which seems like a popular choice among diceware users right now), by running each 6-word candidate against every hash in the table.

They could, but that's true for any password. With a sufficiently large number of hashes, guessing "some" of them becomes easier, although each individual one stays improbable. That's simply the pigeonhole principle.

Again, you are in principle correct, of course. But in practice, the number of additional guesses doesn't rise exponentially by not having this knowledge (especially since there is a pretty small number of separator options that are popular). Or, using the paradigm described above, the attacker can make a choice to target their attack to that subset of users who choose to not use separators.

There is, but then again: does an attacker even know you are using a passphrase? In some cases it may be a valid assumption, in most others it could be anything. In principle, you could use your own secret word list, then an attacker wouldn't even have that information.

What if the attacker does not know the exact number of words, but they have a sense that the number of words is small (2-4, say), based on the source of the data or other cracks/leaks that can be linked back to this user base. Wouldn't the optimal strategy then be to start with 2-word guesses (before moving on to 3, then 4 words)? And if so, wouldn't the attacker also catch several 3-word passphrases "for free" due to the overlaps than may be present among the words in the word list?

Yes, those collisions would be "free". That's exactly what I described below. If a two-word and a three-word sequence collide, it needs to be tested only once, so the combinatorial spaces becomes (a little) smaller.

bwbug commented 2 years ago

@phoerious

It's still a discrete exponential, I wouldn't call that small. The dominating term is for sure the large one, but in absolute numbers, the rest is no small feat.

I would (respectfully) argue that unless your word list is implausibly small, the increase is negligible. Even for a 10-word passphrase generated from triple-roll diceware list (with only 216 entries), the number of required guesses for an exhaustive search would be 2.21×10²³ if the number of words is known, but only increase to 2.22×10²³ if one also needs to check passphrases with fewer words. The fractional increase is inversely proportional to the number of words in the word list, which corresponds to only a small fraction of a percent for reasonably sized word lists.

More random sources tend to generate better passwords, but they don't have to. Imagine you get the password "qwerty" from a very high-entropy source. It's astronomically unlikely, but possible, yet as a password it's a very bad choice, regardless of how random the source is.

This is actually a concept that I have been thinking about for a while in the context of word lists, and it is also related to the question @sts10 had raised above about using shorter word lists for cracking. This has nothing to do with uniquely decodable codes or this repo, but since this has been an interesting exchange of ideas so far, allow me to float some of my ideas surrounding the problem of avoiding "bad luck" passphrases generated from high-entropy sources.

Based on reading accounts such as the 2013 Ars Technica article Anatomy of a Hack, I base my thinking on the observation that these types of hackers go after the low-hanging fruit first, before targeting other passwords with increasing levels of complexity. And they specifically mention combinator attacks as a strategy for dealing with passphrases. Because a sizeable percentage of passphrase users do not generate their passphrases at random (or because they re-roll their randomly generated passphrase if it contains words that they don't recognize!), an attacker would most likely have significant success by using a combinator attack based on a reasonably short word list containing commonly known (prevalent) words. I am not concerned with the fate of those users who are overconfident in their ability to generate "random" passphrases based on creativity alone (i.e., without any true entropy source). However, what about the educated user who dutifully uses a high-quality entropy source (e.g., 36 rolls of precision casino dice) to generate their passphrase? How can we prevent such a user from accidentally getting caught up in a short word list combinator attack (targeted at a less sophisticated user), if they happen to have the bad luck of generating a password based only on common words? This is the word-list equivalent of randomly generating a character-based password consisting of qwerty1234, described by @phoerious above.

To me, the only reasonable solution is to exclude the most common words from the word list you will use for generating your passphrase (e.g., using the -r option in Tidy). And if my intuition is correct, if one wants to build a 7776-word list capable of generating 12.96 bits of entropy per word, the prudent approach would be to exclude the 7776 most common words!

Opinions/feedback on these ideas would be very welcome.

Sam -- I can move this post to a more fitting repo if you want, but I wanted to take my chance to ask for feedback on this concept here, since you've assembled some thought leaders in this thread!

phoerious commented 2 years ago

I would (respectfully) argue that unless your word list is implausibly small, the increase is negligible. Even for a 10-word passphrase generated from triple-roll diceware list (with only 216 entries), the number of required guesses for an exhaustive search would be 2.21×10²³ if the number of words is known, but only increase to 2.22×10²³ if one also needs to check passphrases with fewer words.

Don't be fooled by exponentials. The term 216⁹ is negligible compared to 216¹⁰, but at one billion guesses per second, it would still add 32,455 years to your cracking time. 216⁸ is again tiny compared to that, but still adds another 150 years. In total, you would need 32,606 years to even get to the 10-word guesses. I am not sure it matters that those would then take you another 7 million years (which are again dwarfed by the 1.5 billion years of 11 words). And 216 is a tiny word list.

How can we prevent such a user from accidentally getting caught up in a short word list combinator attack (targeted at a less sophisticated user), if they happen to have the bad luck of generating a password based only on common words? This is the word-list equivalent of randomly generating a character-based password consisting of qwerty1234, described by @phoerious above.

Honestly, I wouldn't worry about it too much. If your word list and phrase are sufficiently long. As an immortal, you may worry about the outcomes of the infinite monkey theorem, but for any practical use case, the chances are almost zero. And most word lists already contain plenty of unusual words. Of course, you can make your word lists more exceptional if you want, but it's not really necessary. What you could also do is what we do for passwords and analyze the generated string after the fact. That would require a separate language model for each target language, though.

bwbug commented 2 years ago

What you could also do is what we do for passwords and analyze the generated string after the fact.

Wouldn't this exacerbate the re-roll problem (i.e., if the passphrase is found to be unsuitable, a new one must be generated, so the pool of possible passwords will shrink)?

I don't think the probability of getting common words in your passphrase is that small, because a common strategy for creating word lists is to deliberately include words that are well-known (to aid with memorization). My opinion is that the opposite approach should be used.

phoerious commented 2 years ago

No. Ideally, you would make no assumptions at all and therefore "bad" phrases should be just as likely as "good" phrases. But by introducing knowledge about phrases preferred by humans, the priors change and are no longer uniform and we know that an attacker will/may exploit these. In fact, selecting a set of English words to put on a list has already heavily influenced the distribution compared to what random bytes would be. Hence we deliberately make phrases less random in order to make them harder to crack by a biased, non-uniform attacker, who would be the main threat, since we know that our search space is large enough that random guessing will always be inefficient. That being said, after a certain number of words, it doesn't really matter, because the search space grows exponentially even with "easy" words and that's your primary security aspect.

bwbug commented 2 years ago

So what would be your practical recommendations, for somebody who knows approximately level of entropy they want, but who also wants to minimize the number of words and characters they must type each time they enter their master passphrase?

For example, let's suppose that based on my threat model and some knowledge of prevalent hash rates, I decide that I want around 72 bits of entropy. I can get there (72.4 bits) with 28 dice rolls, so I should be able to use a 7-word passphrase from a short list of 1296 words (e.g., one of the EFF's short lists). But this means that we have almost no entropy to spare, so in my opinion it is important to guard against reasonable foreseeable attacks that would significantly reduce the number of required guesses below the target value of 2⁷².

Therefore, from my perspective, one should always use defensive strategies including the use of separator characters, and the use of word lists that omit the most common/prevalent words. In addition, the minimum word length on the list should be set to account for character-level brute force attacks (e.g., a 4-letter minimum for a short wordlist of 6⁴ words).

Would be interested to hear what kind of practical guidance you can offer based on your experience (especially any advice targeted to users who want to minimize their passphrase length without sacrificing security).

phoerious commented 2 years ago

Just don't use short word lists and try not to overthink it. A 7-word phrase from a 7000-word list takes billions of years to crack if you try randomly, so even if you can optimise away a few hundred million, it doesn't matter.

sts10 commented 2 years ago

try not to overthink it

Ha! A little late for that.

Actually, this reminds me an issue on the SecureDrop repo about password strength that I wanted to tackle recently. After a ton of (over-)thinking, I ended up just writing:

We recommend using a unique, 7-word passphrase for each case described above. You can create these passphrases either by using physical dice or with KeePassXC, a password manager included with Tails.

Hope we all got something out of this discussion. Tricky stuff! Thank you for all of your input.

bwbug commented 2 years ago

OK, it's apparent that on the spectrum of convenience vs. security, we are prioritizing differently.

I think there is utility to exploring how close to the "convenience" end of the spectrum we can get without any practical reduction in security (e.g., reducing the expected cracking time from 1 billion to 1 million years would not have any practical significance for us mortals). As a real-life example, one of the motivations of Sam's blog article on password strength for Securedrop was that 7 words appeared to be too many for the journalists using this tool.

For the record, I don't think there is any reason to reject very short word lists (which can afford to contain short words, but which require a larger number of words per passphrase) or very large word lists (which require longer words, but can yield passphrases consisting of fewer words) a priori. For example, a 14-word passphrase generated from a word list containing only 95 words would essentially be equivalent to a 14-character password generated from the set of all printable ASCII characters — both methods produce 92 bits of entropy.

I'm not convinced that we have resolved the question that is the topic of this issue, but perhaps it is best to let things lie where they are for now. Thank you everybody for contributing your thoughts on this issue!

sts10 commented 2 years ago

I'm not convinced that we have resolved the question that is the topic of this issue, but perhaps it is best to let things lie where they are for now.

Didn't mean to close this off prematurely, but I sensed a bit of an impasse at the moment? In other words, I don't think we'll get a "hard" mathematical answer to the titular question. Entropy and probability is exact in some ways (the formulas produce numbers with plenty of decimal places), but it's inexact and even messy in others! One thing I'm taking away is: When in doubt, adding one more word to your passphrase solves most of these issues.

I will say that, more concretely, I haven't (yet) been convinced to remove options to make word lists uniquely decodable from my word list tool, Tidy, nor stop researching and implementing alternative processes for making word lists uniquely decodable.

bwbug commented 2 years ago

I will say that, more concretely, I haven't (yet) been convinced to remove options to make word lists uniquely decodable from my word list tool, Tidy, nor stop researching and implementing alternative processes for making word lists uniquely decodable.

Of course. It was never my intention to dissuade you from these efforts (nor to dissuade anybody from making use of your new tools). I'm guilty of sensationalizing the title of this issue a bit, towards the goal of generating more interest and discussion, but my real goal has been to create awareness that collision with words outside your word list has the (theoretical) potential to generate vulnerabilities that Sardinas-Patterson cannot solve. I think it's prudent to be aware of this possibility, since it is not an obvious concept (we tend to focus on the words that are included in out word list, not those that do not appear on our word list).

And I think it's fine to close the issue for now, although I reserve the right to re-litigate some of these concepts in the future! 😉

When in doubt, adding one more word to your passphrase solves most of these issues.

True, although what I'm interested in currently is how to securely minimize the passphrase length. I might counter by saying when in doubt, adding separators to your passphrase solves a lot of these issues...