removal of words that could result in "offensive" combinations?

vbondzio commented 3 years ago

... and while most of them are funny and most people understand that this is based on chance and not personal etc., some distributions might disable "random" by default before ending up in some tweeted screenshot of a password that proposes shooting someone of some sexual orientation or bombing some deity.

I saw:

Further lists of common English words were appended to the end to allow for subsequent removal of "inappropriate" words from the initial list.

When I forked to edit, I initially compared against a random "bad word" list I found online (https://www.cs.cmu.edu/~biglou/resources/bad-words.txt) and just deleted the matching lines:

for word in $(cat ./bad-words.txt); do sed -i "/^[[:space:]]\"${word}\",.*$/Id" ./wordset_4k.c; done

Then I saw a bunch of words that would probably also fit the criteria, and some comments that made me think keeping those words might be by design. Hence I'm just leaving this as an FYI / issue; if there is interest in "sanitizing" the list, I can look into a more complete "bad word" list.

solardiz commented 3 years ago

I've been contemplating opening issue 1 for this. Thank you for beating me to it. ;-)

So I've been looking into addressing this kind of issue in passwdqc for a long time. I realized that to do it well is a huge undertaking, and then the criteria would be reconsidered, so instead of removing words and forgetting about them they should be categorized. I also realized that it's not a productive use of my time to work on this further. So my current intent is to accept community contributions moving words to the end of the list, and maybe adding more words. No removals, at least not until code changes are made, because removals affect not only random passphrases but also which passwords pass or fail the "word-based" check, and because categorization is needed to allow for future reconsideration.

I think I've added enough words recently to allow for quite a few removals from the initial 4096 (moves to the end of the list). There are about 50% more words than required now. However, even this might not be enough. In my own strict manual application of the criteria I had mentioned on passwdqc-users to a generated list of common English words, I spent a few hours processing just the words starting with the letter "a" and had only 55% of them left; 45% would be gone. This means with those criteria we need an initial total of 7500+, not the current 6000+, to have 4096 left. And indeed the resulting passphrases would be far harder to memorize - they would become kind of toothless. If the community wants this, they should feel free.
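The estimate above can be checked with a quick calculation. This is only a sketch: it assumes the 55% survival rate seen for the letter "a" holds for the whole list, which the sample does not prove. The result comes out a little under 7500, in line with the rounded figure.

```python
import math

TARGET_WORDS = 4096    # words needed in the initial part of the list
SURVIVAL_RATE = 0.55   # fraction kept in the letter-"a" sample (assumed uniform)


def required_candidates(target: int, survival_rate: float) -> int:
    """Smallest candidate count that still leaves `target` words after filtering."""
    return math.ceil(target / survival_rate)


needed = required_candidates(TARGET_WORDS, SURVIVAL_RATE)
print(needed)
```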

Per a Twitter poll I ran and some comments I received, there's also great demand for deliberately NSFW passphrases. Perhaps a mode like this should then be added. The categorization and bad words lists like the one you referenced (thanks!) should be helpful there.

BTW, I think you could find fgrep -vf bad-words.txt useful, eliminating the need for invoking a command in a loop. You could then drop the -v and have a list to add to the end.
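The same "filter, then append the matches to the end" idea can be sketched in Python. Note the swap: fgrep matches fixed strings anywhere in a line, while this sketch matches whole words exactly (case-insensitively), so it flags fewer entries. The function names and sample data are made up for illustration and do not come from wordset_4k.c or bad-words.txt.

```python
def partition_words(words, bad_words):
    """Split `words` into (kept, moved), preserving the original order.

    Matching is exact and case-insensitive on whole words, which is
    stricter than fgrep's fixed-string substring matching.
    """
    bad = {w.strip().lower() for w in bad_words if w.strip()}
    kept = [w for w in words if w.lower() not in bad]
    moved = [w for w in words if w.lower() in bad]
    return kept, moved


def move_to_end(words, bad_words):
    """Return the list with flagged words moved to the end instead of removed."""
    kept, moved = partition_words(words, bad_words)
    return kept + moved


# Hypothetical sample data for illustration only.
sample = ["apple", "kill", "brick", "bomb", "cloud"]
flagged = ["bomb", "kill", "unused"]
print(move_to_end(sample, flagged))
```

Moving rather than deleting keeps the total word count unchanged, which matches the "no removals" constraint mentioned earlier in the thread.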

So I'd appreciate you and others sending PRs like this. It won't be my problem then. ;-) I don't intend to merge further changes to the word list before releasing 2.0, though, because the word list is already set in stone in the passwdqc for Windows release 2.0, and I'd like releases for different platforms to be consistent. (I managed to push the Windows release out before the source code release because it excludes some other components that I'm considering making further changes to before 2.0.)

vbondzio commented 3 years ago

sigh ... everything becomes more complicated the more you look into it ...

Besides vulgarity and profanity, there are other potentially relevant classifications based on group belonging, situation, the particular action etc. It would be extremely hard to prevent all possible "insensitive" combinations built from words that are "safe" in themselves. A starting point for what to avoid beyond flat-out slurs could be: words related to sexuality, religion or ethnicity, and verbs that could be considered "violent".

There are a bunch of relevant projects, some abandoned for years, but none that offers regularly updated, classified word lists. "DansGuardian" (last update 2012: https://sourceforge.net/projects/dansguardian/files/dansguardian-2.12.0.3.tar.bz2/download ) attempted something like weighting and combinations, but a similar approach would be overkill. Some of the word lists might be usable though. A successor / fork is e2guardian, but the word lists don't seem to be substantially updated: https://github.com/e2guardian/e2guardian/tree/v5.3/configs/lists/phraselists

There are a few projects on (e.g.) GitHub that track "bad" words / slurs etc. but most have the same subsets of about 300-500 words vs. Luis von Ahn's 1.3k "bad-words.txt".

In summary, I couldn't (easily) find a source that provides classified, up-to-date word lists, and tackling this thoroughly would require a lot more effort, probably from someone who has already done related work.

As an interim step, I could combine a few lists and move what can be moved below 4096, that would reduce the probability but definitely not come close to your desired state ...

solardiz commented 3 years ago

As an interim step, I could combine a few lists and move what can be moved below 4096

I'd appreciate that, for merging into version 2.1 or such. Thank you!

that would reduce the probability but definitely not come close to your desired state ...

I don't have a specific desired state - like I mentioned, trying to address all concerns about maybe-inappropriate words and combinations results in harder to memorize generated passphrases. More importantly, this should be as desired by the users, and their preferences vary.

One way to address this is to have a balanced list - avoid what's "obviously" inappropriate, keep the rest, insert some more common words that are currently missing for no reason (or are below 4096). (BTW, what words are common varies by corpus, and I think even more importantly the words should be recognizable by a large fraction of users rather than commonly used. There are words that people don't use very often, but generally know at least some meanings of. Conversely, there are words that some people use somewhat more often, so there are more occurrences in a corpus, but many other people don't recognize at all. I wish we could somehow rank words by percentage of people that recognize them.)

Another way is to have multiple lists, or maybe two entry points into a list - e.g., we can group "bad" words at the very beginning and have a second entry point to right after that sub-list, so we'll have separate sets of 4096 words that would efficiently share most of them. A drawback of the latter trick is that this would look bad in the source code (worse than a list ending in "bad" words) and that only two options might not be enough (e.g., besides a completely cleaned out list and a deliberately NSFW-focused list, it could make sense to also have an unbiased uncensored list similar to what we had before my recent changes). We can also use some C preprocessor magic to #include pieces into 2 or more full 4096+ entry lists. With 2 non-overlapping lists, we could also have a no-preference mode where an extra random bit is encoded into the choice of list. Finally, we'd need a make check that would validate such lists for meeting the code's expectations.
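The "two entry points" trick and the validation step could look roughly like the sketch below. Everything here is an assumption made for illustration: the function names, the toy word list, and the specific validation rules (only the 6-character maximum is taken from the thread). It is not how passwdqc lays out wordset_4k.c.

```python
import secrets


def entry_windows(words, n_bad, size):
    """Return two overlapping windows over one shared list.

    `words` is assumed to start with `n_bad` flagged words. The first
    window starts at index 0 (flagged words included); the second starts
    right after the flagged sub-list. Both hold exactly `size` words.
    """
    if len(words) < n_bad + size:
        raise ValueError("list too short for two full windows")
    return words[:size], words[n_bad:n_bad + size]


def check_wordlist(words, size, max_len=6):
    """Stand-in for a `make check` style validation (rules are assumptions)."""
    problems = []
    if len(words) < size:
        problems.append("fewer than %d words" % size)
    if len(set(words)) != len(words):
        problems.append("duplicate words")
    if any(not w.isalpha() or len(w) > max_len for w in words):
        problems.append("word with bad characters or length")
    return problems


def pick_word(list_a, list_b):
    """No-preference mode: one random bit selects the list, then a word.

    The bit only adds entropy if the two lists do not overlap.
    """
    chosen = list_b if secrets.randbits(1) else list_a
    return secrets.choice(chosen)


# Toy data: 2 "flagged" placeholders followed by 4 ordinary words.
toy = ["aaa", "bbb", "cat", "dog", "egg", "fig"]
with_flagged, without_flagged = entry_windows(toy, n_bad=2, size=4)
print(with_flagged, without_flagged, check_wordlist(toy, size=4))
```

With real sizes the two windows would each hold 4096 words and share all but the flagged prefix and an equally long tail.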

Then there's some interest in non-English word lists for the random passphrases. (I know someone patched a Spanish word list into an older version of passwdqc.) Should they become external files configured at run time, then? This has its own pros and cons, and needs more code.

With code changes, we could also consider other word counts, length ranges, case alterations or lack thereof. Currently the code optionally toggles the case of the first letter of each word, so the 4096 input words are effectively 8192.

We could instead e.g. have only 1024 words, all of length 4, and alter the case of each letter, which would be effectively 16384 with a lower maximum length and fewer maximum keypresses (needing to press Shift at most twice per word, for a total of 4 to 6 keypresses per word vs. the current maximum of 7 for a capitalized length 6 word), or 8192 if we limit to at most one Shift per word (maximum 5 keypresses per word). It's also way fewer words to review and categorize. Of course, that would be a move somewhat away from phrases and towards cryptic strings, which is probably a drawback.
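The counts above can be reproduced by enumerating case patterns. This sketch assumes one Shift press covers a run of adjacent capitals (Shift held down), which is the reading that yields "at most twice per word" for 4 letters; it is not taken from the passwdqc code.

```python
from itertools import product


def shift_presses(pattern):
    """Count Shift presses for a case pattern, holding Shift across adjacent capitals.

    `pattern` is a tuple of booleans, True meaning "uppercase".
    """
    presses = 0
    previous = False
    for upper in pattern:
        if upper and not previous:
            presses += 1  # a new run of capitals starts here
        previous = upper
    return presses


WORD_LEN = 4
WORDS = 1024

patterns = list(product([False, True], repeat=WORD_LEN))
max_shifts = max(shift_presses(p) for p in patterns)
one_shift = [p for p in patterns if shift_presses(p) <= 1]

print(len(patterns) * WORDS)              # effective choices with every pattern allowed
print(max_shifts, WORD_LEN + max_shifts)  # worst-case Shift presses and keypresses
print(len(one_shift))                     # patterns needing at most one Shift press
```

Under this assumption more than 8 patterns need at most one Shift press, so picking 8 of them per word is enough to reach the 8192 figure with at most 5 keypresses per word.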

We could also move in the other direction - allow for longer words so that we can use e.g. EFF's lists + BIP-0039 and have 4096 or even 8192 words coming right from there. Then it's kind of not our fault that some words might be bad, because the authors of these lists had tried to avoid bad words. However, I think too many of the words included in the EFF lists are too obscure.

solardiz commented 3 years ago

There are a few projects on (e.g.) GitHub that track "bad" words

FWIW, some I had found are:

https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words

https://github.com/RobertJGabriel/Google-profanity-words

https://github.com/MauriceButler/badwords

https://github.com/reimertz/curse-words

vbondzio commented 3 years ago

I don't see any future branches as of yet. Let me know approximately when you are planning on opening one and I'll set a reminder to make the PR.

solardiz commented 3 years ago

@vbondzio This PR doesn't need to be against a new branch - it can be against main. Since no other changes to wordset_4k.c are planned yet, it should be trivial to "Rebase and merge" that PR even after some changes to other files have been made for 2.0. So you don't need to wait - you can make it whenever it's convenient for you, and I'll plan on merging it most likely when working on 2.1. Chances are you'll also end up force-pushing some further updates to the PR before it's merged. Thank you!

solardiz commented 3 years ago

@vbondzio In case you didn't notice, 2.0 has been out for a while now, and you didn't need to wait anyway. ;-)

BTW, there are some word removal commits here: https://github.com/freedomofpress/securedrop/commits/e0f900df8f39692f6dc0a9a774a58bb90cd551e4/securedrop/wordlist

vbondzio commented 3 years ago

I swear I didn't forget about it! :-) Back then I just ran down a bit of a rabbit hole trying to find some more theory / research on this and ran out of "free weekend time". I'll bump a simple removal based on a bunch of the word lists up on my todo list. Thanks for the CLs, I'll check them out (this WE)!

openwall / passwdqc

removal of words that could result in "offensive" combinations? #1