ulif / diceware

Passphrases to remember
GNU General Public License v3.0

wordlist_de_dys2p_7776.txt added #91

Open b068931cc450442b63f5b3d276ea4297 opened 2 years ago

b068931cc450442b63f5b3d276ea4297 commented 2 years ago

Since the usability of Diceware depends on the quality of the word lists, a word list should consist of words that are as familiar and easy to remember as possible.

Our word list de-7776 is suitable as a diceware word list for five dice. The words are unique from the fifth letter on. Furthermore, it follows these rules for the most part, but not one hundred percent:

b068931cc450442b63f5b3d276ea4297 commented 1 year ago

We created the lists manually: first one for diceware with 4 dice (1296 words), one for Monero (1626 words) and one for BIP39 (2048 words), and then another because we thought the most common German list for diceware with 5 dice needed improvement. The lists and some more words are available here. They are all under the CC0-1.0 license.
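As a rough sketch of why those list sizes come up: each die has six faces, so n dice address 6^n words, and a uniformly chosen word contributes log2(list size) bits. The helper names `words_needed` and `bits_per_word` below are illustrative only and are not part of diceware.

```python
import math


def words_needed(dice: int) -> int:
    """Number of distinct words that `dice` six-sided dice can address."""
    return 6 ** dice


def bits_per_word(list_size: int) -> float:
    """Entropy in bits of one word drawn uniformly from a list of this size."""
    return math.log2(list_size)


# List sizes mentioned in this thread: 4 dice, Monero, BIP39, 5 dice.
for size in (words_needed(4), 1626, 2048, words_needed(5)):
    print(size, round(bits_per_word(size), 2))
```

This assumes every word is picked uniformly at random, which is what rolling fair dice gives you.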

More than 100 words contain soft hyphen chars (0xc2 0xad in UTF-8). The first is "agrarkultur", the last "xylophon". They have to be removed before a merge can happen.

As far as I can tell from the Duden, hyphenation is possible for some words but not necessary, and the unhyphenated form is the more familiar variant, for example with Agrarkultur.

You say the list contains no words with negative connotations, but it also contains "arsch" and other not-too-friendly words. How did you check?

We did that manually as well. We removed some words with a mostly or purely negative context but left others like "Arsch" in the list, because colloquially it means the buttocks rather than someone being an "ass".

I do not prefer masculine forms, even though this conflicts with some Bitcoin standards (BIP39?), and I wouldn't even consider this a sign of quality in a wordlist. Au contraire. So, if your list makes it into the collection, there is no guarantee that the list won't be flooded with feminine replacements in the future. If you don't want that, please say so.

That is not a problem at all; feel free to do it that way.

Please give a license for the list and a copyright contact if you are not the copyright holder yourself.

https://github.com/dys2p/wordlists-de CC0-1.0 license

Could you think of a shorter name for the list? People have to use the middle part as the option value when picking a list.

That's right, I was unsure about that too. You are welcome to make other suggestions.

After reviewing the list again, my current view is that the one with 1296 words is done so far, and the one with 7776 words still needs a few changes (e.g., a few nouns are plural instead of singular). However, I can't currently estimate exactly when I can revise it again. Sorry about that.

ulif commented 1 year ago

More than 100 words contain soft hyphen chars (0xc2 0xad in UTF-8). The first is "agrarkultur", the last "xylophon". They have to be removed before a merge can happen.

As far as I can tell from the Duden, hyphenation is possible for some words but not necessary, and the unhyphenated form is the more familiar variant, for example with Agrarkultur.

I am afraid this is not the point. It is not about grammar but about non-ASCII chars, the raw data in your wordlist. Some lines in your wordlist contain "invisible" hyphens. Take, for instance, line 828. At first sight it looks like "basteln",

"b" "a" "s" "t" "e" "l" "n" "\n" or in hex: 0x62 0x61 0x73 0x74 0x65 0x6c 0x6e 0x0a

i.e. 7 chars plus newline. In fact the line looks like this:

"b" "a" "s" <SOFT-HYPHEN> "t" "e" "l" "n" "\n" or in hex: 0x62 0x61 0x73 0xc2 0xad 0x74 0x65 0x6c 0x6e 0x0a

i.e. 9 bytes plus newline, because the soft hyphen alone takes two bytes (0xc2 0xad) in UTF-8. These <SOFT-HYPHEN> chars can be found in more than 100 words of your list (but not in the others).
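The byte-level difference described above can be checked with a few lines of Python. This is only a minimal sketch: the function names `find_soft_hyphens` and `strip_soft_hyphens` are made up for illustration, and the two sample entries stand in for the real wordlist file.

```python
SOFT_HYPHEN = "\u00ad"  # encoded as the two bytes 0xc2 0xad in UTF-8


def find_soft_hyphens(lines):
    """Return (line_number, word) pairs for entries containing U+00AD."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if SOFT_HYPHEN in line
    ]


def strip_soft_hyphens(lines):
    """Return the lines with every soft hyphen removed."""
    return [line.replace(SOFT_HYPHEN, "") for line in lines]


# Two sample entries; the second carries the hidden character.
sample = ["basteln\n", "bas\u00adteln\n"]

print(find_soft_hyphens(sample))
# Byte lengths including the newline: 8 for the clean entry, 10 for the other.
print([len(line.encode("utf-8")) for line in sample])
print(strip_soft_hyphens(sample))
```

To run this against an actual list, the lines could be read with `open(path, encoding="utf-8")`, where `path` points at a local copy of the wordlist.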

Of course such invisible chars can be nasty. Imagine someone copy-pasting a diceware passphrase with such hidden chars when setting a password. How should the person type this password later? Will the person even be aware of the hidden chars?

I hope that helps to understand what my point is.

ulif commented 1 year ago

A quick check on the de-7776-wordlists on https://github.com/dys2p/wordlists-de reveals that they also suffer from the soft-hyphen problem. You might want to fix them as well.

b068931cc450442b63f5b3d276ea4297 commented 1 year ago

I am sorry that I am only now answering again. The soft hyphens have been removed in the meantime, but I will revise the list again.