ulif / diceware

Passphrases to remember
GNU General Public License v3.0

Improve pt-br wordlist #63

Closed drebs closed 5 years ago

drebs commented 5 years ago

Wordlist in pt-br was first introduced in 7743ed5. The differences to this one are:

The current pt-br wordlist was generated as follows:

  1. Download a dump of Portuguese Wikipedia pages, process all pages and determine the frequency of each word.
  2. Start from /usr/share/dict/brazilian and filter out:
    • words not matching /^[a-z]+$/,
    • words shorter than 4 characters, and
    • words longer than 9 characters.
  3. Sort the remaining words by their frequency in the Portuguese Wikipedia.
  4. Take the top 30K words (a cutoff chosen only because, after the filtering in the next step, roughly the needed number of words still remains).
  5. Filter out:
    • all words that are a suffix of any other word in the list, and
    • less frequent words that differ only by the last character.
  6. Take the 7776 most frequent words.
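
The steps above can be sketched roughly as follows. This is only an illustration, not the script actually used for this wordlist: the function names (`count_words`, `build_wordlist`), the tokenizer, the alphabetical tie-breaking and the reading of "differ only by the last character" (same length, same letters except the final one) are all assumptions. Parsing of the Wikipedia dump itself is left out; `count_words` simply expects an iterable of already extracted page texts.

```python
import re
from collections import Counter

# Pattern from step 2; fullmatch() is used so the anchors are implied.
WORD_RE = re.compile(r"[a-z]+")
# Assumed tokenizer: runs of letters (including accented ones).
TOKEN_RE = re.compile(r"[^\W\d_]+")


def count_words(texts):
    """Step 1: word frequencies over an iterable of page texts."""
    freq = Counter()
    for text in texts:
        freq.update(TOKEN_RE.findall(text.lower()))
    return freq


def build_wordlist(dictionary_words, freq, top_n=30000, final_size=7776):
    """Steps 2 to 6, given dictionary words and a frequency mapping."""
    # Step 2: keep only a-z words with 4 to 9 characters.
    candidates = {
        w for w in dictionary_words
        if WORD_RE.fullmatch(w) and 4 <= len(w) <= 9
    }
    # Step 3: most frequent first (ties broken alphabetically, an assumption).
    ranked = sorted(candidates, key=lambda w: (-freq.get(w, 0), w))
    # Step 4: keep the top_n most frequent words.
    ranked = ranked[:top_n]

    # Step 5a: drop every word that is a proper suffix of another word.
    present = set(ranked)
    suffixes = set()
    for w in ranked:
        for i in range(1, len(w)):
            if w[i:] in present:
                suffixes.add(w[i:])
    ranked = [w for w in ranked if w not in suffixes]

    # Step 5b: among words sharing everything but the last character,
    # keep only the most frequent one (ranked is already sorted).
    seen_stems = set()
    kept = []
    for w in ranked:
        stem = w[:-1]
        if stem in seen_stems:
            continue
        seen_stems.add(stem)
        kept.append(w)

    # Step 6: take the final_size most frequent survivors.
    return kept[:final_size]
```

With `/usr/share/dict/brazilian` as input, this could be called as `build_wordlist(open("/usr/share/dict/brazilian").read().split(), freq)`, assuming the file is readable as UTF-8 with one word per line. The final size of 7776 equals 6^5, one word for every possible outcome of five dice rolls.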

No further curation was done.

ulif commented 5 years ago

Nice, thank you!