redacted / XKCD-password-generator

Generate secure multiword passwords/passphrases, inspired by XKCD
BSD 3-Clause "New" or "Revised" License

Add a wordlist based on the wiktionary #149

Closed espindola closed 10 months ago

espindola commented 1 year ago

The EFF wordlist has a lot of awesome properties, such as no word being a prefix of another, but it is designed for use with dice, so it is a bit short.

Building a wordlist out of the words defined in a dictionary (as the legacy list was) creates a much bigger list, but one full of very obscure words.

Most lists of common words (like the Oxford English Corpus) are not freely available, so I thought of finding the most common words in Wikipedia.

Unfortunately, I found Wikipedia is a bit too big and hard to process.

The second-best option was Wiktionary. The idea is not to include every word defined in it, but to find the common words used in definitions and examples.

Processing Wiktionary is made easy by the JSON published by https://kaikki.org.

The wordlist in this PR was created by an odd mix of shell and Go. The main script is:

#!/bin/bash
set -euo pipefail
export LANG=C

curl -O https://kaikki.org/dictionary/English/kaikki.org-dictionary-English.json

cat kaikki.org-dictionary-English.json |
    jq ' .senses | map(.glosses,(.examples //[] | map(.text)))' |
    tr -c '[:alpha:]' '\n'          | # turn non alpha into newlines
    sed -e 's/vv/w/g' -e 's/VV/W/g' | # OCR issue I guess
    grep  '...'                     | # at least 3 characters
    grep -v '.........'             | # at most 8 characters
    grep '[a-z]'                    | # at least a lowercase letter
    tr '[:upper:]' '[:lower:]'      | # ignore case
    sort | uniq -c | sort -n > common

grep -E -v ' (1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28) ' common | # drop words seen 28 times or fewer
    awk '{print $2}'    | # only the word
    sort | go run remove-prefixes.go > wiktionary

And remove-prefixes.go removes words so that no word is a prefix of another:

package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
)

// readLines reads stdin into a slice, one entry per line.
func readLines() []string {
    s := bufio.NewScanner(os.Stdin)
    ret := []string{}
    for s.Scan() {
        ret = append(ret, s.Text())
    }
    return ret
}

// prefixOf returns how many of the words immediately following
// lines[i] (in sorted order) have lines[i] as a prefix.
func prefixOf(lines []string, i int) int {
    l := lines[i]
    n := 0
    for _, o := range lines[i+1:] {
        if !strings.HasPrefix(o, l) {
            return n
        }
        n++
    }
    return n
}

// remove deletes element i from v.
func remove(v []string, i int) []string {
    return append(v[0:i], v[i+1:]...)
}

func main() {
    lines := readLines()
    for i := len(lines) - 1; i >= 0; i-- {
        n := prefixOf(lines, i)
        if n > 1 {
            // Several longer words start with this one: drop the prefix.
            lines = remove(lines, i)
        } else if n == 1 {
            // Exactly one longer word starts with this one: drop it, keep the prefix.
            lines = remove(lines, i+1)
        }
    }

    for _, v := range lines {
        fmt.Println(v)
    }
}

There is a lot of room for yak shaving over the exact heuristics, but a quick sampling with

xkcdpass -w wiktionary --min 3 --max 8

produces passphrases that I find easy to remember. The list also has over twice as many words as eff-long, and the average word length is smaller.

I can convert the generation script from shell and Go to Python and include it in the pull request if desired.

anakimluke commented 1 year ago

Hi! Thanks for the PR!

Some considerations:

  1. Is all wiktionary data on a license compatible with this project's BSD 3-Clause? I couldn't tell from skimming over the copyrights page.
  2. Does the data mining by the kaikki group generate any relevant license changes to the data? I can't find information about it on the page.
  3. Kaikki asks on the bottom of the page for a citation of their project in case the data is used in an academic work. Maybe it's a good idea to add an acknowledgement somewhere in the documentation?
  4. It's a good idea to programmatically sanitize the kaikki data, possibly deduplicating some words.
  5. Could you expand a bit on the advantage of this wordlist? I'm not sure I totally understand. Does the variety of words it adds account for mostly familiar words, rather than obscure ones?

:)

redacted commented 1 year ago

Hey! As above, thank you for the PR (and effort). In addition to the above questions might I ask what benefit this list would have over the existing "legacy" wordlist? It is based on http://wordlist.aspell.net/12dicts/ which also focuses on common words (with some filters applied for eg profanity)

espindola commented 1 year ago

> Hey! As above, thank you for the PR (and effort). In addition to the above questions might I ask what benefit this list would have over the existing "legacy" wordlist? It is based on http://wordlist.aspell.net/12dicts/ which also focuses on common words (with some filters applied for eg profanity)

Unlike 12dicts (and like the eff list):

espindola commented 1 year ago

> Hi! Thanks for the PR!

My pleasure.

> Some considerations:

> 1. Is all wiktionary data on a license compatible with this project's BSD 3-Clause? I couldn't tell from skimming over the [copyrights page](https://en.wiktionary.org/wiki/Wiktionary:Copyrights).

I am sorry, but I am really not qualified to answer that. I am more than happy to modify the PR to have only a script that lets you create your own word list if that is a problem.

> 2. Does the data mining by the kaikki group generate any relevant license changes to the data? I can't find information about it on the page.

Same as 1, sorry.

> 3. Kaikki asks on the bottom of the page for a citation of their project in case the data is used in an academic work. Maybe it's a good idea to add an acknowledgement somewhere in the documentation?

Can do. In README.rst?

> 4. It's a good idea to programmatically sanitize the kaikki data, possibly deduplicating some words.

I did more than that: I removed prefixes.

> 5. Could you expand a bit on the advantage of this wordlist? I'm not sure I totally understand. Does the variety of words it adds account for mostly familiar words, rather than obscure ones?

Both. See my previous reply.

redacted commented 10 months ago

As this stands, especially with the uncertainty around licensing, I don't feel comfortable merging this list. However, if you're willing to provide a complete script (or scripts) for an end-user to run, I would be happy to include it in the project's contrib