Closed: espindola closed this pull request 10 months ago
Hi! Thanks for the PR!
Some considerations:
:)
Hey! As above, thank you for the PR (and effort). In addition to the above questions, might I ask what benefit this list would have over the existing "legacy" wordlist? It is based on http://wordlist.aspell.net/12dicts/ which also focuses on common words (with some filters applied, e.g. for profanity)
> Hey! As above, thank you for the PR (and effort). In addition to the above questions, might I ask what benefit this list would have over the existing "legacy" wordlist? It is based on http://wordlist.aspell.net/12dicts/ which also focuses on common words (with some filters applied, e.g. for profanity)
Unlike 12dicts (and like the eff list):
> Hi! Thanks for the PR!
My pleasure.
> Some considerations:
> 1. Is all wiktionary data on a license compatible with this project's BSD 3-Clause? I couldn't tell from skimming over the [copyrights page](https://en.wiktionary.org/wiki/Wiktionary:Copyrights).
I am sorry, but I am really not qualified to answer that. If that is a problem, I am more than happy to modify the PR so that it contains only a script that lets you create your own word list.
> 2. Does the data mining by the kaikki group generate any relevant license changes to the data? I can't find information about it on the page.
Same as 1, sorry.
> 3. Kaikki asks on the bottom of the page for a citation of their project in case the data is used in an academic work. Maybe it's a good idea to add an acknowledgement somewhere in the documentation?
Can do. In README.rst?
> 4. It's a good idea to programmatically sanitize the kaikki data, possibly deduplicating some words.
I did more than that: I also removed prefixes.
> 5. Could you expand a bit on the advantage of this wordlist? I'm not sure I totally understand. Does the variety of words it adds account for mostly familiar words, rather than obscure ones?
Both. See my previous reply.
As this stands, especially with the uncertainty around licensing, I don't feel comfortable merging this list. However, if you're willing to provide a complete script (or scripts) for an end-user to run, I would be happy to include it in the project's contrib.
The EFF wordlist has a lot of awesome properties, like no word being a prefix of another, but it is designed for use with dice, so it is a bit short.
Building a wordlist out of the words defined in a dictionary (like the legacy one was) creates a much bigger list, but with very obscure words.
Most lists of common words (like the Oxford English Corpus) are not freely available, so I thought of finding the most common words in Wikipedia.
Unfortunately, I found Wikipedia a bit too big and hard to process.
The next best thing was Wiktionary. The idea is not to include every word defined in it, but to find the common words used in definitions and examples.
Processing Wiktionary is made easy by the JSON published by https://kaikki.org.
The wordlist in this PR was created by an odd mix of shell and go. The main script is
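(The script itself did not survive in this thread. As a rough sketch of the approach described above, counting the words that appear in Wiktionary definitions, and assuming kaikki.org's JSON Lines export with `senses`/`glosses` fields, it might look something like this in Python rather than the original sh+go:)

```python
import collections
import json
import re

def count_gloss_words(lines):
    """Count lowercase words appearing in definition glosses.

    Assumes kaikki.org's JSON Lines export, where each line is one
    entry whose "senses" carry "glosses" (a list of definition
    strings). Field names are assumptions based on kaikki's format.
    """
    counts = collections.Counter()
    for line in lines:
        entry = json.loads(line)
        for sense in entry.get("senses", []):
            for gloss in sense.get("glosses", []):
                # Crude tokenizer: runs of ASCII letters only.
                counts.update(re.findall(r"[a-z]+", gloss.lower()))
    return counts

# Tiny in-memory sample standing in for the real multi-gigabyte dump.
sample = [
    json.dumps({"word": "cat",
                "senses": [{"glosses": ["a small domesticated animal"]}]}),
    json.dumps({"word": "dog",
                "senses": [{"glosses": ["a domesticated animal"]}]}),
]
print(count_gloss_words(sample).most_common(3))
```

On the real dump you would stream the JSONL file line by line and keep the top N words by frequency as the candidate list.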
And remove-prefixes.go removes words so that no word is a prefix of another:
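(The Go source is likewise not shown in the thread. A minimal Python sketch of one way to get the no-word-is-a-prefix property, sorting and greedily keeping the shorter word, which may differ from what remove-prefixes.go actually does:)

```python
def remove_prefixes(words):
    """Greedy sketch: sort, then drop any word that extends a kept word.

    In lexicographic order a word can only be a prefix of the words
    immediately following it, so comparing against the last kept word
    suffices. Keeping the shorter word is an illustrative choice here.
    """
    kept = []
    for w in sorted(set(words)):
        if kept and w.startswith(kept[-1]):
            continue  # w extends a kept word; drop it
        kept.append(w)
    return kept

print(remove_prefixes(["cats", "cat", "dog", "catalog"]))
# prints ['cat', 'dog']
```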
There is a lot of room for yak shaving over the exact heuristics, but a quick sampling with

```
xkcdpass -w wiktionary --min 3 --max 8
```
creates passwords that I find easy to remember. The list also has over 2x as many words as eff-long, and the average word length is smaller.
I can convert the generation script from sh+go to python and include it in the pull request if desired.