
gwordlist

All the words from Google Books, sorted by frequency

***** NOTICE *****
This repository serves large files using GitHub's LFS, which now charges for bandwidth. If you receive a quota error, download the tiny 1gramsbyfreq.sh shell script instead. Running it on your own machine will download Google's entire corpus (over 15 GB) and then, after much processing, prune it down to 0.25 GB.
***** *****
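
If you go that route, a minimal sketch looks something like this (it assumes the script is executable; GIT_LFS_SKIP_SMUDGE=1 keeps Git LFS, if installed, from downloading the large data files during the clone):

  # Clone without fetching the LFS-tracked data files, then rebuild the
  # lists locally. Expect a download of over 15 GB from Google and a long
  # processing time.
  GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/hackerb9/gwordlist.git
  cd gwordlist
  ./1gramsbyfreq.sh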

This project includes wordlists derived from Google's ngram corpora, plus the programs used to automatically download and derive the lists yourself, should you wish to do so.

The most important files are frequency-all.txt.gz (the complete list of every word, sorted by frequency; stored in Git LFS) and frequency-alpha-alldicts.txt (a much smaller list that has been cleaned up and limited to the top words verified in dictionaries).

What does the data look like?

Here's a sample of one of the files:

#RANKING   WORD                             COUNT      PERCENT   CUMULATIVE
1          ,                      115,513,165,249    5.799422%    5.799422%
2          the                    109,892,823,605    5.517249%   11.316671%
3          .                       86,243,850,165    4.329935%   15.646607%
4          of                      66,814,250,204    3.354458%   19.001065%
5          and                     47,936,995,099    2.406712%   21.407776%

Interestingly, if this data is right, only five words make up over 20% of all the words in books from 1880 to 2020. And two of those "words" are punctuation marks!! (Don't believe a comma is a word? I've also created wordlists that exclude punctuation; see the files named "alpha".)

Why does this exist?

I needed my XKCD 936-compliant password generator to have a good list of words in order to make memorable passphrases. Most lists I've seen are not terribly good for my purposes, as the words are often drawn from extremely narrow domains. The best I found was SCOWL, but I didn't like that its words weren't sorted by frequency, so I couldn't easily take a slice of, say, the top 4096 most frequent words.

The obvious solution was to use Google's ngram corpus, which claims to contain over a trillion words culled from all the books they've scanned for books.google.com (about 4% of all books ever published, they say). Unfortunately, while some people had posted small lists, nobody had posted the entire list of every word sorted by frequency. So I made this, and here it is.

What can this data be used for?

Anything you want. While my programs are licensed under the GNU GPL ≥3, I'm explicitly releasing the data produced under the same license as Google granted me: Creative Commons Attribution 3.0.

How many words does it really have?

There are 37,235,985 entries in the V3 (20200217) corpus, but it's a mistake to think there are 37 million different, useful words. For example, nearly 6% of all word occurrences are a single comma. Google used completely automated OCR to find the words, and it made a lot of mistakes. Moreover, their definition of a "word" includes things like s, A4oscow, IIIIIIIIIIIIIIIIIIIIIIIIIIIII, cuando, لاامش, ihm, SpecialMarkets@ThomasNelson, buisness [sic], and ,.

To compensate, Google only included words that appeared at least 40 times, but even so there's so much dreck at the bottom of the list that it's really not worth bothering with. Personally, I found that words appearing over 100,000 times tended to be worthwhile. In addition, I was seeing so many obvious OCR errors that I decided to also create some cleaner lists by using dict to check every word against a dictionary. (IMPORTANT NOTE: if you run these scripts, be sure to set up your own dictd so you're not pounding the internet servers with a bazillion lookups.)
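
A minimal sketch of that kind of dictionary filtering, assuming a local dictd is running on localhost and that candidate-words.txt holds one word per line (both names are hypothetical; the real filtering is done by the scripts in this repository):

  # dict exits non-zero when it finds no definition, so the if-test
  # keeps only words the local dictionary server recognizes.
  while read -r word; do
      if dict -h localhost "$word" >/dev/null 2>&1; then
          printf '%s\n' "$word"
      fi
  done < candidate-words.txt > dictionary-words.txt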

After pruning with dictionaries, I found that 65,536 words seemed like a more reasonable cutoff. However, the script does not currently limit the number of words, and because this part has not been optimized yet it can take a very long time. For faster runs, set maxcount=65536.

How big are the files?

If you run my scripts (which are tiny), they will download about 14 GiB of data from Google. However, if you simply want the final list, it uncompresses to over 350 MB. Alternatively, if you don't need so many words, consider downloading one of the smaller files I created, which have been cleaned up and limited to only the top words verified in dictionaries, such as frequency-alpha-alldicts.txt.

What got thrown away in these subcorpora?

As you can guess, since the file size went down by 90%, I tossed a lot of information. The biggest changes came from losing the separate counts for each year, ignoring the part-of-speech tags (e.g., I used only the count for "watch", which includes the counts for watch_VERB and watch_NOUN), and combining different capitalizations into a single term. (Each word is listed under its most frequent capitalization: for example, "London" instead of "london".) If you need that data, it's not hard to modify the scripts. Let me know if you have trouble.

What got added?

I counted up the total number of words in all the books so I could get a rough percentage of how often each word is used in English. I also include a running total of those percentages so you can truncate the file wherever you want (e.g., to get a list covering 95% of all word use in English).
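
For example, a rough way to cut a list at 95% cumulative coverage, assuming the five-column layout shown in the sample above (the file and output names are just examples):

  # The fifth column is the running cumulative percentage; awk's numeric
  # conversion ignores the trailing '%'. Keep the header and every word
  # until the running total passes 95%.
  awk 'NR==1 || $5+0 <= 95' frequency-alpha-alldicts.txt > top95percent.txt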

Part of Speech tags

The corpus includes words suffixed with an underscore and a tag marking which part of speech the word appears to have been used as. For example:

#5101      watch                    76,770,311      0.001284%      85.124506%
#8225      watch_VERB               44,060,908      0.000737%      88.174382%
#10464     watch_NOUN               32,697,074      0.000547%      89.601624%
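
If you want to ignore the tagged entries and keep only the aggregate counts, something like this should work (a sketch; it assumes the tag is always an underscore followed by capital letters, as in the rows above, and the output file name is your choice):

  # Drop rows whose word carries a part-of-speech tag such as _VERB,
  # _NOUN, or _X.
  zcat frequency-all.txt.gz | grep -v -E '_[A-Z]+[[:space:]]' > frequency-untagged.txt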

Bugs

To Do

LFS

GitHub does not allow files larger than 100 MB. The file frequency-all.txt.gz is 266 MB, so it has been placed on Git LFS.
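
If you do want that file and have LFS bandwidth quota to spare, the usual Git LFS workflow applies (a sketch; see the bandwidth notice at the top):

  # Requires git-lfs. A normal clone fetches the LFS objects automatically;
  # 'git lfs pull' fetches them for a clone made with GIT_LFS_SKIP_SMUDGE=1.
  git clone https://github.com/hackerb9/gwordlist.git
  cd gwordlist
  git lfs pull
  gunzip -c frequency-all.txt.gz > frequency-all.txt   # over 350 MB uncompressed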

Misc Notes

Compare that with common words that are found much less frequently:

2124 eat
4004 TV
6040 ate
6041 bedroom
6138 fool
10007 foul
10012 swim
10017 sore
15013 lone
15020 doom

** Maybe I can get a list of unit abbreviations and grep them out?

  lbs, J, gm, ppm
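
A rough sketch of that idea (units.txt is a hypothetical file with one abbreviation per line):

  # Remove any row whose word exactly matches a listed abbreviation.
  grep -v -w -F -f units.txt frequency-alpha-alldicts.txt > frequency-no-units.txt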

** Maybe look up words in gcide and reject non-existent words? OED is too liberal.

cuando, aro, ihm
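
Something like this could test that idea, assuming a local dictd that serves the gcide database:

  # dict exits non-zero when a word has no entry in the requested database.
  dict -h localhost -d gcide 'cuando' >/dev/null 2>&1 || echo 'cuando: not in GCIDE'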

** A lot of the words that are of type "_X" are suspicious, and there are only 159 of them in the over-1E6 list.

*** Some are not in WordNet and can be easily discarded:

et dem bei durch deux der per je ibid wird und auf su comme lui que ch
della hoc quam del ou auch bien cette les zur sont seq ont du che
facto leur nur di una einer entre ich op sich avec um mais qui nicht
inasmuch zum peut dans por ah vel quae los eine vous esse sunt im quod
nach como une ein aux wie ist lo sie fait las aus werden dei

*** However, that still leaves 83 that are not as easy:

de e el il au r u tout hell esp b d est sur iv pas sa nous ni z la f
se in das chap fig er oder des ii iii m mit als dear alas ma c le o h
ex para j vii mi no yes den x oh vi ut bye mm en die l zu v well pro w
ab al un si ne ce es k cf viii i y non ad g cum ha sind te

*** Most of the real words ("well", "hell", "dear", "chap", "no", "den", "die") show up under other parts of speech. On the other hand, words like "bye", "yes", and "alas" are definitely words, and they're not listed under any tag other than _X. (What does _X mean? Interjection?)

** Perhaps dict using wordnet? No. Websters? Sort of. It works for 'watching'->'watch', but not 'dogs' -> 'dog'.