| ***** | NOTICE | ***** |
|-------|--------|-------|
| This repository serves large files using GitHub's LFS which now charges for bandwidth. If you receive a quota error, download the tiny 1gramsbyfreq.sh shell script. Running that on your own machine will download Google's entire corpus (over 15 GB) and then, after much processing, prune it down to 0.25 GB. | | |
| ***** | | ***** |
This project includes wordlists derived from Google's ngram corpora, plus the programs used to automatically download and derive the lists, should you wish to regenerate them yourself.
The most important files:
- **frequency-all.txt.gz** (266 MB): Compressed list of all 29 billion words in the corpus, sorted by frequency. Decompresses to over 2 GB. Includes words with weird symbols, numbers, misspellings, OCR errors, and foreign languages.
- **frequency-alpha-alldicts.txt** (18 MB): List of the 246,591 alphabetical words that could be verified against various dictionaries (GCIDE/Webster's 1913, WordNet, and OED v2), sorted by frequency.
- **1gramsbyfreq.sh**: The main shell script, which downloads the data from Google and extracts the frequency information.
Here's a sample of one of the files:
```
#RANKING  WORD  COUNT            PERCENT    CUMULATIVE
1         ,     115,513,165,249  5.799422%   5.799422%
2         the   109,892,823,605  5.517249%  11.316671%
3         .      86,243,850,165  4.329935%  15.646607%
4         of     66,814,250,204  3.354458%  19.001065%
5         and    47,936,995,099  2.406712%  21.407776%
```
Interestingly, if this data is right, only five words make up over 20% of all the words in books from 1880 to 2020. And two of those "words" are punctuation marks! (Don't believe a comma is a word? I've also created wordlists that exclude punctuation; see the files with "alpha" in their names.)
I needed my [XKCD 936]()-compliant password generator to have a good list of words in order to make memorable passphrases. Most lists I've seen are not terribly good for my purposes, as the words often come from extremely narrow domains. The best I found was [SCOWL](), but I didn't like that its words weren't sorted by frequency, so I couldn't easily take a slice of, say, the top 4096 most frequent words.
The obvious solution was to use Google's ngram corpus, which claims to contain a trillion different words drawn from all the books they've scanned for books.google.com (about 4% of all books ever published, they say). Unfortunately, while some people had posted small lists, nobody had published the entire list of every word sorted by frequency. So I made this, and here it is.
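For example, taking the slice I wanted is now a one-liner. A quick sketch, assuming frequency-alpha-alldicts.txt follows the five-column layout shown in the sample above (header line included); the output filename is a placeholder:

```sh
# Skip the header, keep the 4096 highest-ranked rows, and print the WORD column.
tail -n +2 frequency-alpha-alldicts.txt | head -n 4096 |
    awk '{ print $2 }' > top4096-words.txt
```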
You can do anything you want with these lists. While my programs are licensed under the GNU GPL ≥3, I'm explicitly releasing the data produced under the same license as Google granted me: Creative Commons Attribution 3.0.
There are 37,235,985 entries in the V3 (20200217) corpus, but it's a mistake to think there are 37 million different, useful words. For example, 6% of the words found are a single comma. Google used completely automated OCR techniques to find the words, and it made a lot of mistakes. Moreover, their definition of a "word" includes things like "s", "A4oscow", "IIIIIIIIIIIIIIIIIIIIIIIIIIIII", "cuando", "لاامش", "ihm", "SpecialMarkets@ThomasNelson", "buisness" [sic], and ",".
To compensate, Google only included words in the corpus that appeared at least 40 times, but even so there's so much dreck at the bottom of the list that it's really not worth bothering with. Personally, I found that words that appeared over 100,000 times tended to be worthwhile. In addition, I was seeing so many obvious OCR errors that I decided to also create some cleaner lists by using dict to check every word against a dictionary. (IMPORTANT NOTE! If you run these scripts, be sure to set up your own dictd so you're not pounding the public internet servers with a bazillion lookups.)
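For reference, here is a minimal sketch of that kind of dictionary check, not the project's actual code: it assumes a dictd server is running on localhost with the gcide database installed, and that words.txt is a hypothetical one-word-per-line input. The dict client should exit non-zero when it finds no definition, which is what the test relies on.

```sh
# Keep only words that gcide (Webster's 1913) knows about.
while read -r word; do
    if dict --host localhost --database gcide "$word" >/dev/null 2>&1; then
        echo "$word"
    fi
done < words.txt > words-in-gcide.txt
```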
After pruning with dictionaries, I found that 65536 words seemed like a more reasonable cutoff. However, the script currently does not limit the number of words. Because this part has not been optimized yet, it can take a very long time. For faster runs, set `maxcount=65536`.
If you run my scripts (which are tiny), they will download about 14 GiB of data from Google. However, if you simply want the final list, you can download it directly; it uncompresses to over 350 MB. Alternately, if you don't need so many words, consider downloading one of the smaller files I created that have been cleaned up and limited to only the top words verified in dictionaries, such as frequency-alpha-alldicts.txt.
As you can guess, since the file size went down by 90%, I tossed a lot of information. The biggest changes came from dropping the separate counts for each year, ignoring the tags for part of speech (e.g., I used only the count for "watch", which includes the counts for watch_VERB and watch_NOUN), and from combining different capitalizations into a single term. (Each word is listed under its most frequent capitalization: for example, "London" instead of "london".) If you need that data, it's not hard to modify the scripts. Let me know if you have trouble.
I counted up the total number of words in all the books so I could get a rough percentage of how often each word is used in English. I also include a running total of the percentage so you can truncate the file wherever you want (e.g., to get a list covering 95% of all word usage in English).
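A rough sketch of such a truncation, assuming the five-column layout shown in the sample above (with a trailing "%" on the percentages): keep the header plus every word up to 95% cumulative coverage. The filenames are placeholders.

```sh
awk 'NR == 1 { print; next }                  # keep the header line
     { cum = $5; sub(/%/, "", cum)            # strip the "%" sign
       if (cum + 0 <= 95) print; else exit }' \
    frequency-alpha-alldicts.txt > top-95-percent.txt
```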
The corpus includes words suffixed with an underscore and a tag marking the part of speech as which the word appears to have been used. For example:
```
#5101   watch       76,770,311  0.001284%  85.124506%
#8225   watch_VERB  44,060,908  0.000737%  88.174382%
#10464  watch_NOUN  32,697,074  0.000547%  89.601624%
```
Words tagged with a part of speech appear to be simply duplicate counts of the root word. In the example of watch above, note that 76,770,311 ≈ 44,060,908 + 32,697,074.
List of part-of-speech tags (from books.google.com/ngrams/info):

| Tag | Part of speech | Examples |
|-----|----------------|----------|
| NOUN | noun | time_NOUN, State_NOUN, Mr._NOUN |
| VERB | verb | is_VERB, be_VERB, have_VERB |
| ADJ | adjective | other_ADJ, such_ADJ, same_ADJ |
| ADV | adverb | not_ADV, when_ADV, so_ADV |
| PRON | pronoun | it_PRON, I_PRON, he_PRON |
| DET | determiner or article | the_DET, a_DET, this_DET |
| ADP | adposition: either a preposition or a postposition | of_ADP, in_ADP, for_ADP |
| NUM | numeral | one_NUM, 1_NUM, 2001_NUM |
| CONJ | conjunction | and_CONJ, or_CONJ, but_CONJ |
| PRT | particle | to_PRT, 's_PRT, '_PRT, out_PRT |
Part-of-speech tags, undocumented by Google:

| Tag | Part of speech | Examples |
|-----|----------------|----------|
| . | punctuation | ,_. |
| X | ??? | [_X, *_X, =_X, etc._X, de_X, No_X |
Google uses these tags for searching, but they don't appear (at least not in the 1-grams). These tags must stand alone (e.g., START):

- ROOT: root of the parse tree
- START: start of a sentence
- END: end of a sentence
Use a Makefile for dependencies so that multiprocessing is built in (using `make -j`), instead of having to append `&` to commands.
Use `comm` instead of `dict` to check wordlists against dictionaries.
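A sketch of that idea (bash syntax for the process substitutions): `comm` needs both inputs sorted with the same collation, and both files here are hypothetical one-word-per-line lists.

```sh
# Print only the words that appear in both the wordlist and the dictionary headword list.
comm -12 <(sort -u wordlist.txt) <(sort -u dictionary-headwords.txt) > verified-words.txt
```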
The number of books a word occurs in should help determine popularity. Perhaps popularity = occurrences × books?
GitHub does not allow files larger than 100 MB. The file frequency-all.txt.gz is 266 MB, so it has been placed on git-lfs.
Hyphenated words do not appear in the 1-gram list. Why not? Perhaps they are considered 2-grams?
I may need a manually created "stopword" list due to all the obviously non-English words appearing in the list.
Some of the 1-grams I'm turning up as quite popular should actually be 2-grams: e.g. York -> New York. Maybe I should add in 2-grams to the list, since some of them will clearly be in the list of most common "words".
Some words should be capitalized, such as "I" and "London". But it makes sense to accumulate "the" and "The", since otherwise both will be listed as one of the most common words.
Solution: Accumulate twice. First time case-sensitive. Sort by frequency. Then, second time, case-insensitive, outputting the first variation found.
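A minimal sketch of that second, case-insensitive pass, assuming a hypothetical two-column "word count" file that is already sorted by count, descending (i.e., the output of the first, case-sensitive pass):

```sh
awk '{
    key = tolower($1)
    if (!(key in spelling)) spelling[key] = $1   # first spelling seen is the most frequent variant
    total[key] += $2                             # accumulate counts case-insensitively
}
END {
    for (key in total) print spelling[key], total[key]
}' counts-by-case.txt | sort -k2,2nr > counts-folded.txt
```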
I'm currently getting some very strange, or at least unexpected, results. While the top 100 words seem reasonably common, there are some strangely highly ranked words:
```
124 s       147 p        151 J        165 de       202 M        209 general
214 B       225 S        226 Mr       228 York     238 D        241 government
254 R       272 et       282 E        291 John     292 University  294 U
309 H       325 P        328 pp       359 English  365 L        371 v
373 London  390 W        391 Fig      399 e        405 F        422 Figure
426 G       444 British  445 T        446 c        455 N        466 II
472 b       478 French   479 England  508 St       509 General
```
Compare that with common words that are found much less frequently:
```
 2124 eat
 4004 TV
 6040 ate
 6041 bedroom
 6138 fool
10007 foul
10012 swim
10017 sore
15013 lone
15020 doom
```
Certain domain-specific texts overuse the same words over and over. For example, looking at just the total number of uses, "bats" and "psychosocial" are equally common. However, "bats" is used in twice as many books.
I should weight the number of occurrences by the number of different books the word is found in. But what is the proper weight?

The most naive approach would be to multiply the two numbers together. Is that sufficient? Defensible?

    f(occurrences, books): return (occurrences * books)

Can I ignore occurrences entirely? What would a sort by the number of books each word has occurred in at least once look like?

    f(occurrences, books): return (books)
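A sketch of the naive occurrences-times-books weighting, assuming a hypothetical three-column "word occurrences books" input file (printf avoids scientific notation so the numeric sort behaves):

```sh
awk '{ printf "%s %.0f\n", $1, $2 * $3 }' word-occurrences-books.txt |
    sort -k2,2nr > words-by-popularity.txt
```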
Am I adding things up correctly? The smallest count for any word is 40. Was that a cutoff when they were creating the corpus, presuming that words which showed up fewer times than that were OCR errors?
Yes.
There are a bunch of nonsense/units/foreign words mixed in to this corpus. How can I get rid of them all easily?
** Maybe I can get a list of unit abbreviations and grep them out? (A sketch follows this list.)
lbs, J, gm, ppm
** Maybe look up words in gcide and reject non-existent words? OED is too liberal.
cuando, aro, ihm
** A lot of the words that are of type "_X" are suspicious, and there are only 159 of them in the over-1E6 list.
*** Some are not in WordNet and can be easily discarded:
et dem bei durch deux der per je ibid wird und auf su comme lui que ch
della hoc quam del ou auch bien cette les zur sont seq ont du che
facto leur nur di una einer entre ich op sich avec um mais qui nicht
inasmuch zum peut dans por ah vel quae los eine vous esse sunt im quod
nach como une ein aux wie ist lo sie fait las aus werden dei
*** However, that still leaves 83 that are not as easy:
de e el il au r u tout hell esp b d est sur iv pas sa nous ni z la f
se in das chap fig er oder des ii iii m mit als dear alas ma c le o h
ex para j vii mi no yes den x oh vi ut bye mm en die l zu v well pro w
ab al un si ne ce es k cf viii i y non ad g cum ha sind te
*** Most of the real words ("well", "hell", "dear", "chap", "no", "den", "die") show up as other parts of speech. On the other hand, words like "bye", "yes", and "alas" are definitely words, and they're not listed under any type other than _X. (What does _X mean? Interjection?)
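The sketch promised above for the grep-them-out idea: remove any word that appears in a hand-made stopword/unit-abbreviation list. It assumes one word per line in both files; units.txt (containing, e.g., lbs, J, gm, ppm) is hypothetical.

```sh
# -v invert match, -w whole words only, -F fixed strings, -f read patterns from a file
grep -vwFf units.txt wordlist.txt > wordlist-without-units.txt
```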
At first I tried accumulating a different count for each usage of a word (e.g., watch_VERB and watch_NOUN), but that meant some words would split their vote and not be listed among the most common. [This does not seem to be the case in the 2020 dataset in which the plain word is equal to the total of the various part of speech versions].
Also, it meant I had many duplicates of the same word.
The current solution is to skip any words with an underscore in them.
```
r_ADP   1032605    out_ADV   9199818
r_CONJ  1048981    out_ADJ   8645123
r_PRON  1019601    out_PRT   332451517
r_NUM   3316486    out_NOUN  4462386
r_X     3125051    out_ADP   159492310
r_NOUN  2975438
r_VERB  2931183
r_PRT   1044181
r_ADJ   2327691
```
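The equivalent one-liner for that rule, assuming a hypothetical one-entry-per-line list of raw 1-grams:

```sh
grep -v '_' raw-1grams.txt > untagged-1grams.txt
```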
There are 117 words with no vowels, none of them real words: `grep '^[^aeiouy]*_' foo`
Some words are contractions:
cit_NOUN (webster says it means citizen, but given how commonly used it is, more often than "dogs", maybe it was for citations?)
Some words make no sense whatsoever to me:
eds_NOUN 12084339
Some words are British:
programme
If I had some way to accumulate words into their lemmas ("head words"), that would maybe allow rarer inflections to make the 1E6 threshold of useful words: (watching, watches, watched -> watch)
** Perhaps dict using WordNet? No. Webster's? Sort of. It works for 'watching' -> 'watch', but not 'dogs' -> 'dog'.
There are some odd orderings. How can "go" be less common than "children"?
Some words appear to be misspellings.
buisness
Some words may be misspellings or OCR problems.
ADJOURNMEN, ADMINISTATION, bonjamin, Buddhisn
Some words are clearly OCR errors, not misspellings in the original:
A1most, A1ways, A1uminum, a1titude
A1nerica, A1nerican
ADMlNlSTRATlON, LlBRARY, lNSTANT, lNTERNAL, LlVED, lDEAS, lNVERSE, lRELAND
lNTRODUCTlON, lNlTlAL, lNTERlM, (and on and on...)
areheology
anniverfary
beingdeveloped
A0riculture, A0erage, 0paque, 0ndustry, 1nch
A9riculture, a9ain, a9ainst, a9ent, a9ked,
Aariculture
AAppppeennddiixx
Thmking
A4oscow (should be "Moscow")
Some words have been mangled by Google on purpose:
can't, cannot -> "can not" (bigram)
The 2020 format is `WORD [ TAB YEAR,COUNT,BOOKS ]+`, for example:

```
Alcohol	1983,905,353	1984,1285,433	1985,1088,449
```
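A sketch of collapsing those per-year records into a single total per word, assuming tab-separated input as described above; 1grams.tsv is a hypothetical already-decompressed input file.

```sh
awk -F'\t' '{
    occurrences = 0; books = 0
    for (i = 2; i <= NF; i++) {
        split($i, f, ",")            # f[1]=year, f[2]=count, f[3]=book count
        occurrences += f[2]
        books += f[3]
    }
    printf "%s\t%.0f\t%.0f\n", $1, occurrences, books
}' 1grams.tsv > totals.tsv
```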
List of corpora (from books.google.com/ngrams/info):

| Corpus | Shorthand | Persistent identifier | Description |
|--------|-----------|-----------------------|-------------|
| American English 2012 | eng_us_2012 | googlebooks-eng-us-all-20120701 | Books predominantly in the English language that were published in the United States. |
| American English 2009 | eng_us_2009 | googlebooks-eng-us-all-20090715 | Books predominantly in the English language that were published in the United States. |
| British English 2012 | eng_gb_2012 | googlebooks-eng-gb-all-20120701 | Books predominantly in the English language that were published in Great Britain. |
| British English 2009 | eng_gb_2009 | googlebooks-eng-gb-all-20090715 | Books predominantly in the English language that were published in Great Britain. |
| English 2012 | eng_2012 | googlebooks-eng-all-20120701 | Books predominantly in the English language published in any country. |
| English 2009 | eng_2009 | googlebooks-eng-all-20090715 | Books predominantly in the English language published in any country. |
| English Fiction 2012 | eng_fiction_2012 | googlebooks-eng-fiction-all-20120701 | Books predominantly in the English language that a library or publisher identified as fiction. |
| English Fiction 2009 | eng_fiction_2009 | googlebooks-eng-fiction-all-20090715 | Books predominantly in the English language that a library or publisher identified as fiction. |
| English One Million | eng_1m_2009 | googlebooks-eng-1M-20090715 | The "Google Million". All are in English with dates ranging from 1500 to 2008. No more than about 6000 books were chosen from any one year, which means that all of the scanned books from early years are present, and books from later years are randomly sampled. The random samplings reflect the subject distributions for the year (so there are more computer books in 2000 than 1980). |