hackerb9 / gwordlist

All the words from Google Books, sorted by frequency

Repetitions in frequency-alpha-alldicts.txt #5

Open bszollosinagy opened 1 year ago

bszollosinagy commented 1 year ago

The word "ascetic" appears more than once in the file: at ranks 18614, 25054, 63318, and 104505.

The words "copious" and "verdant" are also duplicated for some reason.

Can the counts be simply summed across all occurrences?

hackerb9 commented 1 year ago
$ grep ascetic frequency-alpha-alldicts.txt 
18614      ascetic                      2,875,469    0.000199%   97.305329%
25054      asceticism                   1,605,339    0.000111%   98.265396%
63318      ascetical                      153,464    0.000011%   99.760632%
104505     ascetically                     24,997    0.000002%   99.955170%

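The four lines in the grep output are different words that share the root "ascetic", not repeats of one word. It would be nice to be able to merge different forms of the same root together, as a dictionary does, but that information is not included in the Google corpus.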
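
To check for genuine repeats, the word column can be compared exactly rather than by substring, which is what the grep above does. A minimal sketch, assuming the whitespace-separated column layout shown in the output above (rank, word, count, percent, cumulative percent); the sample file and the function names `exact_word` and `duplicate_words` are hypothetical and only for illustration:

```shell
# Hypothetical sample in the same column layout as the grep output above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
18614      ascetic                      2,875,469    0.000199%   97.305329%
25054      asceticism                   1,605,339    0.000111%   98.265396%
63318      ascetical                      153,464    0.000011%   99.760632%
104505     ascetically                     24,997    0.000002%   99.955170%
EOF

# Print only the lines whose second column is exactly the given word.
# Usage: exact_word WORD FILE
exact_word() {
    awk -v w="$1" '$2 == w' "$2"
}

# Print every word that occurs on more than one line
# (prints nothing if every word is unique).
# Usage: duplicate_words FILE
duplicate_words() {
    awk '{ print $2 }' "$1" | sort | uniq -d
}

exact_word ascetic "$sample"
duplicate_words "$sample"
```

On this sample, `exact_word` prints only the rank 18614 line and `duplicate_words` prints nothing. Pointing both functions at frequency-alpha-alldicts.txt instead of the sample should show whether any word really is listed twice.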

Do you know of any database I could use for such merging? I'm not going to write an automatic algorithm for it, as it would end up merging "cop" with "copy" and "copious".
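
For illustration, here is roughly how that goes wrong. This is a deliberately naive sketch, not a real stemming algorithm and not something used in this repository; the function name `naive_stem` and its suffix list are made up for the example:

```shell
# Hypothetical naive stemmer: strip one common English suffix from the end
# of each input word. Only meant to show the failure mode.
naive_stem() {
    sed -E 's/(ious|ing|ed|y|s)$//'
}

# Three unrelated words collapse into a single "root".
printf '%s\n' cop copy copious | naive_stem | sort -u
# prints: cop
```

A curated list of word forms would avoid this, since it would group "ascetic" with "asceticism" without also grouping "cop" with "copious".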