rspeer / wordfreq

Access a database of word frequencies, in various natural languages.

Version 2.5, incorporating OSCAR data #91

Closed by rspeer 3 years ago

rspeer commented 3 years ago

This PR uses the updated exquisite-corpus that gets its Web data from OSCAR, providing a large amount of natural text in a number of languages.

Given our criteria that a "small" wordlist needs 3 sources of data and a "large" wordlist needs 5, this adds new support for the following languages:

These languages now have "large" wordlists:

Vietnamese is included at the syllable level, because its syllables are separated by spaces. At one point I believed that Vietnamese would have to be handled more like Chinese, with a heuristic-based tokenizer that groups multiple syllables together into words. But when I scan the Vietnamese Wiktionary data, I find that the multi-syllable entries seem to be phrases that decompose into defined words. Tokenizing Vietnamese by splitting on spaces also seems to put us in line with how other NLP systems handle it.
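
As a rough illustration of what syllable-level tokenization means here, the sketch below splits Vietnamese text on whitespace and counts the resulting syllables. It is not wordfreq's actual tokenizer: the function names `syllable_tokens` and `syllable_counts` and the punctuation set are assumptions made for this example only.

```python
import unicodedata
from collections import Counter

# Punctuation trimmed from token edges; an illustrative set, not wordfreq's rules.
EDGE_PUNCTUATION = ".,;:!?\"'()[]"


def syllable_tokens(text):
    """Split text on whitespace, yielding one token per space-separated syllable."""
    # NFC keeps each accented Vietnamese letter as a single code point.
    normalized = unicodedata.normalize("NFC", text).lower()
    tokens = []
    for chunk in normalized.split():
        token = chunk.strip(EDGE_PUNCTUATION)
        if token:
            tokens.append(token)
    return tokens


def syllable_counts(text):
    """Count how often each syllable-level token occurs in the text."""
    return Counter(syllable_tokens(text))


if __name__ == "__main__":
    print(syllable_tokens("Việt Nam là một quốc gia."))
```

Under this scheme a multi-syllable name such as "Việt Nam" contributes two separate entries, which matches the syllable-level choice described above.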

rspeer commented 3 years ago

Not sure how Galician got in there, because it only has 2 sources. I'm guessing it was a file left behind from a different build attempt where it had 3, before I filtered some data.
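
The source-count criteria from the PR description (3 sources for a "small" wordlist, 5 for a "large" one) can be sketched as a simple check. This is a hypothetical helper, not code from wordfreq or exquisite-corpus; the name `wordlist_tier` and the choice to return only the largest tier a language qualifies for are assumptions.

```python
# Thresholds taken from the PR description.
SMALL_MIN_SOURCES = 3
LARGE_MIN_SOURCES = 5


def wordlist_tier(num_sources):
    """Return the largest wordlist tier a language qualifies for, or None."""
    if num_sources >= LARGE_MIN_SOURCES:
        return "large"
    if num_sources >= SMALL_MIN_SOURCES:
        return "small"
    return None


if __name__ == "__main__":
    # Galician has only 2 sources, so it should not get a wordlist.
    print(wordlist_tier(2))
```

A check like this, run against the actual per-language source counts at build time, would flag a leftover file such as the Galician one.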