This PR uses the updated exquisite-corpus that gets its Web data from OSCAR, providing a large amount of natural text in a number of languages.
Given our criteria that a "small" wordlist needs 3 sources of data and a "large" wordlist needs 5, this adds new support for the following languages:
Filipino (Tagalog)
Icelandic
Lithuanian
Malayalam
Slovak
Slovenian
Tamil
Urdu
Vietnamese
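The source-count criteria above can be sketched as a small decision function (names are illustrative, not the project's actual code):

```python
# Hypothetical sketch of the wordlist criteria: 5+ data sources
# qualify a language for a "large" wordlist, 3-4 for a "small" one,
# and fewer than 3 means no wordlist is built.
def wordlist_size(num_sources):
    if num_sources >= 5:
        return "large"
    if num_sources >= 3:
        return "small"
    return None

print(wordlist_size(5))  # large
print(wordlist_size(3))  # small
print(wordlist_size(2))  # None
```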
These languages now have "large" wordlists:
Bengali
Catalan
Hebrew
Norwegian Bokmål
Swedish
Ukrainian
Vietnamese is included at the syllable level, because syllables are separated by spaces. At one point I believed that Vietnamese would have to be handled more like Chinese, with a heuristic-based tokenizer that groups multiple syllables together into words, but when I scanned the Vietnamese Wiktionary data, I found that the multi-syllable entries appear to be phrases that decompose into defined words. Tokenizing Vietnamese by splitting on spaces also seems to put us in line with how other NLP systems handle it.
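As a minimal illustration (not the library's actual tokenizer), splitting on whitespace yields the syllable-level tokens this wordlist counts:

```python
# Vietnamese written text separates syllables with spaces, so a plain
# whitespace split produces syllable-level tokens.
text = "tôi yêu tiếng Việt"
tokens = text.split()
print(tokens)  # ['tôi', 'yêu', 'tiếng', 'Việt']
```

A word-level tokenizer would instead need to group syllables like "tiếng Việt" into one unit, which is the Chinese-style approach we decided against.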
I'm not sure how Galician got in there, because it only has 2 sources. I'm guessing it was a file left behind from a different build attempt, where it had 3 sources before I filtered some data.