This PR uses the updated exquisite-corpus that gets its Web data from OSCAR, providing a large amount of natural text in a number of languages.
Given our criteria that a "small" wordlist needs 3 sources of data and a "large" wordlist needs 5, this adds new support for the following languages:
Filipino (Tagalog)
Icelandic
Lithuanian
Malayalam
Slovak
Slovenian
Tamil
Urdu
Vietnamese
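The source-count criteria above can be sketched as a small decision function (names are illustrative, not the project's actual code):

```python
# Hypothetical sketch of the wordlist criteria: 5+ data sources
# qualify a language for a "large" wordlist, 3-4 for a "small" one,
# and fewer than 3 means no wordlist is built.
def wordlist_size(num_sources):
    if num_sources >= 5:
        return "large"
    if num_sources >= 3:
        return "small"
    return None

print(wordlist_size(5))  # large
print(wordlist_size(3))  # small
print(wordlist_size(2))  # None
```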
These languages now have "large" wordlists:
Bengali
Catalan
Hebrew
Norwegian Bokmål
Swedish
Ukrainian
Vietnamese is included at the syllable level, because syllables are separated by spaces. At one point I believed that Vietnamese would have to be handled more like Chinese, with a heuristic-based tokenizer that groups multiple syllables together into words, but when I scanned the Vietnamese Wiktionary data, I found that the multi-syllable entries appear to be phrases that decompose into defined words. Tokenizing Vietnamese by splitting on spaces also seems to put us in line with how other NLP systems handle it.
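As a minimal illustration (not the library's actual tokenizer), splitting on whitespace yields the syllable-level tokens this wordlist counts:

```python
# Vietnamese written text separates syllables with spaces, so a plain
# whitespace split produces syllable-level tokens.
text = "tôi yêu tiếng Việt"
tokens = text.split()
print(tokens)  # ['tôi', 'yêu', 'tiếng', 'Việt']
```

A word-level tokenizer would instead need to group syllables like "tiếng Việt" into one unit, which is the Chinese-style approach we decided against.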
I'm not sure how Galician got in there, because it only has 2 sources. I'm guessing it was a file left behind from a different build attempt, where it had 3 sources before I filtered some data.