optimaize / language-detector

Language Detection Library for Java
Apache License 2.0
567 stars 165 forks

Source of language corpus #103

Open DonaldTsang opened 4 years ago

DonaldTsang commented 4 years ago

Where is the source text dataset for the n-grams of those 70 languages? I would like to see whether it differs from the UDHR text used in wooorm/franc#78, and whether it gives more accurate results than that corpus.

"There are two kinds of profiles. The standard ones created from Wikipedia articles and similar. And the "short text" profiles created from Twitter tweets."
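
To illustrate why the source corpus matters here, below is a minimal sketch of how character n-gram counts can be derived from raw text. It is not this library's actual profile builder: the class name `NgramProfileSketch`, the method `countNgrams`, and the two sample strings are hypothetical and only stand in for "Wikipedia-like" and "tweet-like" input.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not part of optimaize/language-detector.
public class NgramProfileSketch {

    // Count every character n-gram of length n in the text.
    // Works on UTF-16 code units, so characters outside the BMP are
    // not handled specially in this simplified version.
    static Map<String, Integer> countNgrams(String text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= text.length(); i++) {
            counts.merge(text.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Assumed sample inputs; a real profile would use a large corpus.
        String articleLike = "the quick brown fox jumps over the lazy dog";
        String tweetLike = "omg lol thx u 2";

        System.out.println(countNgrams(articleLike, 3));
        System.out.println(countNgrams(tweetLike, 3));
    }
}
```

Because the counts come directly from the input text, two corpora in the same language (for example encyclopedia articles versus tweets, or the UDHR) can produce noticeably different profiles, which is why the question of the underlying dataset is relevant.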