malcolmgreaves / language-detection

Automatically exported from code.google.com/p/language-detection. Some after-the-fact modifications to get this working within sbt.
Apache License 2.0

Source of language datasets #77

Open DonaldTsang opened 5 years ago

DonaldTsang commented 5 years ago

Where is the source text dataset for the n-gram profiles of those 53 languages? I would like to see whether it differs from the UDHR text used in wooorm/franc#78, and whether it gives more accurate results.

malcolmgreaves commented 5 years ago

Hi! This is a fork of the original project: I did not curate the source datasets. IIRC from when I found this project years ago, the data comes from various language-specific Wikipedia dumps.

DonaldTsang commented 5 years ago

@malcolmgreaves is there any way to get the tools for scraping the Wikipedia dumps and processing them for use as in this project?