Open DonaldTsang opened 5 years ago
Hi! This is a fork of the original project: I did not curate the source datasets. IIRC, from when I found this project years ago, the data comes from various language-specific Wikipedia dumps.
@malcolmgreaves is there any way of getting the tools for scraping the Wikipedia dumps and processing them for use as in this project?
Where is the source text dataset for the n-grams of those 53 languages? I would like to see whether it differs from the use of UDHR in wooorm/franc#78, and whether it is more accurate.