optimaize / language-detector

Language Detection Library for Java
Apache License 2.0

Provide training data for all language profiles #21

Open fabiankessler opened 9 years ago

fabiankessler commented 9 years ago

danielnaber mentioned that language profiles should come with the training text they are based on.

I totally agree with that. This would allow anyone to experiment with customizations to improve the profiles.

The readme text for contributions should be updated to kindly ask for the training text. Ideally we'd get the original training text, plus the program that applies modifications to it to build the index.

Since not everyone who checks out the language detector needs these training texts, I'd vote for keeping them separate; otherwise a simple checkout becomes very large. On the other hand, if there are 150 GitHub projects for 150 languages and one would like to try out 4-grams on all of them, that's quite a bit of work too... opinions on that?

I don't recall the state of the current language profiles - I just took what was there. I'll have to see if the original texts are (easily) accessible.
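Re the "try out 4-grams" idea: given the raw training text, switching n-gram sizes is mostly a matter of re-counting character n-grams. A minimal sketch in Java, assuming plain-text input (this is not the library's actual profile builder; the class and method names here are made up for illustration):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: count character n-grams in plain training text.
// A real profile builder would also normalize case, filter rare grams, etc.
public class NgramCounter {

    public static Map<String, Integer> count(String text, int n) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        // Collapse runs of whitespace so grams don't span odd gaps.
        String s = text.replaceAll("\\s+", " ");
        for (int i = 0; i + n <= s.length(); i++) {
            freq.merge(s.substring(i, i + n), 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        // Same text, different n: re-running with n = 3 or n = 4 is all
        // that changes when experimenting with profile granularity.
        System.out.println(count("the theme", 3));
        System.out.println(count("the theme", 4));
    }
}
```

The point is that the expensive part is collecting and cleaning the corpus, not the counting, which is why shipping the original text matters.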

danielnaber commented 9 years ago

I think one project that contains the texts for all languages would be okay, but it depends on how large the data is. I like the idea of having the original data and a script that cleans it up, but I'm not sure how practical that is. For example, one might want to remove sentences with a lot of names, proper nouns, etc., and detecting those isn't trivial.

dennis97519 commented 9 years ago

Well... From the original language detection project wiki

  • Generate language profiles from Wikipedia abstract xml

From Nakatani Shuyo's tools wiki

This tool generates language profiles from Wikipedia abstract database files or plain text.

Wikipedia abstract database files can be retrieved from "Wikipedia Downloads" ( http://download.wikimedia.org/ ). They are named '(language code)wiki-(version)-abstract.xml' (e.g. 'enwiki-20101004-abstract.xml'). See also LanguageList for the language codes.

We can just download the database from Wikipedia (though the date and version will differ) if the profiles here were simply pulled as-is from the original project.

It also seems danielnaber used some tatoeba.org data for Esperanto :+1: I've used that site too, haha.