richtr / guessLanguage.js

A natural language detection library based on trigram statistical analysis for Node.js and the Web.

http://richtr.github.com/guessLanguage.js/

Source of language datasets #14

Open DonaldTsang opened 4 years ago

DonaldTsang commented 4 years ago

Where is the source text dataset for the n-grams of those 100 languages? I would like to see whether it differs from the UDHR text used in wooorm/franc#78, and whether it gives more accurate results.

Animenosekai commented 4 years ago

@DonaldTsang I really don’t know, since I’m not the dev, but isn’t it in _languageData.js?

Animenosekai commented 4 years ago

@DonaldTsang (inside the lib folder)

Animenosekai commented 4 years ago

@DonaldTsang But it’s weird, because not every language is in there, and the entries that are there are not written in the actual language (for example, the “fr” entry isn’t written in French, and I don’t understand what it says).

Animenosekai commented 4 years ago

@DonaldTsang The dev primarily used Unicode checking to determine the language, though.

DonaldTsang commented 4 years ago

@Animenosekai If it only uses Unicode checking, that would actually be really sweet, since that would be very useful for my goal of making language checking easier (which I hope I can reimplement in Python).
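
For illustration, a Unicode-only check could be sketched in Python roughly like this. This is a minimal sketch under assumptions, not guessLanguage.js's actual code: `dominant_script` is a hypothetical helper, and treating the first word of each character's Unicode name as its script is a rough heuristic (CJK ideographs, for example, come out as "CJK").

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Return the most frequent script label among the letters of `text`.

    Assumption: the first word of a character's Unicode name
    (e.g. "CYRILLIC" in "CYRILLIC SMALL LETTER DE") is used as the
    script label. This is a heuristic, not the Unicode Script property.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    if not counts:
        return None  # no letters at all
    return counts.most_common(1)[0][0]
```

A script alone only pins down the language when a single language dominates that script (Hangul, Thai, Greek). For Latin or Cyrillic text, something more, such as n-gram statistics, is still needed.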

DonaldTsang commented 4 years ago

The contents of _languageData.js look like n-gram data.
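
For reference, trigram data of this kind is typically used along the following lines. This is a minimal Python sketch, assuming each language is stored as a rank-ordered list of trigrams and compared with the "out-of-place" rank distance from Cavnar and Trenkle's n-gram text categorization. Whether guessLanguage.js scores matches exactly this way is not confirmed here, and `trigram_profile`, `profile_distance`, and `guess` are hypothetical names.

```python
from collections import Counter


def trigram_profile(text, top=300):
    """Return the `top` most frequent character trigrams, most frequent first."""
    # Lowercase, collapse whitespace, and pad so word boundaries form trigrams.
    cleaned = " " + " ".join(text.lower().split()) + " "
    counts = Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))
    return [trigram for trigram, _ in counts.most_common(top)]


def profile_distance(sample, reference, max_penalty=300):
    """Sum of rank differences; trigrams missing from `reference` cost `max_penalty`."""
    ref_rank = {trigram: rank for rank, trigram in enumerate(reference)}
    return sum(
        abs(rank - ref_rank[trigram]) if trigram in ref_rank else max_penalty
        for rank, trigram in enumerate(sample)
    )


def guess(text, profiles):
    """Return the key of the reference profile closest to `text`."""
    sample = trigram_profile(text)
    return min(profiles, key=lambda lang: profile_distance(sample, profiles[lang]))
```

Here `profiles` would be a dict mapping a language code to a profile built from some training text, which is exactly the part the question above is about: the quality of the guess depends on what text the reference profiles were built from.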

Animenosekai commented 4 years ago

I don't think it uses only Unicode checking, but why don't you open guessLanguage.js? It should contain everything you want to know.