jawah / charset_normalizer

Truly universal encoding detector in pure Python
https://charset-normalizer.readthedocs.io/en/latest/
MIT License

[Proposal] Increase language coverage #26

Closed: DonaldTsang closed this issue 3 years ago

DonaldTsang commented 4 years ago

Is your feature request related to a problem? Please describe.
Not a problem, more of an enhancement.

Describe the solution you'd like
Add other languages from other repos, assuming they use the Unicode codepoint + n-grams model (see the sketch after this template).

Describe alternatives you've considered
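A minimal sketch of the "Unicode codepoint + n-grams" model referenced above: score a text against per-language character-trigram frequency tables. The profiles and counts below are hypothetical stand-ins, not real training data.

```python
from collections import Counter

# Hypothetical per-language trigram frequency profiles; real ones would be
# trained on a large corpus for each language.
LANGUAGE_PROFILES = {
    "English": Counter({"the": 120, "he ": 95, "ing": 80}),
    "French": Counter({"es ": 110, "de ": 90, "ent": 75}),
}

def trigrams(text):
    """Yield overlapping character trigrams."""
    for i in range(len(text) - 2):
        yield text[i:i + 3]

def guess_language(text):
    """Return the language whose profile overlaps the text's trigrams most."""
    observed = Counter(trigrams(text.lower()))

    def overlap(lang):
        profile = LANGUAGE_PROFILES[lang]
        return sum(n for tri, n in observed.items() if tri in profile)

    return max(LANGUAGE_PROFILES, key=overlap)

print(guess_language("The quick brown fox"))  # -> "English"
```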

DonaldTsang commented 4 years ago

https://github.com/Mimino666/langdetect/issues/67

Ousret commented 4 years ago

Nice! Will look into it soon enough. Thanks.

DonaldTsang commented 4 years ago

@Ousret

1. There is a major problem with franc's trigram-based data: it all uses the UDHR as the base dataset, which is very weak in nature. https://github.com/wooorm/franc/issues/78
2. There seems to be a repeating pattern with Google's CLD and CLD2: they are the most commonly cited. The reason I am avoiding CLD3 and similar is that they overuse machine learning.
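For context on the franc point above: a franc-style profile is just the ranked list of the most frequent character trigrams in a reference corpus, so training it only on the UDHR (on the order of two thousand words per language) yields a small and noisy profile. A minimal sketch of profile construction; the excerpt is a stand-in corpus.

```python
from collections import Counter

def build_profile(corpus, size=300):
    """Rank the `size` most frequent character trigrams of the corpus."""
    counts = Counter(corpus[i:i + 3] for i in range(len(corpus) - 2))
    return [tri for tri, _ in counts.most_common(size)]

# Stand-in corpus: a UDHR excerpt, far too short for a reliable profile.
udhr_excerpt = (
    "All human beings are born free and equal in dignity and rights. "
    "They are endowed with reason and conscience and should act towards "
    "one another in a spirit of brotherhood."
)
print(build_profile(udhr_excerpt.lower())[:10])
```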
DonaldTsang commented 4 years ago

Okay, so here is something unique: https://github.com/pemistahl/lingua#4--how-good-is-it-top- Also, these three use Wikipedia as their base:

There are also others that use http://wortschatz.uni-leipzig.de/en/download/; even more exotic, https://github.com/google/corpuscrawler; and, with tweets, https://github.com/mitjat/langid_eval

https://github.com/davidjurgens/equilid#model-details is even more comprehensive. But https://github.com/landrok/language-detector basically has a hidden dataset.

DonaldTsang commented 4 years ago

Extra thing to note: fastText has its own dataset at https://fasttext.cc/docs/en/dataset.html, used by https://github.com/iamaziz/language-detection-fastText (Python) and https://github.com/rse/fasttext-lid (JS)
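A hedged usage sketch of the off-the-shelf fastText language identifier, assuming `pip install fasttext` and the pretrained `lid.176.ftz` model downloaded from https://fasttext.cc/docs/en/language-identification.html:

```python
import fasttext

# Load the pretrained 176-language identification model (downloaded separately).
model = fasttext.load_model("lid.176.ftz")

# predict() returns parallel tuples: labels like "__label__fr" plus probabilities.
labels, probabilities = model.predict("Ceci est un test de détection de langue.", k=3)
for label, probability in zip(labels, probabilities):
    print(label.replace("__label__", ""), round(float(probability), 3))
```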

DonaldTsang commented 4 years ago

CLD3 uses machine learning instead of simpler techniques for language detection (and it is made by the same people as CLD1 and CLD2, Google of all places)

DonaldTsang commented 4 years ago

@Ousret when you have free time, should we start reading these two dozen repos one by one, analyse why some of them claim to be the best (https://github.com/pemistahl/lingua, I am looking at you), and attempt to find the best dataset and model for achieving the best results?

Ousret commented 4 years ago

I have already started. Will be back.

DonaldTsang commented 4 years ago

@Ousret apologies, but I updated the list last week (it has now been finalized) to make sure it covers most of the tools across multiple programming languages and techniques, if you don't mind. Hope you can use the current list as a reference.

Ousret commented 4 years ago

I will. Thank you. 🙏

DonaldTsang commented 4 years ago

Hi, how are you? Hope you are doing well. I am planning to list all the languages supported by most of these libraries in a spreadsheet, with alphabet information and language type, for ease of comparison: https://docs.google.com/spreadsheets/d/1G3VnzSifG-Vox5NPOzBXeS7GJbBxBa1iSuczjGT94AI/edit?usp=sharing

DonaldTsang commented 4 years ago

There are also other datasets like

Ousret commented 3 years ago

Hi,

I tried many ways to increase the language detection coverage, but each is costly one way or another, most of the time performance-wise. It is unlikely that this package will change its main method of language detection any time soon. All the research you have done was very helpful, thanks.
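For anyone landing on this issue later, a minimal sketch of where language detection already surfaces in this package, assuming the current charset_normalizer API (`from_bytes`, `best()`, and the `encoding`/`language` attributes):

```python
from charset_normalizer import from_bytes

# Cyrillic sample encoded with a legacy single-byte code page.
payload = "Всеобщая декларация прав человека".encode("cp1251")

best_guess = from_bytes(payload).best()
if best_guess is not None:
    print(best_guess.encoding)  # e.g. "cp1251" (or a compatible superset)
    print(best_guess.language)  # best-guess language inferred during detection
```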