Closed: DonaldTsang closed this issue 3 years ago.
Nice! Will look into it soon enough. Thanks.
@Ousret There is a major problem with franc's trigram-based data: it is built entirely from the UDHR, which is a very weak base dataset. https://github.com/wooorm/franc/issues/78
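For context, here is a minimal sketch of what "trigram-based data" means in the franc/TextCat family of detectors: each language gets a ranked profile of its most frequent character trigrams, and detection picks the profile closest to the input's profile. All names and the tiny sample profiles below are made up for illustration; franc's real profiles are exactly the UDHR-derived ones being criticized above.

```python
# Hypothetical sketch of trigram-profile language detection (franc/TextCat style).
from collections import Counter

def trigram_profile(text: str, top: int = 300) -> list[str]:
    """Return the most frequent character trigrams of a text, most common first."""
    padded = f"  {text.lower()}  "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    return [gram for gram, _ in counts.most_common(top)]

def out_of_place_distance(sample: list[str], reference: list[str]) -> int:
    """Cattell/Trenkle 'out-of-place' measure: sum of rank differences."""
    ref_rank = {gram: rank for rank, gram in enumerate(reference)}
    max_penalty = len(reference)
    return sum(abs(rank - ref_rank.get(gram, max_penalty))
               for rank, gram in enumerate(sample))

def detect(text: str, profiles: dict[str, list[str]]) -> str:
    """Pick the language whose reference profile is closest to the text's profile."""
    sample = trigram_profile(text)
    return min(profiles, key=lambda lang: out_of_place_distance(sample, profiles[lang]))

# Real profiles are trained on a large per-language corpus; these are toy examples.
profiles = {
    "eng": trigram_profile("the quick brown fox jumps over the lazy dog and the cat"),
    "fra": trigram_profile("le renard brun rapide saute par-dessus le chien paresseux"),
}
print(detect("the dog and the fox", profiles))  # -> "eng"
```

The quality of such a detector depends almost entirely on the corpus the reference profiles were trained on, which is why the UDHR-only base is a problem.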
Okay, so here is something unique: https://github.com/pemistahl/lingua#4--how-good-is-it-top- Also, these three use Wikipedia as their base:
There are also others that use the Leipzig corpora (http://wortschatz.uni-leipzig.de/en/download/), more exotic sources like https://github.com/google/corpuscrawler, and even tweets: https://github.com/mitjat/langid_eval
https://github.com/davidjurgens/equilid#model-details is even more comprehensive, but https://github.com/landrok/language-detector basically has a hidden (undocumented) dataset.
One extra thing to note: fastText has its own dataset at https://fasttext.cc/docs/en/dataset.html, which is used by https://github.com/iamaziz/language-detection-fastText (Python) and https://github.com/rse/fasttext-lid (JS).
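For reference, trying the fastText route would look roughly like this with the pre-trained lid.176 model (the model file name/path here is an assumption; it is downloadable from fasttext.cc):

```python
# Rough sketch of language identification with fastText's pre-trained model.
import fasttext

# lid.176.ftz is the compressed pre-trained language identification model.
model = fasttext.load_model("lid.176.ftz")

# predict() returns parallel tuples of labels ("__label__xx") and probabilities.
labels, probs = model.predict("Ceci est un petit test de détection de langue.", k=3)
for label, prob in zip(labels, probs):
    print(label.replace("__label__", ""), round(float(prob), 3))
```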
CLD3 uses machine learning (a small neural network) instead of simpler techniques for language detection, and it is made by the same people behind CLD1 and CLD2, Google of all places.
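If we wanted to compare against CLD3, the gcld3 Python binding is one way in; a minimal sketch, assuming gcld3 is installed and following its documented usage:

```python
# Minimal sketch of CLD3 via the gcld3 Python binding.
import gcld3

detector = gcld3.NNetLanguageIdentifier(min_num_bytes=0, max_num_bytes=1000)
result = detector.FindLanguage(text="Это предложение написано на русском языке.")
print(result.language, result.probability, result.is_reliable)
```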
@Ousret when you have free time, should we start reading these two dozen repos one by one, analyse why some of them claim to be the best (https://github.com/pemistahl/lingua, I am looking at you), and try to find the dataset and model that achieve the best results?
I have already started. Will be back.
@Ousret apologies, but I updated the list last week (it has now been finalized) to make sure it covers most of the tools across multiple programming languages and techniques, if you don't mind. Hope you can use the current list as a reference.
I will. Thank you. 🙏
Hi, how are you? Hope you are doing well. I am planning to list all the languages supported by most of these libraries in a spreadsheet, with alphabet information and language type, for ease of comparison: https://docs.google.com/spreadsheets/d/1G3VnzSifG-Vox5NPOzBXeS7GJbBxBa1iSuczjGT94AI/edit?usp=sharing
There are also other datasets like
Hi,
I tried many ways to increase the language detection coverage, but it is costly one way or another, most of the time performance-wise. It is unlikely that this package will change its main method of language detection anytime soon. All the research you have done was very helpful, thanks.
Is your feature request related to a problem? Please describe. Not a problem, more of an enhancement.
Describe the solution you'd like Add other languages from other repos, assuming they use the Unicode code point + n-gram model (a rough sketch of that model is given after the links below).
Describe alternatives you've considered
https://github.com/Imaginatio/langdetect/tree/master/src/main/resources/profiles
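To make the "Unicode code point + n-gram" idea concrete, here is a hypothetical illustration: first narrow the candidate languages by the dominant Unicode script of the code points, then discriminate within that script using character n-gram frequencies. Every name and the tiny script-to-language table below are invented for the example and are not part of any of the libraries linked above.

```python
# Hypothetical "Unicode code point + n-grams" two-stage detection sketch.
import unicodedata
from collections import Counter

# Which languages are plausible for a given dominant script (illustrative only).
LANGUAGES_BY_SCRIPT = {
    "LATIN": ["English", "French", "Spanish"],
    "CYRILLIC": ["Russian", "Bulgarian"],
    "CJK": ["Chinese", "Japanese"],
}

def dominant_script(text: str) -> str:
    """Guess the dominant script from Unicode character names (crude heuristic)."""
    scripts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            scripts[name.split(" ")[0]] += 1  # "LATIN SMALL LETTER A" -> "LATIN"
    return scripts.most_common(1)[0][0] if scripts else "UNKNOWN"

def bigrams(text: str) -> Counter:
    """Character bigram frequencies, used to rank languages within one script."""
    t = text.lower()
    return Counter(t[i:i + 2] for i in range(len(t) - 1))

text = "Съешь же ещё этих мягких французских булок"
candidates = LANGUAGES_BY_SCRIPT.get(dominant_script(text), [])
print(candidates)  # the script narrows it down to ["Russian", "Bulgarian"];
# a real detector would then compare bigrams(text) against per-language profiles.
```

The point of the two stages is that the code point ranges cheaply eliminate most languages, so the costlier n-gram comparison only runs over a handful of candidates.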