jawah / charset_normalizer

Truly universal encoding detector in pure Python
https://charset-normalizer.readthedocs.io/en/latest/
MIT License

[Proposal] Increase language coverage #26

Closed: DonaldTsang closed this issue 3 years ago

DonaldTsang commented 4 years ago

Is your feature request related to a problem? Please describe.
Not a problem, more of an enhancement.

Describe the solution you'd like
Add other languages from other repos, assuming they use the Unicode codepoint + n-grams model (see the sketch after this template).

Describe alternatives you've considered
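A minimal sketch of the "Unicode codepoint + n-grams" model referenced above: score a text against per-language character-trigram frequency tables. The profiles and counts below are hypothetical stand-ins, not real training data.

```python
from collections import Counter

# Hypothetical per-language trigram frequency profiles; real ones would be
# trained on a large corpus for each language.
LANGUAGE_PROFILES = {
    "English": Counter({"the": 120, "he ": 95, "ing": 80}),
    "French": Counter({"es ": 110, "de ": 90, "ent": 75}),
}

def trigrams(text):
    """Yield overlapping character trigrams."""
    for i in range(len(text) - 2):
        yield text[i:i + 3]

def guess_language(text):
    """Return the language whose profile overlaps the text's trigrams most."""
    observed = Counter(trigrams(text.lower()))

    def overlap(lang):
        profile = LANGUAGE_PROFILES[lang]
        return sum(n for tri, n in observed.items() if tri in profile)

    return max(LANGUAGE_PROFILES, key=overlap)

print(guess_language("The quick brown fox"))  # -> "English"
```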

DonaldTsang commented 4 years ago

https://github.com/Mimino666/langdetect/issues/67

Ousret commented 4 years ago

Nice! Will look into it soon enough. Thanks.

DonaldTsang commented 4 years ago

@Ousret

1. There is a major problem with franc's trigram-based data: it all uses the UDHR as the base dataset, which is very weak in nature. https://github.com/wooorm/franc/issues/78
2. There seems to be a repeating pattern with Google's CLD and CLD2: they are the most commonly cited. The reason I am avoiding CLD3 and similar is that they overuse machine learning.
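For context on the franc point above: a franc-style profile is just the ranked list of the most frequent character trigrams in a reference corpus, so training it only on the UDHR (on the order of two thousand words per language) yields a small and noisy profile. A minimal sketch of profile construction; the excerpt is a stand-in corpus.

```python
from collections import Counter

def build_profile(corpus, size=300):
    """Rank the `size` most frequent character trigrams of the corpus."""
    counts = Counter(corpus[i:i + 3] for i in range(len(corpus) - 2))
    return [tri for tri, _ in counts.most_common(size)]

# Stand-in corpus: a UDHR excerpt, far too short for a reliable profile.
udhr_excerpt = (
    "All human beings are born free and equal in dignity and rights. "
    "They are endowed with reason and conscience and should act towards "
    "one another in a spirit of brotherhood."
)
print(build_profile(udhr_excerpt.lower())[:10])
```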
DonaldTsang commented 4 years ago

Okay, so here is something unique: https://github.com/pemistahl/lingua#4--how-good-is-it-top- Also, these three use Wikipedia as their base:

There are also others that use http://wortschatz.uni-leipzig.de/en/download/; even more exotic, https://github.com/google/corpuscrawler; and, with tweets, https://github.com/mitjat/langid_eval

https://github.com/davidjurgens/equilid#model-details is even more comprehensive. But https://github.com/landrok/language-detector basically has a hidden dataset.

DonaldTsang commented 4 years ago

Extra thing to note: fastText has its own dataset at https://fasttext.cc/docs/en/dataset.html, used by https://github.com/iamaziz/language-detection-fastText (Python) and https://github.com/rse/fasttext-lid (JS)
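A hedged usage sketch of the off-the-shelf fastText language identifier, assuming `pip install fasttext` and the pretrained `lid.176.ftz` model downloaded from https://fasttext.cc/docs/en/language-identification.html:

```python
import fasttext

# Load the pretrained 176-language identification model (downloaded separately).
model = fasttext.load_model("lid.176.ftz")

# predict() returns parallel tuples: labels like "__label__fr" plus probabilities.
labels, probabilities = model.predict("Ceci est un test de détection de langue.", k=3)
for label, probability in zip(labels, probabilities):
    print(label.replace("__label__", ""), round(float(probability), 3))
```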

DonaldTsang commented 4 years ago

CLD3 uses machine learning instead of simpler techniques for language detection (and it is made by the same people as CLD1 and CLD2, Google of all places)

DonaldTsang commented 4 years ago

@Ousret when you have free time, should we start reading these two dozen repos one by one, analyse why some of them claim to be the best (https://github.com/pemistahl/lingua, I am looking at you), and attempt to find the best dataset and model for achieving the best results?

Ousret commented 4 years ago

I have already started. Will be back.

DonaldTsang commented 4 years ago

@Ousret apologies, but I updated the list last week (it has now been finalized) to make sure it covers most of the tools across multiple programming languages and techniques, if you don't mind. Hope you can use the current list as a reference.

Ousret commented 4 years ago

I will. Thank you. 🙏

DonaldTsang commented 4 years ago

Hi, how are you? Hope you are doing well. I am planning to list all the languages supported by most of these libraries in a spreadsheet, with alphabet information and language type, for ease of comparison: https://docs.google.com/spreadsheets/d/1G3VnzSifG-Vox5NPOzBXeS7GJbBxBa1iSuczjGT94AI/edit?usp=sharing

DonaldTsang commented 4 years ago

There are also other datasets like

Ousret commented 3 years ago

Hi,

I tried many ways to increase the language detection coverage, but each is costly one way or another, most of the time performance-wise. It is unlikely that this package will change its main method of language detection any time soon. All the research you have done was very helpful, thanks.
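For anyone landing on this issue later, a minimal sketch of where language detection already surfaces in this package, assuming the current charset_normalizer API (`from_bytes`, `best()`, and the `encoding`/`language` attributes):

```python
from charset_normalizer import from_bytes

# Cyrillic sample encoded with a legacy single-byte code page.
payload = "Всеобщая декларация прав человека".encode("cp1251")

best_guess = from_bytes(payload).best()
if best_guess is not None:
    print(best_guess.encoding)  # e.g. "cp1251" (or a compatible superset)
    print(best_guess.language)  # best-guess language inferred during detection
```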