Use a more precise language identification model to clean datasets

marco-c commented 11 months ago

For example, ParaCrawl was built using CLD2, and CCMatrix using fastText. These models don't always work well, particularly on short sentences (see https://github.com/pemistahl/lingua-py#4-how-good-is-it).

We could consider using a more precise model, even if it is more expensive. In addition, we could try a dictionary lookup: for short sentences consisting of one or two words, check whether the words are present in the dictionary for a language. We could even build a model that takes the output of another model and the dictionary-lookup result as features.
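The dictionary-lookup idea could be sketched roughly as follows. This is a minimal illustration, not how any of the tools mentioned in this issue work: the word lists, the `dictionary_lookup` and `identify` names, and the two-word threshold are all assumptions made for the example, and `model` stands for whatever statistical identifier is already in use.

```python
from typing import Callable, Dict, Optional, Set

# Hypothetical toy word lists. A real setup would load full dictionaries
# for each language instead.
DICTIONARIES: Dict[str, Set[str]] = {
    "it": {"ciao", "grazie", "casa", "buongiorno"},
    "es": {"hola", "gracias", "casa", "buenos"},
    "en": {"hello", "thanks", "house", "good"},
}


def dictionary_lookup(
    text: str, dictionaries: Dict[str, Set[str]] = DICTIONARIES
) -> Optional[str]:
    """Return the single language whose dictionary contains every word, else None."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return None
    matches = [
        lang
        for lang, vocab in dictionaries.items()
        if all(w in vocab for w in words)
    ]
    # Only trust the dictionary when exactly one language matches.
    return matches[0] if len(matches) == 1 else None


def identify(text: str, model: Callable[[str], str], max_short_words: int = 2) -> str:
    """Try the dictionary for short inputs, otherwise defer to the model."""
    if len(text.split()) <= max_short_words:
        lang = dictionary_lookup(text)
        if lang is not None:
            return lang
    return model(text)
```

With these toy lists, a word such as "casa" appears in both the Italian and Spanish dictionaries, so the lookup is ambiguous and the decision falls back to the model.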

marco-c commented 11 months ago

Some alternatives we could consider: https://github.com/pemistahl/lingua-py and https://github.com/mbanon/fastspell (the latter implements the dictionary idea).

marco-c commented 11 months ago

I added fastspell to lingua-py's benchmarks (https://github.com/pemistahl/lingua-py/pull/190) and ran it for Italian, configuring "it" as similar to "es", "fr", "pt", and "en". The results are very good.

fastText:

>>> Accuracy on average: 88.77%

>> Detection of 1000 single words (average length: 8 chars)
Accuracy: 74.10%
Erroneously classified as English: 9.90%, Spanish: 4.70%, Portuguese: 3.20%, French: 1.70%, German: 1.30%, Unknown: 1.00%, Esperanto: 0.60%, Turkish: 0.50%, Dutch: 0.40%, Finnish: 0.40%, Latin: 0.40%, Hungarian: 0.30%, Polish: 0.30%, Romanian: 0.30%, Croatian: 0.20%, Indonesian: 0.20%, Swedish: 0.20%, Basque: 0.10%, Czech: 0.10%, Russian: 0.10%

>> Detection of 1000 word pairs (average length: 16 chars)
Accuracy: 92.40%
Erroneously classified as English: 3.30%, Portuguese: 1.20%, Spanish: 0.90%, French: 0.30%, German: 0.30%, Finnish: 0.20%, Romanian: 0.20%, Swedish: 0.20%, Turkish: 0.20%, Basque: 0.10%, Croatian: 0.10%, Dutch: 0.10%, Esperanto: 0.10%, Latin: 0.10%, Persian: 0.10%, Slovene: 0.10%, Unknown: 0.10%

>> Detection of 1000 sentences (average length: 123 chars)
Accuracy: 99.80%
Erroneously classified as English: 0.10%, French: 0.10%

fastSpell conservative:

>>> Accuracy on average: 91.97%

>> Detection of 1000 single words (average length: 8 chars)
Accuracy: 84.10%
Erroneously classified as Unknown: 8.70%, German: 1.30%, English: 0.60%, Esperanto: 0.60%, Portuguese: 0.50%, Turkish: 0.50%, Dutch: 0.40%, Finnish: 0.40%, Latin: 0.40%, Hungarian: 0.30%, Polish: 0.30%, Romanian: 0.30%, Spanish: 0.30%, Bokmal: 0.20%, Croatian: 0.20%, Indonesian: 0.20%, Swedish: 0.20%, Basque: 0.10%, Czech: 0.10%, French: 0.10%, Russian: 0.10%, Serbian: 0.10%

>> Detection of 1000 word pairs (average length: 16 chars)
Accuracy: 92.00%
Erroneously classified as Unknown: 5.80%, Portuguese: 0.40%, German: 0.30%, Finnish: 0.20%, Romanian: 0.20%, Swedish: 0.20%, Turkish: 0.20%, Basque: 0.10%, Croatian: 0.10%, Dutch: 0.10%, Esperanto: 0.10%, Latin: 0.10%, Persian: 0.10%, Slovene: 0.10%

>> Detection of 1000 sentences (average length: 123 chars)
Accuracy: 99.80%
Erroneously classified as Unknown: 0.20%

fastSpell aggressive:

>>> Accuracy on average: 95.00%

>> Detection of 1000 single words (average length: 8 chars)
Accuracy: 87.90%
Erroneously classified as English: 3.20%, German: 1.30%, Portuguese: 1.00%, Spanish: 0.90%, Unknown: 0.70%, Esperanto: 0.60%, French: 0.60%, Turkish: 0.50%, Dutch: 0.40%, Finnish: 0.40%, Latin: 0.40%, Hungarian: 0.30%, Polish: 0.30%, Romanian: 0.30%, Bokmal: 0.20%, Croatian: 0.20%, Indonesian: 0.20%, Swedish: 0.20%, Basque: 0.10%, Czech: 0.10%, Russian: 0.10%, Serbian: 0.10%

>> Detection of 1000 word pairs (average length: 16 chars)
Accuracy: 97.10%
Erroneously classified as Portuguese: 0.50%, German: 0.30%, Spanish: 0.30%, Finnish: 0.20%, Romanian: 0.20%, Swedish: 0.20%, Turkish: 0.20%, Basque: 0.10%, Croatian: 0.10%, Dutch: 0.10%, English: 0.10%, Esperanto: 0.10%, French: 0.10%, Latin: 0.10%, Persian: 0.10%, Slovene: 0.10%, Unknown: 0.10%

>> Detection of 1000 sentences (average length: 123 chars)
Accuracy: 100.00%
Erroneously classified as

The results are especially good compared to CLD2, which ParaCrawl used:

>>> Accuracy on average: 44.27%

>> Detection of 1000 single words (average length: 8 chars)
Accuracy: 6.90%
Erroneously classified as Unknown: 91.30%, English: 1.00%, Indonesian: 0.30%, Croatian: 0.10%, Danish: 0.10%, Latin: 0.10%, Portuguese: 0.10%, Spanish: 0.10%

>> Detection of 1000 word pairs (average length: 16 chars)
Accuracy: 32.50%
Erroneously classified as Unknown: 62.40%, English: 3.30%, Portuguese: 0.70%, Swedish: 0.20%, Basque: 0.10%, Danish: 0.10%, Esperanto: 0.10%, Finnish: 0.10%, Indonesian: 0.10%, Latin: 0.10%, Romanian: 0.10%, Slovak: 0.10%, Spanish: 0.10%

>> Detection of 1000 sentences (average length: 123 chars)
Accuracy: 93.40%
Erroneously classified as Unknown: 6.10%, English: 0.50%
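To compare the four runs at a glance, the per-category accuracies can be tabulated in a few lines. The numbers below are copied from the output above. The "Accuracy on average" figure appears to be the unweighted mean of the three categories; that is an inference from the numbers rather than something stated by the benchmark, and `average_accuracy` is a name made up for this sketch.

```python
# Per-category accuracies (%) copied from the benchmark output:
# (single words, word pairs, sentences)
RESULTS = {
    "fastText": (74.10, 92.40, 99.80),
    "fastSpell conservative": (84.10, 92.00, 99.80),
    "fastSpell aggressive": (87.90, 97.10, 100.00),
    "CLD2": (6.90, 32.50, 93.40),
}


def average_accuracy(accuracies):
    """Unweighted mean of the category accuracies, rounded to two decimals."""
    return round(sum(accuracies) / len(accuracies), 2)


# Print the runs from best to worst average accuracy.
for name, accs in sorted(
    RESULTS.items(), key=lambda item: average_accuracy(item[1]), reverse=True
):
    print(f"{name}: {average_accuracy(accs):.2f}%")
```

The computed means match the reported averages (88.77%, 91.97%, 95.00%, and 44.27%), and the gap between the tools is concentrated in the single-word and word-pair categories.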

mozilla / firefox-translations-training

Use a more precise language identification model to clean datasets #248