pemistahl / lingua-py

The most accurate natural language detection library for Python, suitable for short text and mixed-language text
Apache License 2.0

Strategy to split text by language #134

Closed mirix closed 1 year ago

mirix commented 1 year ago

Hello,

I would like to split multilingual texts by language.

I have a database of emails, some of which are multilingual. There are some 40 languages and, a priori, one does not know which languages may be present in each email.

Only three languages are significant enough to warrant a machine learning approach.

The following is relatively accurate, but I would like to know whether there is a better strategy to optimise accuracy and/or speed:

from lingua import Language, LanguageDetectorBuilder
from collections import defaultdict

main_langs = ['ENGLISH', 'FRENCH', 'GERMAN']

sentence = "Parlez-vous français? " + \
            "Ich spreche Französisch nur ein bisschen. " + \
            "Desde luego, no cabe ninguna duda. " + \
            "Oui, merci. Je vous en prie monsieur. " + \
            "A little bit is better than nothing. That is wonderful, isn't it. " + \
            "Indeed, I completely agree. " + \
            "Acho que sim, muito obrigado. " + \
            "Ja, das ist wunderbar. Danke." + \
            "To summarise, this is complete nonsense. "

detector_global = LanguageDetectorBuilder.from_all_spoken_languages().build()

# Step 1: brute-force pass over all spoken languages to find which ones occur
languages = []
for result in detector_global.detect_multiple_languages_of(sentence):
    languages.append(result.language)

languages = list(set(languages))

# Step 2: build a detector restricted to the languages found above
detector_local = LanguageDetectorBuilder.from_languages(*languages).build()

# Classify each span with the restricted detector; non-main languages go under OTHER
lang_dict = defaultdict(list)
for result in detector_local.detect_multiple_languages_of(sentence):
    lang = result.language.name
    text = sentence[result.start_index:result.end_index]
    if lang in main_langs:
        lang_dict[lang].append(text)
    else:
        lang_dict['OTHER'].append(text)

print(lang_dict)

This gives:

{'FRENCH': ['Parlez-vous français? '], 'GERMAN': ['Ich spreche Französisch nur ein bisschen. ', 'ist wunderbar. Danke.To '], 'OTHER': ['Desde luego, no cabe ninguna duda. Oui, merci. Je vous ', 'en prie monsieur. A ', 'Acho que ', 'sim, muito obrigado. Ja, das '], 'ENGLISH': ["little bit is better than nothing. That is wonderful, isn't it. Indeed, I completely agree. ", 'summarise, this is complete nonsense. ']}

This is not perfect, but it may be fair enough considering the difficulty of the exercise (shuffled multilingual short sentences).

The strategy here is as follows:

  1. First, a brute-force language detection pass over all spoken languages.
  2. Then, text classification restricted to the subset of languages detected in step one.

This seems to work somewhat better than a single brute-force step (would that make sense?) and significantly better than a targeted approach where the only models considered are English, French and German.
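
For reference, the targeted approach I am comparing against is simply a detector restricted to the three main languages from the start. A minimal sketch (just the obvious construction, reusing the sentence variable defined above):

from lingua import Language, LanguageDetectorBuilder

# Only the three languages of interest are ever considered, so every span is
# forced into ENGLISH, FRENCH or GERMAN even when it is actually Spanish or Portuguese
detector_targeted = LanguageDetectorBuilder.from_languages(
    Language.ENGLISH, Language.FRENCH, Language.GERMAN
).build()

for result in detector_targeted.detect_multiple_languages_of(sentence):
    print(result.language.name, repr(sentence[result.start_index:result.end_index]))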

Any thoughts or ideas?

mirix commented 1 year ago

Finally, I am using a different strategy: first sentence tokenisation, then language recognition for each sentence, all with Stanza. It is not perfect, but it is accurate, fast and low-code.
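
Roughly, it looks like this. A minimal sketch, assuming an English tokenizer is good enough for sentence splitting and using Stanza's multilingual langid pipeline for per-sentence language identification (the model choices are assumptions, not a recommendation):

import stanza
from collections import defaultdict

# Download the models once (tokenizer + language identification)
stanza.download('en', processors='tokenize')
stanza.download('multilingual')

# Sentence tokenisation (assumption: an English tokenizer splits these
# European-language sentences reasonably well)
tokenizer = stanza.Pipeline(lang='en', processors='tokenize')

# Language identification works per document, so each sentence is wrapped
# in its own Document before being passed to the langid pipeline
langid = stanza.Pipeline(lang='multilingual', processors='langid')

text = ("Parlez-vous français? Ich spreche Französisch nur ein bisschen. "
        "A little bit is better than nothing.")

sentences = [s.text for s in tokenizer(text).sentences]
docs = [stanza.Document([], text=s) for s in sentences]
langid(docs)

lang_dict = defaultdict(list)
for doc in docs:
    lang_dict[doc.lang].append(doc.text)

print(dict(lang_dict))

The per-sentence granularity also avoids the mid-sentence splits that detect_multiple_languages_of sometimes produces on shuffled short sentences, as seen in the output above.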