pemistahl / lingua-py

The most accurate natural language detection library for Python, suitable for short text and mixed-language text
Apache License 2.0

detect_multiple_languages_of is very slow #108

Closed jordimas closed 1 year ago

jordimas commented 1 year ago

Using version 1.3.1

On a text of 3.5K (31 lines), detect_multiple_languages_of takes 26.56 seconds on my machine, while detect_language_of takes only 1.68 seconds.

26 seconds to analyse 3.5K of text (roughly 7 seconds per 1K) makes the detect_multiple_languages_of method unsuitable for processing large corpora.

Code used for the benchmark:


from lingua import LanguageDetectorBuilder
import datetime

with open('text.txt') as fh:
    text = fh.read()

detector = LanguageDetectorBuilder.from_all_languages().build()

# Time single-language detection
start_time = datetime.datetime.now()
result = detector.detect_language_of(text)
print('Time used for detect_language_of: {0}'.format(datetime.datetime.now() - start_time))
print(result.iso_code_639_1)

# Time multi-language detection on the same text
start_time = datetime.datetime.now()
results = detector.detect_multiple_languages_of(text)
print('Time used for detect_multiple_languages_of: {0}'.format(datetime.datetime.now() - start_time))
for result in results:
    print(result)
    print(f"** {result.language.name}")
jordimas commented 1 year ago

Text file used in the example: text.txt

pemistahl commented 1 year ago
  1. The logic required to detect multiple languages is much more expensive than detecting a single language only. I will surely try to optimize the algorithm in later releases.
  2. The current implementation is pure Python, and Python is slow. I will try to compile parts of the library to native code so that it runs faster. If you know how to program in Go, you can also try the Go version of Lingua, which performs much faster than the Python one. It contains the same implementation for detecting multiple languages.
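Until the algorithm is optimized, one possible workaround is to exploit the speed gap the benchmark shows: split the text into lines and run the much faster detect_language_of on each line, merging consecutive lines that share a language. The sketch below keeps the chunking logic independent of lingua, so `detect` is any callable returning a language label (with lingua, it could be `lambda s: detector.detect_language_of(s)`); the toy detector at the end is a hypothetical stand-in, not part of the library.

```python
def detect_per_line(text, detect):
    """Group consecutive non-blank lines by detected language.

    `detect` is any callable mapping a string to a language label,
    e.g. a lingua detector's detect_language_of.
    Returns a list of (language, text_chunk) pairs.
    """
    spans = []  # list of (language, [lines])
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        lang = detect(line)
        if spans and spans[-1][0] == lang:
            spans[-1][1].append(line)  # extend the current same-language run
        else:
            spans.append((lang, [line]))  # start a new run
    return [(lang, "\n".join(lines)) for lang, lines in spans]

# Demo with a toy detector standing in for a real lingua detector:
toy = lambda s: "ES" if "hola" in s.lower() else "EN"
print(detect_per_line("Hello there\nhola amigo\nhola again\nBye", toy))
# → [('EN', 'Hello there'), ('ES', 'hola amigo\nhola again'), ('EN', 'Bye')]
```

This trades accuracy for speed: language switches inside a single line are missed, which detect_multiple_languages_of would catch.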