pemistahl / lingua

The most accurate natural language detection library for Java and the JVM, suitable for long and short text alike
Apache License 2.0

Detection of long texts is not running parallelized #146

Closed Marcono1234 closed 2 years ago

Marcono1234 commented 2 years ago

Detection of long texts (or usage of withLowAccuracyMode()) only uses a single worker thread for language detection.

The reason is that one work task is submitted per ngram length. However, for long texts and when using withLowAccuracyMode(), only ngram length 3 is checked, so only a single work task is submitted. One solution might be to run the per-language computation in computeLanguageProbabilities as separate work tasks; however, that approach will probably only be worth it if the input is long enough (I have not verified this).
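
To illustrate the difference in task granularity, here is a minimal Java sketch. It is not Lingua's actual code: the class TaskGranularitySketch, the methods tasksPerNgramLength and tasksPerLanguage, and the toy scoreLanguage function are all hypothetical stand-ins. It only shows that splitting by ngram length yields one task when a single length is checked, while splitting by language yields one task per language.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class TaskGranularitySketch {

    // Hypothetical stand-in for the per-language computation; not Lingua's scoring.
    static double scoreLanguage(String language, String text, int ngramLength) {
        double sum = 0.0;
        for (int i = 0; i + ngramLength <= text.length(); i++) {
            String ngram = text.substring(i, i + ngramLength);
            sum += Math.floorMod(ngram.hashCode() ^ language.hashCode(), 1000) / 1000.0;
        }
        return sum;
    }

    // Coarse split: one task per ngram length, each scoring all languages.
    static List<Callable<double[]>> tasksPerNgramLength(
            List<Integer> ngramLengths, List<String> languages, String text) {
        List<Callable<double[]>> tasks = new ArrayList<>();
        for (int ngramLength : ngramLengths) {
            tasks.add(() -> {
                double[] scores = new double[languages.size()];
                for (int i = 0; i < scores.length; i++) {
                    scores[i] = scoreLanguage(languages.get(i), text, ngramLength);
                }
                return scores;
            });
        }
        return tasks;
    }

    // Fine split: one task per language for a single ngram length.
    static List<Callable<Double>> tasksPerLanguage(
            int ngramLength, List<String> languages, String text) {
        List<Callable<Double>> tasks = new ArrayList<>();
        for (String language : languages) {
            tasks.add(() -> scoreLanguage(language, text, ngramLength));
        }
        return tasks;
    }

    public static void main(String[] args) throws Exception {
        List<String> languages = List.of("ENGLISH", "GERMAN", "FRENCH", "SPANISH");
        String text = "some long input text that only needs trigrams to be checked";

        // Only ngram length 3 is checked, as described for long texts above.
        List<Callable<double[]>> coarse = tasksPerNgramLength(List.of(3), languages, text);
        List<Callable<Double>> fine = tasksPerLanguage(3, languages, text);

        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            pool.invokeAll(coarse); // a single task, so only one worker is busy
            pool.invokeAll(fine);   // one task per language, spread over the pool
        } finally {
            pool.shutdown();
        }

        System.out.println("tasks per ngram length: " + coarse.size()); // prints 1
        System.out.println("tasks per language: " + fine.size());       // prints 4
    }
}
```

Whether the finer split pays off depends on how expensive each per-language task is relative to the scheduling overhead, which this sketch does not measure.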

pemistahl commented 2 years ago

> that approach will probably only be worth it if the input is long enough

This is true, actually. I did some tests in the past and chose the current single-worker-thread solution because the parallelized version was not faster in low accuracy mode; it just produced more overhead. I will most probably leave it this way, so I'm closing this issue for now.