Configurable confidence to language detector

kkraune commented 2 years ago

Language detection on queries will "always" return English for three terms or fewer. Examples:

$ vespa query "select * from music where userInput(@text)" tracelevel=3 text='Eine kleine Nachtmusik' | grep 'Stemming with language'
                                "message": "Stemming with language=ENGLISH"

$ vespa query "select * from music where userInput(@text)" tracelevel=3 text='Eine kleine Nachtmusik schnell'  | grep 'Stemming with language'
                                "message": "Stemming with language=GERMAN"

https://github.com/vespa-engine/vespa/blob/master/linguistics/src/main/java/com/yahoo/language/opennlp/OpenNlpDetector.java#L88 :

var result = prediction.getConfidence() > 0.02 ? languagesByISO3.get(prediction.getLang()) : null;

Testing multiple languages and tracking the confidence in a debugger, I have never been able to get anything other than English (the default when the result is null) for three terms or fewer. With four terms, it is possible to get a confidence > 0.02.

Anecdotally, one can get good results even with low confidence, so this should be a tradeoff set by the app owner.

I propose we make the confidence configurable, maybe something a la https://github.com/vespa-engine/sample-apps/tree/master/examples/vespa-chinese-linguistics
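To illustrate the idea, here is a minimal sketch of a threshold passed in at construction time instead of being hard-coded. It is not the actual OpenNlpDetector code: the class name, the Prediction record, the language map and the confidence values are all hypothetical stand-ins, and how the value would be wired from application config is left open.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch: language resolution with a configurable confidence threshold. */
public class ConfigurableConfidenceSketch {

    /** Stand-in for the (ISO3 code, confidence) pair a detector returns. */
    record Prediction(String iso3, double confidence) {}

    /** The value hard-coded in the line quoted above, kept as the default. */
    static final double DEFAULT_THRESHOLD = 0.02;

    /** Tiny illustrative map; the real detector has its own lookup table. */
    private static final Map<String, String> LANGUAGES_BY_ISO3 =
            Map.of("deu", "GERMAN", "eng", "ENGLISH", "fra", "FRENCH");

    private final double threshold;

    ConfigurableConfidenceSketch(double threshold) {
        this.threshold = threshold;
    }

    /** Returns the language only if the confidence is above the configured threshold. */
    Optional<String> resolve(Prediction prediction) {
        if (prediction.confidence() > threshold) {
            return Optional.ofNullable(LANGUAGES_BY_ISO3.get(prediction.iso3()));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up confidence for a short German query, below the 0.02 default.
        Prediction shortQuery = new Prediction("deu", 0.015);

        var strict = new ConfigurableConfidenceSketch(DEFAULT_THRESHOLD);
        var lenient = new ConfigurableConfidenceSketch(0.01);

        // An empty result falls back to English, mirroring the behavior described above.
        System.out.println(strict.resolve(shortQuery).orElse("ENGLISH"));
        System.out.println(lenient.resolve(shortQuery).orElse("ENGLISH"));
    }
}
```

With the default threshold the made-up short query falls back to ENGLISH; with the lower threshold it resolves to GERMAN. Keeping 0.02 as the default means existing apps see no change unless they opt in.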

Also, add the notes above to https://docs.vespa.ai/en/linguistics.html: query language detection has its shortcomings for short texts, so a better approach is to always annotate terms or set the language using model.{language|locale}. But if this means running something like OpenNLP yourself, it might be easier to just use the Vespa builtin OpenNLP and configure it ...

johans1 commented 2 years ago

Make the confidence configurable; keep the default as it is now.

baldersheim commented 1 year ago

soon timed out

vespa-engine / vespa

Configurable confidence to language detector #24265