Detecting languages in queries will "always" return English for queries of 3 terms or fewer. Examples:
$ vespa query "select * from music where userInput(@text)" tracelevel=3 text='Eine kleine Nachtmusik' | grep 'Stemming with language'
"message": "Stemming with language=ENGLISH"
$ vespa query "select * from music where userInput(@text)" tracelevel=3 text='Eine kleine Nachtmusik schnell' | grep 'Stemming with language'
"message": "Stemming with language=GERMAN"
This comes from the confidence cutoff in https://github.com/vespa-engine/vespa/blob/master/linguistics/src/main/java/com/yahoo/language/opennlp/OpenNlpDetector.java#L88 :
var result = prediction.getConfidence() > 0.02 ? languagesByISO3.get(prediction.getLang()) : null;
Testing multiple languages, using a debugger to track confidence, I have never been able to get anything other than English (the default when null) for 3 terms or fewer - with 4 terms, it is possible to get a confidence > 0.02.
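For reference, the same confidence check can be reproduced outside Vespa with OpenNLP directly - a minimal sketch, assuming a local copy of the pre-trained langdetect-183.bin model (the path is just an example):

import java.io.File;
import opennlp.tools.langdetect.LanguageDetectorME;
import opennlp.tools.langdetect.LanguageDetectorModel;

public class ConfidenceCheck {
    public static void main(String[] args) throws Exception {
        // Load the pre-trained OpenNLP language detection model - adjust the path to your local copy
        var detector = new LanguageDetectorME(new LanguageDetectorModel(new File("langdetect-183.bin")));
        for (var text : new String[] {"Eine kleine Nachtmusik", "Eine kleine Nachtmusik schnell"}) {
            // Print the best prediction and its confidence for the two example queries above
            var prediction = detector.predictLanguage(text);
            System.out.println(text + " -> " + prediction.getLang() + " @ confidence " + prediction.getConfidence());
        }
    }
}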
Anecdotally, one can get good results even with low confidence; this should be a tradeoff set by the app owner.
I propose we make the confidence threshold configurable, maybe something along the lines of https://github.com/vespa-engine/sample-apps/tree/master/examples/vespa-chinese-linguistics
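For illustration, a minimal sketch of what that could look like - the same logic as the line from OpenNlpDetector above, but with the 0.02 cutoff taken as a constructor argument instead of being hard-coded (the class and parameter names are made up, and wiring it up as a Linguistics component like the sample app does is left out):

import java.io.IOException;
import java.io.InputStream;
import opennlp.tools.langdetect.LanguageDetectorME;
import opennlp.tools.langdetect.LanguageDetectorModel;

// Hypothetical detector wrapper where the confidence cutoff is injectable,
// e.g. from a config definition, instead of the hard-coded 0.02.
public class ConfigurableOpenNlpDetector {

    private final LanguageDetectorME detector;
    private final double confidenceThreshold;

    public ConfigurableOpenNlpDetector(InputStream model, double confidenceThreshold) throws IOException {
        this.detector = new LanguageDetectorME(new LanguageDetectorModel(model));
        this.confidenceThreshold = confidenceThreshold;
    }

    // Returns the ISO 639-3 code of the best prediction, or null (i.e. fall back to the
    // default language) when the confidence is at or below the threshold.
    public String detect(String text) {
        var prediction = detector.predictLanguage(text);
        return prediction.getConfidence() > confidenceThreshold ? prediction.getLang() : null;
    }
}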
Also, add the notes above to https://docs.vespa.ai/en/linguistics.html - query language detection has its shortcomings for short texts, so a better approach is to always annotate terms or to set the language explicitly using model.{language|locale}. But if that means running something like OpenNLP yourself, it might be easier to just use the Vespa built-in OpenNLP and configure it ...
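For reference, this is what forcing the language looks like for the first example above - with model.language set, the trace should show GERMAN regardless of term count:

$ vespa query "select * from music where userInput(@text)" tracelevel=3 text='Eine kleine Nachtmusik' model.language=de | grep 'Stemming with language'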