vespa-engine / vespa

AI + Data, online. https://vespa.ai
Apache License 2.0

Multiple language annotation support #31372

Open buinauskas opened 4 weeks ago

buinauskas commented 4 weeks ago

Describe the bug

When a query contains several language annotations, only one of them is applied. As a result, the whole search query is stemmed with a single language instead of each operator being stemmed with its own.

To Reproduce

Schema:

schema items {
    document items {
        field language type string {
            indexing: set_language | summary | attribute
            attribute {
                fast-access
                fast-search
            }
            rank: filter
        }

        field title type string {
            indexing: summary | index
            match: text
        }
    }

    fieldset default {
        fields: title
    }
}

services.xml:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<services version="1.0">

  <admin version="2.0">
  </admin>

  <container id="default" version="1.0">
    <search/>
    <document-api/>
  </container>

  <content id="content" version="1.0">
    <redundancy>1</redundancy>
    <documents>
      <document type="items" mode="index"/>
    </documents>
  </content>

</services>

Vespa search request:

{
    "yql": "select * from items where ({language: 'fr', grammar: 'all'}userInput(@q)) or ({language: 'en', grammar: 'all'}userInput(@q))",
    "q": "machine learning",
    "trace.level": 3
}
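For anyone reproducing this with more than two languages, the request body above can be generated rather than written by hand. This is only a minimal sketch: the helper name `build_multilang_request` and its parameters are hypothetical, and it just assembles the same YQL string shown above without contacting Vespa.

```python
import json


def build_multilang_request(query_text, languages, grammar="all", trace_level=3):
    """Build a query body that ORs one annotated userInput(@q) per language.

    Hypothetical helper for this issue; it only formats the request shown above.
    """
    clauses = [
        # %-formatting avoids having to escape the literal braces of the annotation.
        "({language: '%s', grammar: '%s'}userInput(@q))" % (lang, grammar)
        for lang in languages
    ]
    yql = "select * from items where " + " or ".join(clauses)
    return {"yql": yql, "q": query_text, "trace.level": trace_level}


body = build_multilang_request("machine learning", ["fr", "en"])
print(json.dumps(body, indent=4))
```

Reordering the `languages` list is enough to reproduce the swapped-order case described below.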

Inspecting the trace, I see only a single stemming message, which indicates that both query operators were stemmed using French:

{
  "message": "Stemming with language=FRENCH"
}

When I swap the order of the languages, both query operators are stemmed using English only:

{
  "message": "Stemming with language=ENGLISH"
}
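To check this without reading the trace by eye, the stemming messages can be pulled out of the response. The sketch below assumes nothing about the trace layout beyond nested JSON objects and arrays carrying `message` strings like the ones quoted above; the function name `stemming_languages` and the `sample` response are made up for illustration.

```python
PREFIX = "Stemming with language="


def stemming_languages(node):
    """Recursively collect the language from every 'Stemming with language=...' message."""
    found = []
    if isinstance(node, dict):
        message = node.get("message")
        if isinstance(message, str) and message.startswith(PREFIX):
            found.append(message[len(PREFIX):])
        # Descend into every value; plain strings and numbers yield nothing.
        for value in node.values():
            found.extend(stemming_languages(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(stemming_languages(item))
    return found


# Hypothetical response fragment, shaped only for this example.
sample = {
    "trace": {
        "children": [
            {"message": "Stemming with language=FRENCH"},
            {"children": [{"message": "some other trace message"}]},
        ]
    }
}
print(stemming_languages(sample))
```

If the annotation were applied per operator, a query with `fr` and `en` would be expected to yield two entries here rather than one.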

Expected behavior

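The language annotation should be applied on a per-operator basis, so that each userInput() operator is stemmed with the language given in its own annotation.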

Environment

Vespa version 8.308.26

Additional context

This can be implemented with custom searchers, but that is challenging for non-engineers, especially data scientists, who usually know Python well but not Java.

kkraune commented 3 weeks ago

Thanks for reporting. We will evaluate this in the weekly ticket review.