[Schema streaming mode] Enhance rank calculation for substring search

vespa-engine / vespa

AI + Data, online. https://vespa.ai

Apache License 2.0

[Schema streaming mode] Enhance rank calculation for substring search #30132

Open akolhun opened 9 months ago

akolhun commented 9 months ago

Is your feature request related to a problem? Please describe.

Given the schema as:

schema test {
    document test {
        field description type string {
            indexing: summary | index
            match: substring
        }
    }
}

And a document is created with description=environmental.

Then the following two search requests:

select * from test where description contains 'environment'
select * from test where description contains 'env'

both return the matching document with exactly the same score: relevance=0.38.

Describe the solution you'd like

Considering the sample above, the request with search_term=environment should have a higher score than the request with search_term=env.

jamesbond7 commented 9 months ago

Isn't this a bug? Vespa's documentation says: "...Streaming search uses the same implementation of most features in Vespa, including ranking, matching and grouping, and supports the same features...". We are working on hybrid search in streaming mode and rely heavily on correct ranking. Thanks

baldersheim commented 9 months ago

The documentation is not perfect. There are a few differences. We are currently trying to reduce the gap, but there will always be some differences. Streaming search has a larger feature set, especially related to matching, because there we always have the raw text available. Substring matching is a feature only available in streaming search. That is why improving the rank here is an enhancement, and not a bug.

jamesbond7 commented 9 months ago

We would appreciate it if you could prioritize this issue.

jobergum commented 9 months ago

@jamesbond7

Vespa index mode doesn't support substring, so you could not match env against environment - so this is obviously an enhancement and not a bug.

bratseth commented 9 months ago

Yes, this is a new feature, but one that makes sense. How about creating a separate rank feature ("matchAccuracy"?) that gives the term-weighted average of the closeness of the match of the term to the field? Could also potentially use it with multiple stems.
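To make the proposal concrete, here is a minimal Python sketch of what a term-weighted average of match closeness could look like. It is not Vespa code and no such rank feature exists in the thread: the function names `closeness` and `match_accuracy`, the whitespace tokenization, and the definition of closeness as the fraction of a field token covered by the query term are all assumptions made for illustration.

```python
from typing import Iterable, List, Tuple


def closeness(term: str, field_tokens: Iterable[str]) -> float:
    """Hypothetical closeness: the best fraction of a field token covered by
    the term, or 0.0 if the term is not a substring of any token."""
    term = term.lower()
    best = 0.0
    for token in field_tokens:
        token = token.lower()
        if term and term in token:
            best = max(best, len(term) / len(token))
    return best


def match_accuracy(weighted_terms: List[Tuple[str, float]], field_text: str) -> float:
    """Hypothetical 'matchAccuracy': term-weighted average of closeness.

    weighted_terms is a list of (term, weight) pairs; field_text is the raw
    field value, split on whitespace (an assumption, not Vespa's tokenizer).
    """
    tokens = field_text.split()
    total_weight = sum(weight for _, weight in weighted_terms)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(weight * closeness(term, tokens) for term, weight in weighted_terms)
    return weighted_sum / total_weight


if __name__ == "__main__":
    field = "environmental"
    # The longer query term covers more of the field token, so it scores higher.
    print(match_accuracy([("environment", 1.0)], field))
    print(match_accuracy([("env", 1.0)], field))
```

Under these assumptions, the query for 'environment' would score higher than the query for 'env' against description=environmental, which is the ordering the issue asks for. A real implementation might measure closeness differently, for example by edit distance or by match position.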