vespa-engine / pyvespa

Python API for https://vespa.ai, the open big data serving engine
https://pyvespa.readthedocs.io/
Apache License 2.0

Access to user query in reranking phase when using approximate neighbor search #478

Closed louisoutin closed 1 year ago

louisoutin commented 1 year ago

Hello,

I have documents indexed in Vespa, with one field containing semantic embeddings. I would like to run a query that uses ANN on the vectors ONLY for matching, but I want to use the query text in the reranking phase to compute a text similarity score in addition to the vector closeness score. Example:

RankProfile:

RankProfile(
    name="hybrid",
    inherits="default",
    first_phase="closeness(text_embedding) + bm25(text)",
)

Query:

{
  # Braces around targetHits are doubled because this is an f-string
  "yql": f"select * from sources {schema_name} where {{targetHits:100}}nearestNeighbor(text_embedding,query_embedding)",
  "query": $query,
  "type": "any",
  "hits": 10,
  "ranking.features.query(query_embedding)": $vector,
  "ranking.profile": "hybrid",
}

Currently, if I run that query, the returned hits are scored using closeness(text_embedding) only; bm25(text) always has a score of 0.

To fix it, I have to append "or userQuery()" to the YQL query string. However, that slows down the query and raises a Timeout exception. Is there a way to access the user query in first_phase ranking without adding userQuery() to the match condition, so that it is used only for reranking and not for phase 0 matching?

louisoutin commented 1 year ago

Actually, the timeout happened mainly because I was running a batched query with quite a large batch size. It was fixed after reducing the batch size. I also found the rank operator (https://docs.vespa.ai/en/reference/query-language-reference.html#ranked), which is what I was looking for.
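
For anyone landing here later, the rank operator mentioned above could be applied to the query from the original post roughly as follows. This is a minimal sketch, not a confirmed copy of what was finally used: the helper name build_rank_query and its parameters are hypothetical, and the field, query tensor, and rank profile names are taken from the example in the question. Only the first argument of rank() decides which documents match, while the remaining arguments only contribute rank features, so bm25(text) gets a non-zero score without userQuery() widening the match set.

```python
def build_rank_query(schema_name, query_text, query_vector, target_hits=100, hits=10):
    """Build a Vespa query body that matches on ANN only but ranks with text too.

    Hypothetical helper: the name and signature are illustrative, not part of pyvespa.
    """
    # rank(first, rest...): only `first` (the nearestNeighbor operator) selects
    # matches; userQuery() is evaluated for rank features such as bm25(text).
    # Doubled braces produce literal { } inside the f-string.
    yql = (
        f"select * from sources {schema_name} where "
        f"rank({{targetHits:{target_hits}}}nearestNeighbor(text_embedding, query_embedding), "
        f"userQuery())"
    )
    return {
        "yql": yql,
        "query": query_text,  # consumed by userQuery()
        "type": "any",
        "hits": hits,
        "ranking.features.query(query_embedding)": query_vector,
        "ranking.profile": "hybrid",  # rank profile from the question
    }


# Example with placeholder values (schema name and vector are made up)
body = build_rank_query("doc", "hello world", [0.1, 0.2, 0.3])
print(body["yql"])
```

With pyvespa, the resulting dictionary would presumably be sent with something like app.query(body=body), where app is an already connected Vespa application instance.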