vespa-engine / vespa

AI + Data, online. https://vespa.ai
Apache License 2.0
5.79k stars 604 forks

Rewrite nearest neighbor search guide to use native hf embedders #28088

Closed eostis closed 1 year ago

eostis commented 1 year ago

The example below is from https://docs.vespa.ai/en/embedding.html#embedding-a-query-text:

{
  "yql": "select from sources where ...",
  "query": "semantic search",
  "input.query(embedding)": "embed(e5, contextualized search)",
  "input.query(embedding2)": "embed(e5, neural search)",
  "ranking": "semantic"
}

This code is ambiguous because of the "...". How are the two query embeddings actually used here?

Another example, from https://docs.vespa.ai/en/nearest-neighbor-search-guide.html#multiple-nearest-neighbor-search-operators-in-the-same-query:

vespa query \
    'yql=select title, matchfeatures from track where ({label:"q", targetHits:10}nearestNeighbor(embedding,q)) or ({label:"qa", targetHits:10}nearestNeighbor(embedding,qa))' \
    'hits=2' \
    'ranking=closeness-label' \
    "input.query(q)=$Q" \
    "input.query(qa)=$QA"

Are "$Q" and "$QA" two shell variables, or are they part of the YQL syntax?

It looks like a detail, but I had to turn to blog articles to find usable code, because the examples in the documentation were too ambiguous.

jobergum commented 1 year ago

> This code is ambiguous with the "...". How can we really use the two query embeddings here?

Because the embeddings are created regardless of how you retrieve: you can retrieve by price > 1 and still have those embeddings created, or you can have four nearestNeighbor query operators (connected by or/and/rank). That is why this example does not demonstrate a specific query tree (retrieval).

> Are "$Q" and "$QA" 2 shell variables, or are they part of YQL syntax?

No, they are shell variables set in the section above.
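
To illustrate the distinction, here is a minimal sketch in Python of what the shell does to the double-quoted arguments before `vespa query` ever sees them. The vector values are hypothetical stand-ins, not the ones defined in the guide, and `string.Template` is used only to mimic the shell's `$NAME` substitution.

```python
from string import Template

# Hypothetical stand-ins for the vectors the guide stores in $Q and $QA.
shell_vars = {"Q": "[0.1, 0.2, 0.3]", "QA": "[0.4, 0.5, 0.6]"}

# The arguments as written in the guide, before the shell touches them.
args = ["input.query(q)=$Q", "input.query(qa)=$QA"]

# Inside double quotes the shell replaces $NAME with the variable's value,
# so the CLI only receives the expanded strings. YQL never sees "$Q".
expanded = [Template(arg).substitute(shell_vars) for arg in args]

for arg in expanded:
    print(arg)
```

The `q` and `qa` inside `nearestNeighbor(embedding,q)` are different: they are query tensor names that refer to `input.query(q)` and `input.query(qa)`, not shell variables.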

jobergum commented 1 year ago

where rank({targetHits:100}nearestNeighbor(doc, embedding), {targetHits:100}nearestNeighbor(doc, embedding2))

where ({targetHits:100}nearestNeighbor(doc, embedding)) or ({targetHits:100}nearestNeighbor(doc, embedding2))

where rank(title contains "nike" and price > 100, {targetHits:100}nearestNeighbor(doc, embedding), {targetHits:100}nearestNeighbor(doc, embedding2))
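
As a rough sketch of how these clauses could slot into the request body quoted at the top of the thread, the Python below builds one body per clause. The `e5` embedder id, the `semantic` rank profile and the `doc` tensor field come from the snippets in this thread; the `select * from sources *` prefix and the helper name `build_query` are assumptions made for illustration, and the code only constructs the JSON without sending it anywhere.

```python
import json

# The three where clauses from this comment, as plain strings.
RANK_BOTH = (
    "rank({targetHits:100}nearestNeighbor(doc, embedding), "
    "{targetHits:100}nearestNeighbor(doc, embedding2))"
)
OR_BOTH = (
    "({targetHits:100}nearestNeighbor(doc, embedding)) or "
    "({targetHits:100}nearestNeighbor(doc, embedding2))"
)
HYBRID = (
    'rank(title contains "nike" and price > 100, '
    "{targetHits:100}nearestNeighbor(doc, embedding), "
    "{targetHits:100}nearestNeighbor(doc, embedding2))"
)


def build_query(where_clause: str) -> dict:
    """Hypothetical helper: combine a where clause with the two embed() inputs."""
    return {
        # Assumed select prefix; the docs example elides this part.
        "yql": f"select * from sources * where {where_clause}",
        # Each nearestNeighbor operator refers to one of these query tensors by name.
        "input.query(embedding)": "embed(e5, contextualized search)",
        "input.query(embedding2)": "embed(e5, neural search)",
        "ranking": "semantic",
    }


for clause in (RANK_BOTH, OR_BOTH, HYBRID):
    print(json.dumps(build_query(clause), indent=2))
```

In all three bodies the embed() inputs stay the same; only the query tree changes.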

eostis commented 1 year ago

Thanks @jobergum. But this issue is about the whole documentation: all code examples should be more realistic.

For instance, your three examples would be great in the documentation as examples of hybrid and multi-vector queries.

jobergum commented 1 year ago

Yes, I will add them to the linked resource.

> this issue is about the whole documentation: all code examples should be more realistic

Feel free to point out concrete examples where we do not use realistic examples.

As for the shell variables, I don't know how to improve on that.


eostis commented 1 year ago

Just expanding all variables would be clearer. People scan the documentation with a narrow focus.

(You will not believe me, but I did not notice the variables' contents in the documentation until now!)

vespa query \
    'yql=select title from track where ({targetHits:10}nearestNeighbor(embedding,q)) or ({targetHits:10}nearestNeighbor(embedding,qa))' \
    'hits=2' \
    'ranking=closeness-t4' \
    "input.query(q)=Total Eclipse Of The Heart query vector" \
    "input.query(qa)=Summer of ‘69 query vector"

jobergum commented 1 year ago

Good idea. We can rewrite this guide to use native embedders instead.

jobergum commented 1 year ago

Fixed by https://github.com/vespa-engine/documentation/pull/2847 and now live on https://docs.vespa.ai/en/nearest-neighbor-search-guide.html