vespa-engine / vespa

AI + Data, online. https://vespa.ai

Feature Request: Access to sparse vectors (TF-IDF or BM25) as generated by Vespa indexing #20647

Open tsaltena opened 2 years ago

tsaltena commented 2 years ago

Is your feature request related to a problem? Please describe. We opted for Vespa because of its ability to handle sparse and dense vectors at the same time. In some of our scenarios we want to compare a group of documents to the rest of the data. To do that, we combine a number of stored documents into a composite vector. We would therefore like to access the sparse vectors generated during Vespa indexing and do some computation on them before feeding them back into a closeness query.

Describe the solution you'd like We'd need two ways to interact with these vectors: 1) get the raw vectors as part of the document summary in a query, and 2) use these raw vectors in closeness scoring.

Describe alternatives you've considered As a hack, we could use a dedicated tensor field to store generated sparse embeddings (or perhaps copy them over from the index?), but this feels like a waste of resources. A sketch of this alternative follows below.
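
For illustration, a minimal sketch of that dedicated-field alternative, assuming a hypothetical schema with a string field `text` and a mapped tensor field `term_weights` (all names here are illustrative, not something Vespa generates for you):

```
schema doc {
    document doc {
        field text type string {
            indexing: index | summary
            index: enable-bm25
        }
        # Hypothetical mapped tensor holding term -> weight, fed alongside the text
        field term_weights type tensor<float>(term{}) {
            # attribute makes it usable in ranking expressions, summary returns it with the hit
            indexing: attribute | summary
        }
    }
    rank-profile sparse_dot {
        inputs {
            # Composite vector built client-side from a group of stored documents
            query(q_terms) tensor<float>(term{})
        }
        first-phase {
            # Sparse dot product between the composite query vector and the per-document vector
            expression: sum(query(q_terms) * attribute(term_weights))
        }
    }
}
```

The composite vector would then be passed in the query as `input.query(q_terms)` (or `ranking.features.query(q_terms)`), and the stored `term_weights` tensor comes back with each hit because the field is included in summary.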

bratseth commented 2 years ago

I don't understand what you mean by "sparse vectors generated during Vespa indexing" - could you explain some more?

tsaltena commented 2 years ago

By these sparse vectors I mean the actual document-term matrix rows per document; my assumption was that they would be stored somewhere as the basis for BM25 ranking when a bm25 index is enabled?

bratseth commented 2 years ago

Right, so you'd like access to a sparse vector of term -> frequency for a field in a document. Yes, that's doable, although not something that's directly available - Vespa uses posting lists with one entry per occurrence to enable positional ranking.
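
As a stop-gap, per-query-term occurrence counts can already be surfaced at query time through rank features; a sketch, assuming a string field named `text` (note this only covers terms present in the query, not the full document-term row):

```
rank-profile term_counts {
    first-phase {
        expression: bm25(text)
    }
    # Occurrence counts of the first three query terms in the field, returned with each hit
    summary-features {
        fieldTermMatch(text,0).occurrences
        fieldTermMatch(text,1).occurrences
        fieldTermMatch(text,2).occurrences
    }
}
```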

Related: see the textSimilarity features in https://docs.vespa.ai/en/reference/rank-features.html, which give you a measure of document similarity that also uses positional information.
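
To make that pointer concrete, a hypothetical rank profile combining bm25 with the positional textSimilarity score could look like this (field name and weight are illustrative):

```
rank-profile positional_similarity {
    first-phase {
        # textSimilarity(text).score blends proximity, order and coverage of the query terms
        expression: bm25(text) + 2 * textSimilarity(text).score
    }
}
```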