vespa-engine / vespa

AI + Data, online. https://vespa.ai
Apache License 2.0

Add support for filtering on a mapped dimension of a tensor field during nearest neighbor search #29587

Open farshidz opened 7 months ago

farshidz commented 7 months ago

Is your feature request related to a problem? Please describe. We have tensor fields with two mapped dimensions and one indexed dimension: tensor<float>(p{}, q{}, x[384]). We would like to be able to restrict the search to one or more specific values of the mapped dimension p.

Describe the solution you'd like: The nearest neighbor search operator should support one or more values for a mapped dimension, e.g.,

nearestNeighbor(tensor_field, query_vector, p=('value1', 'value2', 'value3'))

so that the search only considers vectors whose mapped dimension p has one of the values value1, value2, or value3.
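To make the requested semantics concrete, here is a minimal brute-force sketch in Python. It is not Vespa code and says nothing about how HNSW would implement this. The `DocTensor` layout, the function names, the Euclidean metric, and the choice to score each document by its closest allowed vector are all assumptions made for illustration.

```python
import math
from typing import Dict, Iterable, List, Tuple

# Hypothetical in-memory stand-in for a tensor<float>(p{}, q{}, x[N]) field:
# each document maps a (p, q) label pair to one dense vector.
DocTensor = Dict[Tuple[str, str], List[float]]


def euclidean(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def filtered_nearest(
    docs: Dict[str, DocTensor],
    query: List[float],
    allowed_p: Iterable[str],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k closest documents, looking only at vectors whose
    mapped dimension p is one of allowed_p (exhaustive scan, no index)."""
    allowed = set(allowed_p)
    best: Dict[str, float] = {}
    for doc_id, tensor in docs.items():
        for (p, _q), vec in tensor.items():
            if p not in allowed:
                continue  # vectors under other labels are ignored entirely
            distance = euclidean(query, vec)
            # Assumption: a document is scored by its closest allowed vector.
            if distance < best.get(doc_id, math.inf):
                best[doc_id] = distance
    return sorted(best.items(), key=lambda item: item[1])[:k]
```

A document that only has vectors under other labels never appears in the result, even if one of those vectors is the closest to the query overall.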

Describe alternatives you've considered: Alternatively, the nearest neighbor operator could accept only a single value for the mapped dimension; for multiple values, the query would then have to consist of a disjunction of multiple nearest neighbor operators.
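The disjunction alternative amounts to running one single-label search per value and taking the union of the hits. A rough sketch of that merge step follows, assuming each per-label search returns a list of (document id, distance) pairs. The `Hit` shape and the `merge_disjunction` name are hypothetical and not part of any Vespa API.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical result shape: (document id, distance to the query vector).
Hit = Tuple[str, float]


def merge_disjunction(per_label_hits: Iterable[List[Hit]], k: int) -> List[Hit]:
    """Union the hit lists of several single-label searches, keep the
    smallest distance seen for each document, and return the k best."""
    best: Dict[str, float] = {}
    for hits in per_label_hits:
        for doc_id, distance in hits:
            if distance < best.get(doc_id, float("inf")):
                best[doc_id] = distance
    return sorted(best.items(), key=lambda item: item[1])[:k]
```

For example, merging `[("a", 0.2), ("b", 0.5)]` and `[("b", 0.1), ("c", 0.9)]` with `k=2` keeps `b` at distance 0.1 and `a` at distance 0.2.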

Additional context: N/A

jobergum commented 7 months ago

Vespa supports this today using tensor compute expressions, but not in the context of the nearestNeighbor query operator or HNSW indexing. Maybe you could elaborate on why you need it for retrieval? An illustrative use case would help.

farshidz commented 7 months ago

We index vectors that are labeled. At search time, we need to retrieve only vectors with one or more specific labels. While this could be handled by creating a tensor field for each label, the full set of labels isn't known in advance, so we have to rely on a generic tensor field and use a mapped dimension for the label. Even though vectors for different labels do not follow the exact same distribution, in practice we have seen good recall with this approach with HNSW.

pandu-k commented 6 months ago

Hi @jobergum! We haven't yet found a solution for the problem @farshidz is describing. We may have some capacity at some point to work on contributing this feature to Vespa. If we go down this route, are there any tips for getting started, or resources you could point us towards?

jobergum commented 6 months ago

This is fully supported with tensor compute expressions but not HNSW indexing for efficient retrieval. So, if you can limit it to ranking phases, the functionality is there.

I would say that this is a very complex task for someone without a deep knowledge of the code base.

pandu-k commented 5 months ago

Thanks @jobergum. Having this functionality at retrieval time is key for our use case. Is there any estimate of when this might become available? Or is there anything we can do on our end to help get this done?

bratseth commented 5 months ago

The Vespa core work needed is expert level so probably not suitable for external contributions (although you're welcome to assess this yourself - code and build instructions are on GitHub).

We do plan to get to this at some point but no ETA currently. If you are a paying customer you can create a support ticket instead and we'll set an ETA.

Workaround: Add the labels to a string array in the document in addition to the tensor, and filter on that.
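A query using this workaround might be assembled as in the sketch below. It assumes the labels are duplicated into a string array field (called `labels` here) and combines the standard `nearestNeighbor` operator with `contains` filters on that field. The document type, field names, and query tensor name are placeholders, and `build_filtered_nn_yql` is a hypothetical helper, not something Vespa provides.

```python
from typing import Iterable


def quote_yql(value: str) -> str:
    """Wrap a value in double quotes with basic backslash escaping."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_filtered_nn_yql(
    doc_type: str,
    tensor_field: str,
    query_tensor: str,
    label_field: str,
    labels: Iterable[str],
    target_hits: int = 10,
) -> str:
    """Build a YQL string: nearest neighbor search AND (label is any of labels)."""
    labels = list(labels)
    if not labels:
        raise ValueError("at least one label is required")
    # One `contains` term per label, combined with OR.
    label_filter = " or ".join(
        f"{label_field} contains {quote_yql(label)}" for label in labels
    )
    return (
        f"select * from {doc_type} where "
        f"({{targetHits:{target_hits}}}nearestNeighbor({tensor_field}, {query_tensor})) "
        f"and ({label_filter})"
    )


if __name__ == "__main__":
    # Placeholder names; adapt them to the actual schema.
    print(build_filtered_nn_yql(
        "doc", "tensor_field", "query_vector", "labels", ["value1", "value2"]
    ))
```

One caveat, as far as I can tell: the filter selects whole documents, so a document holding vectors for several labels may still be matched on a vector whose label was not requested.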