objectbox / objectbox-dart

Flutter database for super-fast Dart object persistence
https://docs.objectbox.io/getting-started
Apache License 2.0
1.01k stars, 117 forks

Docs: explain how to use Vector Search maxResultCount combined with additional criteria #658

Open jonny07 opened 1 month ago

jonny07 commented 1 month ago

Steps to reproduce

  1. Create a vector search query. I use embeddings and want 2 results:
     final query = box.query(Message_.embedding.nearestNeighborsF32(search_embedding, 2)).build();
     => Works fine.

  2. Combine the search with a second criterion:
     final query = box.query(Message_.embedding.nearestNeighborsF32(search_embedding, 2).and(Message_.chatid.equals(character))).build();

=> The query returns 0, 1, or 2 results. Expected: 2 results.

I assume the first condition is evaluated on its own: the vector search finds 2 results regardless of the second criterion. The second criterion is then applied to those 2 results, so 0, 1, or 2 of them remain.
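The assumed order of evaluation can be illustrated with a small, self-contained Dart simulation. This is only a sketch of the behavior described above, not ObjectBox code: the `Message` class, `squaredDistance`, `nearest`, and `neighborsThenFilter` are hypothetical helpers, and an exact sort stands in for the vector index.

```dart
// Hypothetical stand-in for the Message entity from the report.
class Message {
  final int chatId;
  final List<double> embedding;
  Message(this.chatId, this.embedding);
}

// Squared Euclidean distance between two equal-length vectors.
double squaredDistance(List<double> a, List<double> b) {
  var sum = 0.0;
  for (var i = 0; i < a.length; i++) {
    final d = a[i] - b[i];
    sum += d * d;
  }
  return sum;
}

// Exact k-nearest-neighbor search by sorting; stands in for the index lookup.
List<Message> nearest(List<Message> all, List<double> target, int k) {
  final sorted = [...all]
    ..sort((x, y) => squaredDistance(x.embedding, target)
        .compareTo(squaredDistance(y.embedding, target)));
  return sorted.take(k).toList();
}

// Mimics the reported behavior: cap the neighbors first, then filter them.
List<Message> neighborsThenFilter(
    List<Message> all, List<double> target, int maxResultCount, int chatId) {
  return nearest(all, target, maxResultCount)
      .where((m) => m.chatId == chatId)
      .toList();
}

void main() {
  final messages = [
    Message(1, [0.0, 0.1]),
    Message(2, [0.0, 0.2]),
    Message(1, [0.0, 0.3]),
    Message(1, [0.0, 0.4]),
  ];
  final target = [0.0, 0.0];

  // The two nearest messages belong to chat 1 and chat 2.
  print(neighborsThenFilter(messages, target, 2, 1).length); // 1
  print(neighborsThenFilter(messages, target, 2, 3).length); // 0
}
```

Chat 1 has three messages in this data set, yet only one is returned, because the filter only ever sees the two nearest candidates.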

Expected behavior

The vector search returns 2 results even when additional conditions are applied.

Actual behavior

The number of results varies, depending on whether the results found in step 1 fulfill the second criterion.

Note after analysis

I assume the following example from your documentation at https://docs.objectbox.io/on-device-vector-search will also not work as expected:
final query = box.query(City_.location.nearestNeighborsF32(madrid, 2).and(City_.name.startsWith("B"))).build();

I just figured out that I could probably use "limit" from https://docs.objectbox.io/queries on the query and drop the limit from the vector search (dropping it is not possible, so I just set it to several million). My question to you is how this behaves regarding resources, given how the vector search is implemented, but I assume it will work. Another workaround for me might be to work with a stream and stop reading from it after 2 results.

So it might not be a bug; maybe just the documentation above needs to be adapted.

greenrobot-team commented 1 month ago

Thanks for this issue! Note that the maxResultCount parameter only applies to the results of the nearest neighbor search, see also the API documentation on how to use it. It does not apply to the final query. As you have guessed, use the limit API for that. This is also hinted at in the API documentation.

> leave it out is not possible, I just set it to several million

The maxResultCount parameter exists to improve performance of the nearest neighbor search. The higher the allowed number of results, the longer the nearest neighbor search sub-query will take to compute (obviously with a larger impact if the data set is large).
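The interplay between the candidate cap and a limit on the final result can be sketched with a plain Dart simulation. It does not use ObjectBox and says nothing about its internals: `Message`, `squaredDistance`, and `search` are hypothetical helpers, and a full sort stands in for the nearest neighbor sub-query. In objectbox-dart the final limit would, to my knowledge, be set through the `limit` property of the built query.

```dart
// Hypothetical stand-in for a stored object with a vector and a filter field.
class Message {
  final int chatId;
  final List<double> embedding;
  Message(this.chatId, this.embedding);
}

// Squared Euclidean distance between two equal-length vectors.
double squaredDistance(List<double> a, List<double> b) {
  var sum = 0.0;
  for (var i = 0; i < a.length; i++) {
    final d = a[i] - b[i];
    sum += d * d;
  }
  return sum;
}

// Takes the maxResultCount nearest objects, filters them, then applies limit.
List<Message> search(List<Message> all, List<double> target,
    {required int maxResultCount, required int chatId, required int limit}) {
  final byDistance = [...all]
    ..sort((x, y) => squaredDistance(x.embedding, target)
        .compareTo(squaredDistance(y.embedding, target)));
  return byDistance
      .take(maxResultCount) // candidate cap of the neighbor search
      .where((m) => m.chatId == chatId) // additional criterion
      .take(limit) // limit on the final result
      .toList();
}

void main() {
  final messages = [
    Message(1, [0.0, 0.1]),
    Message(2, [0.0, 0.2]),
    Message(1, [0.0, 0.3]),
    Message(1, [0.0, 0.4]),
  ];
  final target = [0.0, 0.0];

  // Cap of 2: only one candidate belongs to chat 1, so the limit is not reached.
  print(search(messages, target, maxResultCount: 2, chatId: 1, limit: 2)
      .length); // 1

  // Cap of 4: enough candidates survive the filter to fill the limit.
  print(search(messages, target, maxResultCount: 4, chatId: 1, limit: 2)
      .length); // 2
}
```

The cap has to be large enough for the filter to leave at least `limit` matches, which is the cost trade-off described above: a larger cap means more work for the nearest neighbor step.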

We should probably copy this from the API documentation and add this to the web documentation at https://docs.objectbox.io/on-device-vector-search

jonny07 commented 1 month ago

Thanks a lot for your feedback! It currently works fine for me and I'm very happy with ObjectBox, thanks a lot for your great work!

Currently my database is quite small, so my solution of setting the maxResultCount parameter of the nearest neighbor search very high will probably lead to performance problems with large databases. Using a stream, as I mentioned before, will probably not work either, because I would still need to set maxResultCount, and the right value is unknown when the search is combined with other criteria.

I assume that for the algorithm to work on very large databases in combination with other criteria, it would need to keep searching until enough results are found, without the user specifying maxResultCount. This could be done, for example, by first applying all other criteria and then running the nearest neighbor search on that result until enough results are found, or by evaluating both in parallel until the wanted number of results is found.

As another option, I could split my data into separate databases, e.g. 20 databases with the same structure, and search only one of them, so I would not need to combine the nearest neighbor search with other criteria.
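The "apply the other criteria first" idea can be sketched in plain Dart as a filter followed by an exact, brute-force nearest neighbor ranking. This is only an illustration of the proposal, not how ObjectBox evaluates queries: `Message`, `squaredDistance`, and `filterThenNearest` are hypothetical helpers, and the linear scan ignores any vector index.

```dart
// Hypothetical stand-in for a stored object with a vector and a filter field.
class Message {
  final int chatId;
  final List<double> embedding;
  Message(this.chatId, this.embedding);
}

// Squared Euclidean distance between two equal-length vectors.
double squaredDistance(List<double> a, List<double> b) {
  var sum = 0.0;
  for (var i = 0; i < a.length; i++) {
    final d = a[i] - b[i];
    sum += d * d;
  }
  return sum;
}

// Filters first, then ranks only the matching objects by distance.
List<Message> filterThenNearest(
    List<Message> all, List<double> target, int chatId, int k) {
  final matching = all.where((m) => m.chatId == chatId).toList()
    ..sort((x, y) => squaredDistance(x.embedding, target)
        .compareTo(squaredDistance(y.embedding, target)));
  return matching.take(k).toList();
}

void main() {
  final messages = [
    Message(1, [0.0, 0.1]),
    Message(2, [0.0, 0.2]),
    Message(1, [0.0, 0.3]),
    Message(1, [0.0, 0.4]),
  ];
  final result = filterThenNearest(messages, [0.0, 0.0], 1, 2);

  // Both results belong to chat 1 and are its two nearest messages.
  print(result.length); // 2
  print(result.map((m) => m.embedding).toList()); // [[0.0, 0.1], [0.0, 0.3]]
}
```

This always returns k results when at least k objects match the filter, but every matching object has to be compared against the target, so the cost grows with the size of the filtered set.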