Open OptimusLime opened 6 months ago
- What is the best practice for filtering character memories for embedding search only within a single character?
- If characters share a single table and all queries use prefilter, what are the performance implications?
The determining factor is how selective the filter is.
- Is there an alternative structure to pursue to maximize performance?
We might later implement a composite index, where you can include certain metadata columns within the ANN index itself. So the character ID could be stored alongside the vector in the index. That would make prefilter a bit faster for that column.
- Is it better to create many unique character tables or prefilter on a single large character memory table?
Creating many unique character tables could also work. You'd have to benchmark to see if that's optimal for your use case right now.
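One way to start that benchmark is sketched below in plain Python. It is only a toy model of the two layouts, not LanceDB code: the synthetic data, the `search_shared` and `search_dedicated` helpers, and the brute-force `knn` function are all hypothetical stand-ins, and real numbers will depend on your data and index.

```python
import math
import random
import timeit
from collections import defaultdict

# Synthetic data (hypothetical): 2000 memories spread over 20 characters.
rng = random.Random(0)
names = [f"character_{i}" for i in range(20)]
shared_table = [
    {"name": rng.choice(names), "vector": [rng.random() for _ in range(8)]}
    for _ in range(2000)
]

def knn(rows, query, k):
    """Brute-force nearest neighbours by Euclidean distance."""
    return sorted(rows, key=lambda r: math.dist(r["vector"], query))[:k]

def search_shared(table, name, query, k):
    """Layout A: one shared table, filtered on every query (prefilter)."""
    return knn([r for r in table if r["name"] == name], query, k)

# Layout B: one table per character, built once up front.
per_character = defaultdict(list)
for row in shared_table:
    per_character[row["name"]].append(row)

def search_dedicated(tables, name, query, k):
    """Layout B: look up the character's own table, no filter needed."""
    return knn(tables.get(name, []), query, k)

query = [0.5] * 8
shared_hits = search_shared(shared_table, "character_3", query, 5)
dedicated_hits = search_dedicated(per_character, "character_3", query, 5)

# Time both layouts; absolute values are machine dependent.
shared_s = timeit.timeit(
    lambda: search_shared(shared_table, "character_3", query, 5), number=200)
dedicated_s = timeit.timeit(
    lambda: search_dedicated(per_character, "character_3", query, 5), number=200)
print(f"shared table: {shared_s:.4f}s, per-character tables: {dedicated_s:.4f}s")
```

Both layouts return the same neighbours here; the only difference is where the filtering cost is paid, which is what a real benchmark against your own tables would need to measure.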
The determining factor is how selective the filter is.
- If the filter narrows down to < 1,000 rows, then prefilter + (exact) KNN search makes the most sense. You can do this by not creating an index on the column and using `prefilter`.
- Otherwise, when there are many results, prefilter + ANN search makes more sense. If the filter isn't very selective (matches a lot of data), this is pretty quick. But if it only matches, say, 10% of rows, then you will see somewhat slower queries, as it will have to throw out about 90% of ANN matches.
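To make "prefilter + exact KNN" concrete, here is a minimal plain-Python sketch of the idea. It does not call LanceDB: the rows, the two-dimensional vectors, and the helpers `prefilter_exact_knn` and `choose_strategy` are hypothetical, and the 1,000-row cutoff is just the rule of thumb quoted above.

```python
import math

def prefilter_exact_knn(rows, query, predicate, k):
    """Drop non-matching rows first, then rank every survivor by exact distance."""
    candidates = [r for r in rows if predicate(r)]
    candidates.sort(key=lambda r: math.dist(r["vector"], query))
    return candidates[:k]

def choose_strategy(matching_rows, exact_threshold=1000):
    """Rule of thumb from the reply above; the threshold is approximate."""
    if matching_rows < exact_threshold:
        return "prefilter + exact KNN"
    return "prefilter + ANN"

# Hypothetical rows with made-up embeddings.
rows = [
    {"name": "Paul", "text": "plays basketball", "vector": [0.90, 0.10]},
    {"name": "Cody", "text": "plays basketball", "vector": [0.80, 0.20]},
    {"name": "Cody", "text": "eats lunch", "vector": [0.10, 0.90]},
]
query = [0.90, 0.10]  # stand-in for the embedding of "basketball"

top = prefilter_exact_knn(rows, query, lambda r: r["name"] == "Cody", k=1)
print(top[0]["name"], "-", top[0]["text"])
print(choose_strategy(matching_rows=500))
print(choose_strategy(matching_rows=50_000))
```

Exact KNN scans every row that survives the filter, so its cost grows with the number of matches; that is why it only makes sense when the filter is highly selective.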
As of 0.4.20, I only saw two example files in the Rust examples directory, and neither of them comes close to demonstrating 1 or 2 above.
You're going to think I'm being so rude, please excuse me. The issue is a specific conceptual example matching our data, but your response, unfortunately, doesn't point towards examples or code -- and it's kind of filled with jargon I'm trying hard to parse.
Are there Node.js, Python, or Rust examples that are close to what I'm asking? For example, is there something out there for performing exact KNN versus ANN? I wish I were an expert at LanceDB like you; however, some clues as to best practices with the library would be helpful.
- Is there an alternative structure to pursue to maximize performance?
We might later implement a composite index, where you can include certain metadata columns within the ANN index itself. So the character ID could be stored alongside the vector in the index. That would make prefilter a bit faster for that column.
Very cool!
Description
Love the use of the library so far, thank you for all your work. I'm currently using LanceDB inside the Godot game engine via the Rust API and it's a breeze.
Use Case:
I have multiple characters in an environment, I am storing all their individual memories in a single table.
Setup:
Issue:
Clarification in Docs Needed:
Example:
Table: characters
- Character [Paul] plays basketball at 10:20am.
- Character [Cody] plays basketball at 10:30am.
empty_cody_memories has no entries because, when searching for "basketball," the embedding is close to BOTH Paul's and Cody's memories. The search therefore finds Paul's memory as the closest, then applies the post filter name == Cody, and now I don't see any of Cody's basketball memories.
cody_prefiltered_memories has an entry because first we remove all the non-Cody memories from the table and then we perform our nearest neighbor search, yielding only Cody's memories related to basketball, as expected.
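The difference between the two results can be reproduced with a small plain-Python simulation. This is a sketch of the filtering order only, not LanceDB code: the vectors are made up, the helper functions are hypothetical, and a result limit of 1 is assumed since the original snippet is not shown.

```python
import math

# Hypothetical embeddings: both memories sit close to the "basketball" query,
# with Paul's slightly closer.
memories = [
    {"name": "Paul", "text": "plays basketball at 10:20am", "vector": [1.00, 0.00]},
    {"name": "Cody", "text": "plays basketball at 10:30am", "vector": [0.95, 0.05]},
]
query = [1.00, 0.00]  # stand-in for the embedding of "basketball"

def nearest(rows, query, limit):
    """Return the `limit` rows closest to the query vector."""
    return sorted(rows, key=lambda r: math.dist(r["vector"], query))[:limit]

def postfilter_search(rows, query, predicate, limit):
    """Search first, then filter the hits: matches can be filtered away."""
    return [r for r in nearest(rows, query, limit) if predicate(r)]

def prefilter_search(rows, query, predicate, limit):
    """Filter first, then search only among the rows that match."""
    return nearest([r for r in rows if predicate(r)], query, limit)

def is_cody(row):
    return row["name"] == "Cody"

empty_cody_memories = postfilter_search(memories, query, is_cody, limit=1)
cody_prefiltered_memories = prefilter_search(memories, query, is_cody, limit=1)

print(len(empty_cody_memories))        # postfilter: Paul's hit is discarded
print(len(cody_prefiltered_memories))  # prefilter: Cody's memory is found
```

With postfilter, the single nearest hit belongs to Paul and is then thrown away, leaving nothing; with prefilter, Paul's rows never compete, so Cody's memory is returned.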
Link
postfilter rust API ref with a small statement on prefilter