lanterndata / lantern

PostgreSQL vector database extension for building AI applications
https://lantern.dev
GNU Affero General Public License v3.0

Forbid standalone usage of <-> operator using statement AST #56

Closed by var77 1 year ago

var77 commented 1 year ago

Currently we have only one operator: <->.

In src/hnsw/options.c there is a function HnswGetMetricKind, which determines the operator class the index was created with and picks the matching metric kind for the usearch index by comparing the support function pointers.
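As a rough illustration of that idea (not the actual Lantern code), detecting the metric from a support function pointer could look like the sketch below. All names here (MetricKind, DistanceFn, l2sq_dist, l1_dist, metric_kind_from_support_fn) are hypothetical, and the two distance functions were picked only for brevity. The real HnswGetMetricKind works on the PostgreSQL support function looked up from the index relation.

```c
#include <stddef.h>

/* Hypothetical names throughout; this is not Lantern's real API. */
typedef float (*DistanceFn)(const float *a, const float *b, size_t n);

typedef enum { METRIC_L2SQ, METRIC_L1, METRIC_UNKNOWN } MetricKind;

/* Squared Euclidean distance. */
static float l2sq_dist(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

/* Manhattan (L1) distance. */
static float l1_dist(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float d = a[i] - b[i];
        sum += (d < 0.0f) ? -d : d;
    }
    return sum;
}

/* Map the support function registered for an operator class to a metric
 * kind by comparing function pointers. */
static MetricKind metric_kind_from_support_fn(DistanceFn fn)
{
    if (fn == l2sq_dist)
        return METRIC_L2SQ;
    if (fn == l1_dist)
        return METRIC_L1;
    return METRIC_UNKNOWN; /* no index context, or an unsupported function */
}
```

The METRIC_UNKNOWN branch corresponds to the situation discussed in this issue: without an index there is no support function to compare against, so the metric cannot be inferred.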

This is great, because a single operator can support various distance functions. But when the operator is used outside of an index scan, for example in the target list of a SELECT statement, it cannot automatically detect which distance function should be used.

We currently throw an error when <-> is used outside of an index lookup. We do this with ExecutorStart_hook; the hook is implemented by the executor_hook function in src/hnsw/options.c. This function receives a QueryDesc struct, and we currently do regexp matching on its sourceText. This approach does not cover cases where the operator is used in ORDER BY but the plan does not contain an index scan.

To cover all the cases we could use the plannedstmt field of QueryDesc instead. It holds the tree (AST) of the planned statement, where we can find out which indexes are used, and much more.
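A minimal sketch of what such a check could look like is below. It uses simplified stand-in types (MockPlan and friends are hypothetical, not PostgreSQL structs) so that it compiles on its own. In PostgreSQL the corresponding pieces would be the planTree of PlannedStmt, the lefttree, righttree, targetlist and qual of each Plan node, and the indexorderby list of IndexScan. Real plans also have other kinds of children (subplans, Append inputs, and so on) that a complete walker would need to visit.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for PostgreSQL plan nodes; NOT the real structs. */
typedef enum {
    NODE_SEQ_SCAN,
    NODE_INDEX_SCAN,
    NODE_SORT,
    NODE_LIMIT
} MockNodeKind;

typedef struct MockPlan {
    MockNodeKind kind;
    bool op_in_index_orderby; /* <-> appears in the index ORDER BY clause */
    bool op_elsewhere;        /* <-> appears in target list, qual or sort key */
    struct MockPlan *left;
    struct MockPlan *right;
} MockPlan;

/* Returns true if the operator shows up anywhere other than the ORDER BY
 * clause of an index scan, i.e. where the metric cannot be inferred. */
static bool plan_has_standalone_operator(const MockPlan *plan)
{
    if (plan == NULL)
        return false;

    /* Any use outside an index ORDER BY is standalone usage. */
    if (plan->op_elsewhere)
        return true;

    /* An "index ORDER BY" flag on a non-index node is also suspicious. */
    if (plan->op_in_index_orderby && plan->kind != NODE_INDEX_SCAN)
        return true;

    /* Recurse into both children. */
    return plan_has_standalone_operator(plan->left) ||
           plan_has_standalone_operator(plan->right);
}
```

With this shape, a Limit over an index scan ordered by <-> passes the check, while a Sort over a sequential scan that evaluates <-> as its sort key is rejected. The second case is the one the regexp on sourceText misses.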

There is a test file, hnsw_operators_todo.sql, waiting on this change. Once the change is done, the file should be renamed to hnsw_operators.sql and included in the schedule.txt file.

dqii commented 1 year ago

Wouldn't this mean that if there is an ORDER BY <-> but the index is not triggered, e.g., because the cost estimator determined that it would be more cost effective to do a sequential search, then an error would be thrown?

dqii commented 1 year ago

It's a bit strange to me that I can't do <-> in SELECT. I did this a few times when testing pgvector to sanity check that I was getting the nearest vectors.

var77 commented 1 year ago

Yes, that is true: it will throw an error for sequential scans. Here is some backstory on why we decided to implement this feature.

We wanted to improve the UX by maintaining only one operator, which automatically determines which distance function should be called. In alternative solutions there is a separate operator for each distance function (e.g. <->, <=>, <~>), so you have to remember which operator to use for a particular index. On index scans we can determine the right distance function, because we can look up the support function from the index Relation. For a sequential scan or a plain SELECT expression we could not find a good way to determine which distance function to call. For SELECT I think it is not possible. For a seq scan there might be a way to look up the indexes defined on that column and determine the distance function from them, but we might lose some performance there.

So for now we decided to forbid using the operator outside of an index scan. Otherwise it would be confusing if, say, the operator returned L2 distance when used without an index and cosine distance when used with one.

We can discuss this topic further with @Ngalstyan4; maybe we can come up with a better solution.