lanterndata / lantern

PostgreSQL vector database extension for building AI applications
https://lantern.dev
Other
748 stars 53 forks source link

Distance operator `<->` wrongly used in CROSS-JOIN queries without a vector index #227

Closed Ngalstyan4 closed 10 months ago

Ngalstyan4 commented 10 months ago

Query:

-- 1)
CREATE TABLE imdb_reviews (
  id SERIAL PRIMARY KEY,
  imdb_id int NOT NULL UNIQUE,
  review text,
  positive_review bool
);

-- 2) populate table
 INSERT INTO imdb_reviews (imdb_id,review, positive_review) VALUES %s

-- 3) generate review_embedding column via the cloud

-- 4) Run the query
SELECT
  forall.imdb_id, 
  nearest_per_id.near_imdb_ids, nearest_per_id.imdb_dists
FROM
  (
    SELECT
      imdb_id, review_embedding
    FROM
      imdb_reviews
    LIMIT 100000
  ) AS forall
  JOIN LATERAL (
    SELECT
      ARRAY_AGG(imdb_id) AS near_imdb_ids, 
      ARRAY_AGG(imdb_dist) AS imdb_dists
    FROM
      (
        SELECT
          t2.imdb_id,
          cos_dist(forall.review_embedding, t2.review_embedding) AS imdb_dist
        FROM
          imdb_reviews t2
        ORDER BY
          forall.review_embedding <-> t2.review_embedding
        LIMIT
          5
      ) AS __unused_name
  ) nearest_per_id ON TRUE
ORDER BY
  forall.imdb_id;

Expected: the query in step 4 to fail with error Operator <-> can only be used inside of an index

Seeing: The query runs a sequential scan with the following query plan:

                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Sort  (cost=97888136.64..97888199.14 rows=25000 width=68)
   Sort Key: imdb_reviews_new111.imdb_id
   ->  Nested Loop  (cost=6117.34..97886310.43 rows=25000 width=68)
         ->  Limit  (cost=0.00..5577.00 rows=25000 width=471)
               ->  Seq Scan on imdb_reviews_new111  (cost=0.00..5577.00 rows=25000 width=471)
         ->  Memoize  (cost=6117.34..6117.36 rows=1 width=64)
               Cache Key: imdb_reviews_new111.review_embedding, imdb_reviews_new111.review_embedding
               Cache Mode: binary
               ->  Aggregate  (cost=6117.33..6117.34 rows=1 width=64)
                     ->  Limit  (cost=6117.24..6117.25 rows=5 width=12)
                           ->  Sort  (cost=6117.24..6179.74 rows=25000 width=12)
                                 Sort Key: ((imdb_reviews_new111.review_embedding <-> t2.review_embedding))
                                 ->  Seq Scan on imdb_reviews_new111 t2  (cost=0.00..5702.00 rows=25000 width=12)

Once the index is created, the plan looks like the following:

                                                                       QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=137718.49..137780.99 rows=25000 width=68)
   Sort Key: imdb_reviews_new111.imdb_id
   ->  Nested Loop  (cost=5.17..135892.29 rows=25000 width=68)
         ->  Limit  (cost=0.00..5577.00 rows=25000 width=492)
               ->  Seq Scan on imdb_reviews_new111  (cost=0.00..5577.00 rows=25000 width=492)
         ->  Aggregate  (cost=5.17..5.18 rows=1 width=64)
               ->  Limit  (cost=0.00..5.09 rows=5 width=12)
                     ->  Index Scan using imdb_reviews_new111_review_embedding_idx on imdb_reviews_new111 t2  (cost=0.00..25463.06 rows=25000 width=12)
                           Order By: (review_embedding <-> imdb_reviews_new111.review_embedding)
dqii commented 10 months ago

This is probably a missing node in plan_tree_walker.c