timescale / pgvectorscale

A complement to pgvector for high performance, cost efficient vector search on large workloads.

How can I ensure a streaming filter of diskann results rather than a table scan or other index? #109

Open mgrosso opened 1 month ago

mgrosso commented 1 month ago

Hi! Thanks for the awesome sauce.

The correct way to do a streaming filter of diskann results is unclear from the README, and I'd appreciate confirmation of a supported method.

The big motivation for me to try pgvectorscale was the "streaming post filtering" mentioned here: https://www.timescale.com/blog/how-we-made-postgresql-as-fast-as-pinecone-for-vector-data/

The only way I was able to get streaming post-filtering was to select a boolean filter expression for each condition in the inner query, apply a huge LIMIT along with the ORDER BY distance, and then wrap that query in an outer query that filters on the selected expressions.
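For reference, a compact sketch of that shape (the same pattern as the last EXPLAIN in the replication steps below; the vector value and filter come from the test data used throughout this issue):

SELECT x.* FROM (
    SELECT *, embedding <=> '[0.11, 0.29, 0.61]' AS distance,
           metadata->>'name' = 'test3' AS filter1
    FROM "testvector"."test_embedding"
    ORDER BY distance
    LIMIT 5000000
) AS x
WHERE filter1
LIMIT 10;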

Any other approach I took gives me a table scan. Here are my replication steps:

export PG_DATA=/home/admin/tmppgdata
pg_ctl -D $PG_DATA -l start.log init
pg_ctl -D $PG_DATA -l start.log start
createdb admin
psql << EOF
CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;
CREATE SCHEMA "testvector";
CREATE TABLE IF NOT EXISTS "testvector"."test_embedding"  (
    id BIGINT PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
    metadata JSONB,
    contents TEXT,
    embedding VECTOR(3)
);
CREATE INDEX test_embedding_idx ON "testvector"."test_embedding" USING diskann (embedding);
INSERT INTO "testvector"."test_embedding" (metadata, contents, embedding) VALUES ('{"name": "test1"}', 'hello world1', '[0.1, 0.3, 0.6]');
INSERT INTO "testvector"."test_embedding" (metadata, contents, embedding) VALUES ('{"name": "test2"}', 'hello world2', '[0.2, 0.4, 0.4]');
INSERT INTO "testvector"."test_embedding" (metadata, contents, embedding) VALUES ('{"name": "test3"}', 'hello world3', '[0.01, 0.2, 0.79]');
INSERT INTO "testvector"."test_embedding" (metadata, contents, embedding) VALUES ('{"name": "test4"}', 'hello world4', '[0.98, 0.01, 0.01]');
SELECT *, embedding <=> '[0.11, 0.29, 0.61]' as distance FROM "testvector"."test_embedding" ORDER BY distance LIMIT 3;
EXPLAIN SELECT *, embedding <=> '[0.11, 0.29, 0.61]' as distance FROM "testvector"."test_embedding" ORDER BY distance LIMIT 3;
EXPLAIN SELECT *, embedding <=> '[0.11, 0.29, 0.61]' as distance FROM "testvector"."test_embedding" WHERE to_tsvector(contents) @@ plainto_tsquery('world3') ORDER BY distance LIMIT 3;
EXPLAIN SELECT x.* FROM
(
        SELECT *, embedding <=> '[0.11, 0.29, 0.61]' as distance, metadata->>'name' = 'test3' as filter1, to_tsvector(contents) @@ plainto_tsquery('world3') as filter2
        FROM "testvector"."test_embedding" ORDER BY distance
) as x
WHERE filter1 = true AND filter2 = true
LIMIT 10;
EXPLAIN SELECT x.* FROM
(
        SELECT *, embedding <=> '[0.11, 0.29, 0.61]' as distance, metadata->>'name' = 'test3' as filter1, to_tsvector(contents) @@ plainto_tsquery('world3') as filter2
        FROM "testvector"."test_embedding" ORDER BY distance LIMIT 5000000
) as x
WHERE filter1 = true AND filter2 = true
LIMIT 10
;

EOF

You'll see that the first and last EXPLAIN statements use the index, whereas the middle ones do a table scan.

I'd love to see more guidance in the README on how to ensure the query plan does a streaming filter on diskann results, as I'd like to avoid building a solution at scale whose query plan could randomly break and trigger table scans.
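(One blunt, test-only workaround for forcing the comparison is the standard PostgreSQL planner GUC enable_seqscan; this is not a pgvectorscale feature and not something to rely on in production, but a sketch of the experiment would be:)

BEGIN;
SET LOCAL enable_seqscan = off;  -- heavily penalizes seq scans so an index scan is preferred if available
EXPLAIN SELECT *, embedding <=> '[0.11, 0.29, 0.61]' AS distance
FROM "testvector"."test_embedding"
WHERE to_tsvector(contents) @@ plainto_tsquery('world3')
ORDER BY distance LIMIT 3;
ROLLBACK;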

As a secondary issue, the query that generates the right plan isn't friendly to generate from most ORMs, but I'd happily live with that pain if I knew the query plan would be stable for that query.

Thanks! Matt


Edit (7/13/2024): The problem is worse than I thought. Running a VACUUM FULL causes the filtering query that previously used diskann to fall back to a table scan, so there is no simple way to ensure the query uses post-filtered diskann. Perhaps it will do what I want if I load more data and then ANALYZE again, so that the table scan looks more expensive? I'll test that.
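For concreteness, the experiment described above would look something like this (a sketch, not yet run; it mirrors the doubling INSERT shown in the next comment):

INSERT INTO "testvector"."test_embedding" (metadata, contents, embedding)
SELECT metadata, contents, embedding FROM "testvector"."test_embedding";
ANALYZE "testvector"."test_embedding";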

I also couldn't get pg_hint_plan to fix the issue for me, but I'm a pg_hint_plan newb so PEBKAC is a possibility.
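For reference, the kind of hint attempted was roughly the following (a sketch; it assumes pg_hint_plan is installed and loadable, and I may well be holding it wrong):

LOAD 'pg_hint_plan';
/*+ IndexScan(test_embedding test_embedding_idx) */
EXPLAIN SELECT *, embedding <=> '[0.11, 0.29, 0.61]' AS distance
FROM "testvector"."test_embedding"
WHERE to_tsvector(contents) @@ plainto_tsquery('world3')
ORDER BY distance LIMIT 3;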

mgrosso commented 1 month ago

The optimizer does eventually choose diskann after some repeated applications of:

insert into another_test_embedding (metadata, contents, embedding) select metadata, contents, embedding from another_test_embedding;

mgrosso commented 1 month ago

But... the query via diskann then returns 0 rows, despite there being plenty of them when a table scan is used. Here are my queries and results:

admin=# EXPLAIN SELECT x.* FROM
(
        SELECT *, embedding <=> '[0.11, 0.29, 0.61]' as distance, metadata->>'name' = 'test3' as filter1, to_tsvector(contents) @@ plainto_tsquery('world3') as filter2
        FROM "another_test_embedding"
) as x
WHERE filter1 = true AND filter2 = true
LIMIT 10
;
                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..17322.95 rows=1 width=70)
   ->  Seq Scan on another_test_embedding  (cost=0.00..17322.95 rows=1 width=70)
         Filter: (((metadata ->> 'name'::text) = 'test3'::text) AND (to_tsvector(contents) @@ plainto_tsquery('world3'::text)))
(3 rows)

admin=# SELECT x.* FROM
(
        SELECT *, embedding <=> '[0.11, 0.29, 0.61]' as distance, metadata->>'name' = 'test3' as filter1, to_tsvector(contents) @@ plainto_tsquery('world3') as filter2
        FROM "another_test_embedding"
) as x
WHERE filter1 = true AND filter2 = true
LIMIT 10
;
 id |     metadata      |   contents   |    embedding    |      distance       | filter1 | filter2 
----+-------------------+--------------+-----------------+---------------------+---------+---------
  3 | {"name": "test3"} | hello world3 | [0.01,0.2,0.79] | 0.02996869959321613 | t       | t
 11 | {"name": "test3"} | hello world3 | [0.01,0.2,0.79] | 0.02996869959321613 | t       | t
 19 | {"name": "test3"} | hello world3 | [0.01,0.2,0.79] | 0.02996869959321613 | t       | t
 27 | {"name": "test3"} | hello world3 | [0.01,0.2,0.79] | 0.02996869959321613 | t       | t
 35 | {"name": "test3"} | hello world3 | [0.01,0.2,0.79] | 0.02996869959321613 | t       | t
 43 | {"name": "test3"} | hello world3 | [0.01,0.2,0.79] | 0.02996869959321613 | t       | t
 51 | {"name": "test3"} | hello world3 | [0.01,0.2,0.79] | 0.02996869959321613 | t       | t
 59 | {"name": "test3"} | hello world3 | [0.01,0.2,0.79] | 0.02996869959321613 | t       | t
 67 | {"name": "test3"} | hello world3 | [0.01,0.2,0.79] | 0.02996869959321613 | t       | t
 75 | {"name": "test3"} | hello world3 | [0.01,0.2,0.79] | 0.02996869959321613 | t       | t
(10 rows)

admin=# EXPLAIN SELECT x.* FROM
(
        SELECT *, embedding <=> '[0.11, 0.29, 0.61]' as distance, metadata->>'name' = 'test3' as filter1, to_tsvector(contents) @@ plainto_tsquery('world3') as filter2
        FROM "another_test_embedding" ORDER BY distance LIMIT 5000000
) as x
WHERE filter1 = true AND filter2 = true
LIMIT 10
;
                                                             QUERY PLAN                                                              
-------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=78.46..101.44 rows=10 width=70)
   ->  Subquery Scan on x  (cost=78.46..18905.50 rows=8192 width=70)
         Filter: (x.filter1 AND x.filter2)
         ->  Limit  (cost=78.46..18577.82 rows=32768 width=70)
               ->  Index Scan using another_test_embedding_idx on another_test_embedding  (cost=78.46..18577.82 rows=32768 width=70)
                     Order By: (embedding <=> '[0.11,0.29,0.61]'::vector)
(6 rows)

admin=# SELECT x.* FROM
(
        SELECT *, embedding <=> '[0.11, 0.29, 0.61]' as distance, metadata->>'name' = 'test3' as filter1, to_tsvector(contents) @@ plainto_tsquery('world3') as filter2
        FROM "another_test_embedding" ORDER BY distance LIMIT 5000000
) as x
WHERE filter1 = true AND filter2 = true
LIMIT 10
;

 id | metadata | contents | embedding | distance | filter1 | filter2 
----+----------+----------+-----------+----------+---------+---------
(0 rows)

admin=# 

More details for replication:

admin=# \d another_test_embedding
                      Table "public.another_test_embedding"
  Column   |   Type    | Collation | Nullable |             Default              
-----------+-----------+-----------+----------+----------------------------------
 id        | bigint    |           | not null | generated by default as identity
 metadata  | jsonb     |           |          | 
 contents  | text      |           |          | 
 embedding | vector(3) |           |          | 
Indexes:
    "another_test_embedding_pkey" PRIMARY KEY, btree (id)
    "another_test_embedding_idx" diskann (embedding)

admin=# select count(*) from another_test_embedding;
 count 
-------
 32768
(1 row)

admin=# select metadata, contents, embedding, count(*) from another_test_embedding group by 1,2,3 order by 1,2,3;
     metadata      |      contents       |    embedding     | count 
-------------------+---------------------+------------------+-------
 {"name": "test1"} | hello world1        | [0.1,0.3,0.6]    |  4096
 {"name": "test2"} | hello world2        | [0.2,0.4,0.4]    |  4096
 {"name": "test3"} | hello world3        | [0.01,0.2,0.79]  |  4096
 {"name": "test4"} | hello world4        | [0.98,0.01,0.01] |  4096
 {"name": "test5"} | hello world5        | [0.15,0.35,0.5]  |  4096
 {"name": "test6"} | hello pipey |world6 | [0.25,0.45,0.3]  |  4096
 {"name": "test7"} | hello world7        | [0.01,0.25,0.74] |  4096
                   | hello world8        | [0.88,0.05,0.05] |  4096
(8 rows)

I hope this is helpful. I'm happy to try out patches or workarounds or debug gathering techniques as this is a test system dedicated to testing pgvectorscale.

mgrosso commented 1 month ago

I played with setting diskann.query_search_list_size as high as 10000 and diskann.query_rescore to 1000 and still got no rows back when the query plan uses diskann.

mgrosso commented 1 month ago

I gave it another shot, rebuilding the index after a VACUUM FULL FREEZE. No success.

admin=# drop index another_test_embedding_idx;
DROP INDEX
admin=# vacuum full freeze another_test_embedding ;
VACUUM
admin=# CREATE INDEX another_test_embedding_idx ON another_test_embedding USING diskann (embedding);
NOTICE:  Starting index build. num_neighbors=-1 search_list_size=100, max_alpha=1.2, storage_layout=SbqCompression
WARNING:  Indexed 32768 tuples
CREATE INDEX
admin=# SELECT x.* FROM
(
        SELECT *, embedding <=> '[0.11, 0.29, 0.61]' as distance, metadata->>'name' = 'test3' as filter1, to_tsvector(contents) @@ plainto_tsquery('world3') as filter2
        FROM "another_test_embedding" ORDER BY distance LIMIT 5000000
) as x
WHERE filter1 = true AND filter2 = true
LIMIT 10
;
 id | metadata | contents | embedding | distance | filter1 | filter2 
----+----------+----------+-----------+----------+---------+---------
(0 rows)

admin=# explain SELECT x.* FROM
(
        SELECT *, embedding <=> '[0.11, 0.29, 0.61]' as distance, metadata->>'name' = 'test3' as filter1, to_tsvector(contents) @@ plainto_t
squery('world3') as filter2
        FROM "another_test_embedding" ORDER BY distance LIMIT 5000000
) as x
WHERE filter1 = true AND filter2 = true
LIMIT 10
;
                                                             QUERY PLAN                                                              
-------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=78.46..101.44 rows=10 width=70)
   ->  Subquery Scan on x  (cost=78.46..18905.50 rows=8192 width=70)
         Filter: (x.filter1 AND x.filter2)
         ->  Limit  (cost=78.46..18577.82 rows=32768 width=70)
               ->  Index Scan using another_test_embedding_idx on another_test_embedding  (cost=78.46..18577.82 rows=32768 width=70)
                     Order By: (embedding <=> '[0.11,0.29,0.61]'::vector)
(6 rows)

mgrosso commented 1 month ago

I ran the query above that uses the index and retrieves no rows with EXPLAIN ANALYZE, and saw that the actual number of rows returned from the index scan was 51. The inner limit was 5000000 and the outer limit was 10, but zero rows reach the outer limit, presumably because the first 51 didn't match the filter. I'm wondering if the 51 corresponds to some setting of 50. Perhaps the off-distribution nature of the data set is exposing an edge case when a lot of vectors have the same value.

admin=*# \p
explain analyze SELECT x.* FROM                                                                                                                                                                                                                                                              
(                                                                                                                                                                                                                                                                                            
        SELECT *, embedding <=> '[0.11, 0.29, 0.61]' as distance, metadata->>'name' = 'test3' as filter1, to_tsvector(contents) @@ plainto_tsquery('world3') as filter2                                                                                                                      
        FROM "another_test_embedding" ORDER BY distance LIMIT 5000000                                                                                                                                                                                                                        
) as x                                                                                                                                                                                                                                                                                       
WHERE filter1 = true AND filter2 = true                                
LIMIT 10                                                   
;                                                                                                                                                                                                                                                                                            
admin=*# \g                                                 
                                                                                   QUERY PLAN                                                                                     
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=78.46..101.44 rows=10 width=70) (actual time=0.399..0.400 rows=0 loops=1)
   ->  Subquery Scan on x  (cost=78.46..18905.50 rows=8192 width=70) (actual time=0.398..0.399 rows=0 loops=1)                                
         Filter: (x.filter1 AND x.filter2)                             
         Rows Removed by Filter: 51                        
         ->  Limit  (cost=78.46..18577.82 rows=32768 width=70) (actual time=0.264..0.395 rows=51 loops=1)
               ->  Index Scan using another_test_embedding_idx2 on another_test_embedding  (cost=78.46..18577.82 rows=32768 width=70) (actual time=0.263..0.391 rows=51 loops=1)
                     Order By: (embedding <=> '[0.11,0.29,0.61]'::vector)                                                                                                                                                                                                                    
 Planning Time: 0.282 ms                                   
 Execution Time: 0.421 ms                                  
(9 rows)                                                                                                                                                                                                                                                                                     

I still get the same results even after dropping the index, doing a VACUUM FULL FREEZE, recreating the index with a new name, and running the query again. There are more pages in the index than in the base relation, and the index build takes a while, so presumably there is something in there.
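(The page-count observation can be checked with the same pg_class query that appears further down:)

select relname, reltuples, relpages from pg_class where relname like '%another_test_embed%';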

Plenty of rows match the filter:

admin=*# SELECT count(*) FROM "another_test_embedding" WHERE metadata->>'name' = 'test3' and to_tsvector(contents) @@ plainto_tsquery('world3')
;                                                                                                                                                                                                                                                                                            
 count                                                                                                                                                                                                                                                                                       
-------                                                                                                                                                                                                                                                                                      
  4096                                                                                                                                                                                                                                                                                       
(1 row)                                                                                                                                                                                                                                                                                      

Time: 11.959 ms                                            
admin=*# SELECT min(id), max(id) FROM "another_test_embedding" WHERE metadata->>'name' = 'test3' and to_tsvector(contents) @@ plainto_tsquery('world3')                                                                                                                                      
;                                                           
 min |  max                                                                                                                                                                                                                                                                                  
-----+-------                                                                                                                                                                                                                                                                                
   3 | 32763                                                                                                                                  
(1 row)                                                                                                                                       

Rebuilding the index after a VACUUM FULL FREEZE doesn't fix it:

admin=# drop index another_test_embedding_idx2 ;
DROP INDEX
Time: 3.010 ms                                             
admin=# vacuum full freeze another_test_embedding;         
VACUUM                                                                                                                                      
Time: 37.862 ms                                                      
admin=# CREATE INDEX another_test_embedding_idx3 ON another_test_embedding USING diskann (embedding);
NOTICE:  Starting index build. num_neighbors=-1 search_list_size=100, max_alpha=1.2, storage_layout=SbqCompression
WARNING:  Indexed 32768 tuples                                                                                                              
CREATE INDEX                                                                                                                                
Time: 23393.192 ms (00:23.393)                                                                                                              
admin=# SELECT x.* FROM
(
        SELECT *, embedding <=> '[0.11, 0.29, 0.61]' as distance, metadata->>'name' = 'test3' as filter1, to_tsvector(contents) @@ plainto_tsquery('world3') as filter2
        FROM "another_test_embedding" ORDER BY distance LIMIT 5000000                                                                       
) as x                                                                                                                                      
WHERE filter1 = true AND filter2 = true                                                                                                     
LIMIT 10                                                                                                                                    
;                                                                                                                                           
 id | metadata | contents | embedding | distance | filter1 | filter2                                                                        
----+----------+----------+-----------+----------+---------+---------                                                                       
(0 rows)                                                                                                                                    

Time: 0.949 ms                                                                                                                              

And both of the test tables have the right number of tuples in all the indexes and a believable number of pages:

admin=# select relname, reltuples, relpages from pg_class where relname like '%test_embed%';
            relname            | reltuples | relpages                                                                                       
-------------------------------+-----------+----------                                                                                      
 test_embedding_id_seq         |         1 |        1                                                                                       
 test_embedding                |         8 |        1                                                                                       
 test_embedding_pkey           |         8 |        2                                                                                       
 test_embedding_idx            |         8 |        3                                                                                       
 another_test_embedding_id_seq |         1 |        1                                                                                       
 another_test_embedding        |     32768 |      365                                                                                       
 another_test_embedding_pkey   |     32768 |       92                                                                                       
 another_test_embedding_idx3   |     32768 |     1823                                                                                       
(8 rows)                                                                                                                                    

mgrosso commented 1 month ago

I'll probably open a separate issue to track the "no results" problem separately from "how to ensure it uses diskann with a streaming filter". The initial issue could probably be addressed with a README update showing how you ensure streaming diskann filters in practice, but the "no results" behavior for this dataset and query looks more like a bug, one that is only triggered once you've figured out how to get the planner to use diskann.

mgrosso commented 1 month ago

If I drop the filter and just use the inner part of the query, it correctly returns some of the rows whose vectors are close to the query vector but which do not match the filter.

admin=# EXPLAIN SELECT *, embedding <=> '[0.11, 0.29, 0.61]' as distance FROM "another_test_embedding" ORDER BY distance LIMIT 5;
                                                       QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=78.46..78.74 rows=5 width=68)
   ->  Index Scan using another_test_embedding_idx3 on another_test_embedding  (cost=78.46..1948.06 rows=32768 width=68)
         Order By: (embedding <=> '[0.11,0.29,0.61]'::vector)
(3 rows)

Time: 0.405 ms
admin=# EXPLAIN ANALYZE SELECT *, embedding <=> '[0.11, 0.29, 0.61]' as distance FROM "another_test_embedding" ORDER BY distance LIMIT 5;
                                                                            QUERY PLAN                                                                             
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=78.46..78.74 rows=5 width=68) (actual time=0.152..0.156 rows=5 loops=1)
   ->  Index Scan using another_test_embedding_idx3 on another_test_embedding  (cost=78.46..1948.06 rows=32768 width=68) (actual time=0.151..0.155 rows=5 loops=1)
         Order By: (embedding <=> '[0.11,0.29,0.61]'::vector)
 Planning Time: 0.153 ms
 Execution Time: 0.172 ms
(5 rows)

Time: 0.549 ms
admin=# SELECT *, embedding <=> '[0.11, 0.29, 0.61]' as distance FROM "another_test_embedding" ORDER BY distance LIMIT 5;
 id  |     metadata      |   contents   |   embedding   |       distance        
-----+-------------------+--------------+---------------+-----------------------
  33 | {"name": "test1"} | hello world1 | [0.1,0.3,0.6] | 0.0002832322335062365
  41 | {"name": "test1"} | hello world1 | [0.1,0.3,0.6] | 0.0002832322335062365
  81 | {"name": "test1"} | hello world1 | [0.1,0.3,0.6] | 0.0002832322335062365
 153 | {"name": "test1"} | hello world1 | [0.1,0.3,0.6] | 0.0002832322335062365
 289 | {"name": "test1"} | hello world1 | [0.1,0.3,0.6] | 0.0002832322335062365
(5 rows)
mgrosso commented 1 month ago

But I can't get more than 51 rows out of the diskann index. This is the fundamental issue, and it lands us back in pgvector HNSW territory, similar to https://github.com/pgvector/pgvector/issues/263.

Of course we could say "just use a regular index on the filters" (sketched after the query below), but we need to be able to use a column indexed with diskann and get reasonable results back even when the index scan is not the optimal query plan, and this test is designed to determine whether we can.

admin=# SELECT COUNT(*) FROM (SELECT *, embedding <=> '[0.11, 0.29, 0.61]' as distance FROM "another_test_embedding" ORDER BY distance LIMIT 59) as x;
 count 
-------
    51
(1 row)

Time: 0.631 ms
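For completeness, the "regular index on the filters" alternative mentioned above would just be plain PostgreSQL indexes, something like this sketch (index names are made up; the GIN index needs the two-argument to_tsvector so the expression is immutable, and the query would have to use the same two-argument form to match it):

CREATE INDEX another_test_embedding_name_idx
    ON another_test_embedding ((metadata->>'name'));
CREATE INDEX another_test_embedding_fts_idx
    ON another_test_embedding USING gin (to_tsvector('english', contents));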
mgrosso commented 1 month ago

The two GUC parameters don't influence the 51.

admin=# begin; set local diskann.query_search_list_size = 10000;
BEGIN
Time: 0.128 ms
SET
Time: 0.066 ms
admin=*# SELECT COUNT(*) FROM (SELECT *, embedding <=> '[0.11, 0.29, 0.61]' as distance FROM "another_test_embedding" ORDER BY distance LIMIT 59) as x;
 count 
-------
    51
(1 row)

Time: 0.696 ms
admin=*# set local diskann.query_rescore = 1000;
SET
Time: 0.142 ms
admin=*# SELECT COUNT(*) FROM (SELECT *, embedding <=> '[0.11, 0.29, 0.61]' as distance FROM "another_test_embedding" ORDER BY distance LIMIT 59) as x;
 count 
-------
    51
(1 row)
mgrosso commented 1 month ago

OK, so the 51 was just num_neighbors + 1, at least at the default num_neighbors of 50.

admin=# drop index another_test_embedding_idx3;
DROP INDEX                                                                                                                                                                                                                                                                                   
Time: 3.126 ms
admin=# create index another_test_embedding_idx4 on another_test_embedding USING diskann ( embedding ) WITH (num_neighbors = 1000);
NOTICE:  Starting index build. num_neighbors=1000 search_list_size=100, max_alpha=1.2, storage_layout=SbqCompression
WARNING:  Indexed 32768 tuples
CREATE INDEX
Time: 51039.734 ms (00:51.040)
admin=# SELECT COUNT(*) FROM (SELECT *, embedding <=> '[0.11, 0.29, 0.61]' as distance FROM "another_test_embedding" ORDER BY distance LIMIT 59) as x;            
 count                                                                                                                                        
-------                                                                                                                                       
    59                                            
(1 row)                                                   

So a value of num_neighbors greater than the limit allows the limit to work, but the bug here isn't that simple. I expected a limit of 2000 to return 1001 (the new num_neighbors + 1), but no...

admin=# explain SELECT COUNT(*) FROM (SELECT *, embedding <=> '[0.11, 0.29, 0.61]' as distance FROM "another_test_embedding" ORDER BY distance LIMIT 2000) as x;
                                                            QUERY PLAN                                                             
----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1453.57..1453.58 rows=1 width=8)
   ->  Limit  (cost=1314.46..1428.57 rows=2000 width=112)
         ->  Index Scan using another_test_embedding_idx4 on another_test_embedding  (cost=1314.46..3184.06 rows=32768 width=112)
               Order By: (embedding <=> '[0.11,0.29,0.61]'::vector)
(4 rows)

Time: 0.426 ms
admin=# explain analyze SELECT COUNT(*) FROM (SELECT *, embedding <=> '[0.11, 0.29, 0.61]' as distance FROM "another_test_embedding" ORDER BY distance LIMIT 2000) as x;
                                                                                   QUERY PLAN                                                                                     
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1453.57..1453.58 rows=1 width=8) (actual time=53.471..53.472 rows=1 loops=1)
   ->  Limit  (cost=1314.46..1428.57 rows=2000 width=112) (actual time=51.766..53.399 rows=2000 loops=1)
         ->  Index Scan using another_test_embedding_idx4 on another_test_embedding  (cost=1314.46..3184.06 rows=32768 width=112) (actual time=51.765..53.287 rows=2000 loops=1)
               Order By: (embedding <=> '[0.11,0.29,0.61]'::vector)
 Planning Time: 0.181 ms
 Execution Time: 53.496 ms
(6 rows)

Time: 53.997 ms
admin=# SELECT COUNT(*) FROM (SELECT *, embedding <=> '[0.11, 0.29, 0.61]' as distance FROM "another_test_embedding" ORDER BY distance LIMIT 2000) as x;
 count 
-------
  2000
(1 row)

Time: 38.458 ms
admin=# explain SELECT COUNT(*) FROM (SELECT *, embedding <=> '[0.11, 0.29, 0.61]' as distance FROM "another_test_embedding" ORDER BY distance LIMIT 200000) as x;
                                                            QUERY PLAN                                                             
----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=3593.66..3593.67 rows=1 width=8)
   ->  Limit  (cost=1314.46..3184.06 rows=32768 width=112)
         ->  Index Scan using another_test_embedding_idx4 on another_test_embedding  (cost=1314.46..3184.06 rows=32768 width=112)
               Order By: (embedding <=> '[0.11,0.29,0.61]'::vector)
(4 rows)

Time: 0.447 ms
admin=# explain analyze SELECT COUNT(*) FROM (SELECT *, embedding <=> '[0.11, 0.29, 0.61]' as distance FROM "another_test_embedding" ORDER BY distance LIMIT 200000) as x;
                                                                                    QUERY PLAN                                                                                      
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=3593.66..3593.67 rows=1 width=8) (actual time=305.061..305.063 rows=1 loops=1)
   ->  Limit  (cost=1314.46..3184.06 rows=32768 width=112) (actual time=29.899..303.223 rows=32768 loops=1)
         ->  Index Scan using another_test_embedding_idx4 on another_test_embedding  (cost=1314.46..3184.06 rows=32768 width=112) (actual time=29.898..301.072 rows=32768 loops=1)
               Order By: (embedding <=> '[0.11,0.29,0.61]'::vector)
 Planning Time: 0.167 ms
 Execution Time: 305.085 ms
(6 rows)

admin=# SELECT COUNT(*) FROM (SELECT *, embedding <=> '[0.11, 0.29, 0.61]' as distance FROM "another_test_embedding" ORDER BY distance LIMIT 200000) as x;
 count 
-------
 32768
(1 row)

Since the table has 32768 rows, that last count is perfect.

As a bonus, our original "no rows, should be rows" query returns rows now:

admin=# explain analyze SELECT x.* FROM
(
        SELECT *, embedding <=> '[0.11, 0.29, 0.61]' as distance, metadata->>'name' = 'test3' as filter1, to_tsvector(contents) @@ plainto_tsquery('world3') as filter2
        FROM "another_test_embedding" ORDER BY distance LIMIT 5000000
) as x
WHERE filter1 = true AND filter2 = true
LIMIT 10
;
                                                                                       QUERY PLAN                                                                                       
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1314.46..1337.44 rows=10 width=70) (actual time=151.473..151.503 rows=10 loops=1)
   ->  Subquery Scan on x  (cost=1314.46..20141.50 rows=8192 width=70) (actual time=151.472..151.501 rows=10 loops=1)
         Filter: (x.filter1 AND x.filter2)
         Rows Removed by Filter: 8192
         ->  Limit  (cost=1314.46..19813.82 rows=32768 width=70) (actual time=50.327..151.029 rows=8202 loops=1)
               ->  Index Scan using another_test_embedding_idx4 on another_test_embedding  (cost=1314.46..19813.82 rows=32768 width=70) (actual time=50.327..150.444 rows=8202 loops=1)
                     Order By: (embedding <=> '[0.11,0.29,0.61]'::vector)
 Planning Time: 0.280 ms
 Execution Time: 151.528 ms
(9 rows)

Time: 152.165 ms
admin=# SELECT x.* FROM
(
        SELECT *, embedding <=> '[0.11, 0.29, 0.61]' as distance, metadata->>'name' = 'test3' as filter1, to_tsvector(contents) @@ plainto_tsquery('world3') as filter2
        FROM "another_test_embedding" ORDER BY distance LIMIT 5000000
) as x
WHERE filter1 = true AND filter2 = true
LIMIT 10
;
  id   |     metadata      |   contents   |    embedding    |      distance       | filter1 | filter2 
-------+-------------------+--------------+-----------------+---------------------+---------+---------
  5843 | {"name": "test3"} | hello world3 | [0.01,0.2,0.79] | 0.02996869959321613 | t       | t
 14627 | {"name": "test3"} | hello world3 | [0.01,0.2,0.79] | 0.02996869959321613 | t       | t
 10379 | {"name": "test3"} | hello world3 | [0.01,0.2,0.79] | 0.02996869959321613 | t       | t
  7139 | {"name": "test3"} | hello world3 | [0.01,0.2,0.79] | 0.02996869959321613 | t       | t
  7979 | {"name": "test3"} | hello world3 | [0.01,0.2,0.79] | 0.02996869959321613 | t       | t
 27155 | {"name": "test3"} | hello world3 | [0.01,0.2,0.79] | 0.02996869959321613 | t       | t
 24611 | {"name": "test3"} | hello world3 | [0.01,0.2,0.79] | 0.02996869959321613 | t       | t
 20987 | {"name": "test3"} | hello world3 | [0.01,0.2,0.79] | 0.02996869959321613 | t       | t
  4107 | {"name": "test3"} | hello world3 | [0.01,0.2,0.79] | 0.02996869959321613 | t       | t
 11123 | {"name": "test3"} | hello world3 | [0.01,0.2,0.79] | 0.02996869959321613 | t       | t
(10 rows)

Time: 132.784 ms
admin=# 

As you might expect, it's not a small index with num_neighbors = 1000...

admin=# select relname, reltuples, relpages from pg_class where relname like 'another_test_embed%';
            relname            | reltuples | relpages 
-------------------------------+-----------+----------
 another_test_embedding        |     32768 |      365
 another_test_embedding_id_seq |         1 |        1
 another_test_embedding_idx4   |     32768 |    32770
 another_test_embedding_pkey   |     32768 |       92
(4 rows)

but at least it handles the filtering edge case.

I'm a little worried about using 1000 for this parameter, and also wondering about interactions with quantization and so forth, so I'll play around some more before creating the second ticket.

mgrosso commented 1 month ago

OK, some progress in pinning this down.

  1. Using the default num_neighbors=50 with storage_layout='plain' made no difference: same failure mode, with the count that should return 32768 returning 51.
  2. At num_neighbors=112 with the default storage layout, the count query above correctly returned 32768 using the index scan, and the filtered query above correctly got past the 8192 more-relevant results to find 10 of the best filter-matching rows.
  3. At num_neighbors=100, the inner index scan never returns more than 101 rows, so we have the same bug: the count returns 101 and the filtered query returns no rows.
  4. At num_neighbors=101, the inner index scan never returns more than 102 rows, so the same buggy responses.
  5. At num_neighbors=102 the IndexScan stops at 4092!?
  6. At num_neighbors=106 the index scan stops at 32475!

This has been a wild ride so far. I had a hunch that num_neighbors might need to be more than search_list_size, but the "almost power of two" thing happening at num_neighbors 102+ is trippy.
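For reference, the index variants behind the list above look roughly like this (a sketch; the index names are placeholders, and the option values are the ones named in the list):

CREATE INDEX another_test_embedding_idx_plain ON another_test_embedding
    USING diskann (embedding) WITH (storage_layout = 'plain');  -- default num_neighbors=50
CREATE INDEX another_test_embedding_idx_nn112 ON another_test_embedding
    USING diskann (embedding) WITH (num_neighbors = 112);       -- default SbqCompression layout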

mgrosso commented 1 month ago

OK, so the maximum number of unfiltered rows returned by the IndexScan on the diskann index, with other settings at their defaults, was:

num_neighbors, max rows
50, 51
100, 101
101, 102
102, 4092
103, 4096
104, 12288
105, 24576
106, 32475
107, 32478
108, 32468
109, 32768
110, 32768
112, 32768

Note there were only 32768 rows for these tests, so I'll add rows shortly. Also note that this function is not monotonically increasing; see 105 -> 106 and 107 -> 108.
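For reference, the probe behind each row of the table above was the same count wrapper used earlier, re-run after rebuilding the index with the given num_neighbors (a sketch, using 109 as an example value):

DROP INDEX IF EXISTS another_test_embedding_idx4;
CREATE INDEX another_test_embedding_idx4 ON another_test_embedding
    USING diskann (embedding) WITH (num_neighbors = 109);
SELECT COUNT(*) FROM (
    SELECT *, embedding <=> '[0.11, 0.29, 0.61]' AS distance
    FROM another_test_embedding ORDER BY distance LIMIT 200000
) AS x;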

mgrosso commented 1 month ago

Last note here before I file the other ticket: when running against a new table with similar data, doubled until there are 1048576 rows, a setting of num_neighbors = 109 is sufficient for the Index Scan to correctly return all of the rows to a surrounding filter. The query was

SELECT COUNT(*) FROM (SELECT *, embedding <=> '[0.11, 0.29, 0.61]' as distance FROM "test_embedding" ORDER BY distance LIMIT 2000000) as x;

and the result was, correctly, 1048576.

cevian commented 1 month ago

Replied in #110, let's continue the conversation there.