timescale / pgvectorscale

A complement to pgvector for high performance, cost efficient vector search on large workloads.

Index Scan only returning 51 rows to surrounding filtered query with num_neighbors default of 50. #110

Closed. mgrosso closed this issue 1 week ago

mgrosso commented 1 month ago

(Note: This bug was discovered while documenting https://github.com/timescale/pgvectorscale/issues/109 and most of my first 13 comments there relate to this.)

index scan terminates early

summary

  1. With default index build settings, including num_neighbors, an index scan of a diskann index will terminate after 51 rows; a query in which that index scan is the child of a filter that filters out the 51 most relevant rows will not return any rows at all.
  2. With num_neighbors build settings from 102 through 108, the maximum number of rows returned by the index scan increases in a non-linear and non-monotonic fashion, detailed below, at least with the replication steps I followed.
  3. Increasing num_neighbors to 109 fixes the issue for the replication steps I went through, even when the number of rows in the table was increased to 2^20.

replication steps

tl;dr: we insert 8 distinct rows, then double them until we have 32768, giving us ~4k closer rows to filter past before we get to the less relevant rows that match our filter.

test data set

Those initial rows will look like this:

admin=# select * from test_embedding;
 id |     metadata      |      contents       |    embedding
----+-------------------+---------------------+------------------
  1 | {"name": "test1"} | hello world1        | [0.1,0.3,0.6]
  2 | {"name": "test2"} | hello world2        | [0.2,0.4,0.4]
  3 | {"name": "test3"} | hello world3        | [0.01,0.2,0.79]
  4 | {"name": "test4"} | hello world4        | [0.98,0.01,0.01]
  5 | {"name": "test5"} | hello world5        | [0.15,0.35,0.5]
  6 | {"name": "test6"} | hello pipey |world6 | [0.25,0.45,0.3]
  7 | {"name": "test7"} | hello world7        | [0.01,0.25,0.74]
  8 |                   | hello world8        | [0.88,0.05,0.05]
(8 rows)

I created them more or less like this:

 cat > /home/yournamehere/testdata.txt << EOF
{"name": "test5"}|hello world5|[0.15, 0.35, 0.5]
{"name": "test6"}|hello pipey \|world6|[0.25, 0.45, 0.3]
{"name": "test7"}|hello world7|[0.01, 0.25, 0.74]
\N|hello world8|[0.88, 0.05, 0.05]
EOF

psql << EOF
CREATE TABLE test_embedding (
    id BIGINT PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
    metadata JSONB,
    contents TEXT,
    embedding VECTOR(3)
);
INSERT INTO "test_embedding" (metadata, contents, embedding) VALUES ('{"name": "test1"}', 'hello world1', '[0.1, 0.3, 0.6]');
INSERT INTO "test_embedding" (metadata, contents, embedding) VALUES ('{"name": "test2"}', 'hello world2', '[0.2, 0.4, 0.4]');
INSERT INTO "test_embedding" (metadata, contents, embedding) VALUES ('{"name": "test3"}', 'hello world3', '[0.01, 0.2, 0.79]');
INSERT INTO "test_embedding" (metadata, contents, embedding) VALUES ('{"name": "test4"}', 'hello world4', '[0.98, 0.01, 0.01]');
COPY "test_embedding" (metadata, contents, embedding) FROM '/home/yournamehere/testdata.txt' WITH (FORMAT TEXT, DELIMITER '|');
INSERT INTO test_embedding (metadata, contents, embedding) SELECT metadata, contents, embedding FROM test_embedding;
INSERT INTO test_embedding (metadata, contents, embedding) SELECT metadata, contents, embedding FROM test_embedding;
INSERT INTO test_embedding (metadata, contents, embedding) SELECT metadata, contents, embedding FROM test_embedding;
INSERT INTO test_embedding (metadata, contents, embedding) SELECT metadata, contents, embedding FROM test_embedding;
INSERT INTO test_embedding (metadata, contents, embedding) SELECT metadata, contents, embedding FROM test_embedding;
INSERT INTO test_embedding (metadata, contents, embedding) SELECT metadata, contents, embedding FROM test_embedding;
INSERT INTO test_embedding (metadata, contents, embedding) SELECT metadata, contents, embedding FROM test_embedding;
INSERT INTO test_embedding (metadata, contents, embedding) SELECT metadata, contents, embedding FROM test_embedding;
INSERT INTO test_embedding (metadata, contents, embedding) SELECT metadata, contents, embedding FROM test_embedding;
INSERT INTO test_embedding (metadata, contents, embedding) SELECT metadata, contents, embedding FROM test_embedding;
INSERT INTO test_embedding (metadata, contents, embedding) SELECT metadata, contents, embedding FROM test_embedding;
INSERT INTO test_embedding (metadata, contents, embedding) SELECT metadata, contents, embedding FROM test_embedding;
EOF

replicate the problem

Here we create the index, then run two different queries, each with and without EXPLAIN ANALYZE to confirm the diskann index is being used. The EXPLAIN ANALYZE output also lets us confirm that when we get 0 rows back instead of 10, or when the count is wrong, it's because of the number of rows returned by the index scan (the rows= figure on the Index Scan line).

create the index.

CREATE INDEX test_embedding_diskann ON test_embedding USING diskann (embedding);

replicate the problem using a simple outer count(*) query with no filters:

explain analyze SELECT COUNT(*) FROM (SELECT *, embedding <=> '[0.11, 0.29, 0.61]' as distance FROM "test_embedding" ORDER BY distance LIMIT 2000000) as x;
SELECT COUNT(*) FROM (SELECT *, embedding <=> '[0.11, 0.29, 0.61]' as distance FROM "test_embedding" ORDER BY distance LIMIT 2000000) as x;

replicate the problem using an outer query that checks filters, intended to duplicate the enterprise use case in which a filter is highly correlated with relevance but the most relevant documents are filtered out for permissions or other reasons:

explain analyze SELECT x.* FROM
(
       SELECT *, embedding <=> '[0.11, 0.29, 0.61]' as distance, metadata->>'name' = 'test3' as filter1, to_tsvector(contents) @@ plainto_tsquery('world3') as filter2
       FROM "test_embedding" ORDER BY distance LIMIT 5000000
) as x
WHERE filter1 = true AND filter2 = true
LIMIT 10
;
SELECT x.* FROM
(
       SELECT *, embedding <=> '[0.11, 0.29, 0.61]' as distance, metadata->>'name' = 'test3' as filter1, to_tsvector(contents) @@ plainto_tsquery('world3') as filter2
       FROM "test_embedding" ORDER BY distance LIMIT 5000000
) as x
WHERE filter1 = true AND filter2 = true
LIMIT 10
;

exploring the parameter space

If you repeatedly drop the index and rebuild it with WITH (num_neighbors = $X) for different values of X, here is what you should get, using the dataset and steps above (a sketch of one way to script this sweep appears below the table):

num_neighbors, max rows
50, 51
100, 101
101, 102
102, 4092
103, 4096
104, 12288
105, 24576
106, 32475
107, 32478
108, 32468
109, 32768
110, 32768
112, 32768

see https://github.com/timescale/pgvectorscale/issues/109#issuecomment-2227620018
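
For convenience, here is a minimal sketch of one way to script that sweep, assuming the table, index name, and query vector from the replication steps above; confirm with EXPLAIN ANALYZE that the planner actually chooses the diskann index scan for each run.

for x in 50 100 101 102 103 104 105 106 107 108 109 110 112; do
psql << EOF
-- rebuild the index with the num_neighbors value under test
DROP INDEX IF EXISTS test_embedding_diskann;
CREATE INDEX test_embedding_diskann ON test_embedding USING diskann (embedding) WITH (num_neighbors = $x);
-- "max rows" is how many rows the index scan returns before terminating early;
-- 32768 would be the whole table with the steps above
SELECT $x AS num_neighbors, COUNT(*) AS max_rows
FROM (SELECT * FROM test_embedding ORDER BY embedding <=> '[0.11, 0.29, 0.61]' LIMIT 2000000) AS t;
EOF
done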

mitigation

Setting num_neighbors to 109 fixed the issue for my test datasets so far, e.g.:

CREATE INDEX test_embedding_diskann_109 ON test_embedding USING diskann (embedding) WITH (num_neighbors=109);

cevian commented 1 month ago

@mgrosso sorry for the delay in responding. My initial impression is that this is a very weird, non-representative dataset: first because the number of dimensions is only 3, and second because the rows repeat a lot. The latter especially is problematic because there will be a lot of rows with a distance of 0. The way the graph algorithm works is that all nodes store their closest neighboring nodes first, and then the further nodes after that. Given that there would be 4096 nodes with a distance of 0 and a num_neighbors of only 50, you can see how that may be a problem. What I am guessing is happening here is actually that the graph is becoming disconnected.
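
(As a quick way to see that premise on the replication dataset above, one can count how many copies of each distinct embedding exist; this is a sketch against the test table from the steps above.)

psql << EOF
-- after doubling the 8 seed rows 12 times, each distinct embedding appears 4096 times,
-- so a node's 50 nearest neighbors can all be exact duplicates at distance 0
SELECT embedding::text AS embedding, COUNT(*) AS copies
FROM test_embedding
GROUP BY 1
ORDER BY 2 DESC;
EOF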

Before we debug further may I ask if this is the actual dataset you plan to use and if so, why? I am guessing you simply won't see the issues if you use a real embedding dataset.

mgrosso commented 1 month ago

Hi @cevian,

This is definitely a test dataset intended to prove that we can filter past a large number of results to get to the best results (approximately) even when a large number of closer results exist that don't match the filter, since we do expect that scenario to happen sometimes for our use case.

I'll reconstruct test data using some random noise so we don't have large numbers of dupes but we still challenge the ability to filter past a sizable percentage of closer vectors. If that reproduces, then I can see if the vector dimension makes a difference since I expect to use more typical embedding dimensions. Using a dimension of three just made the test cases easier to construct.
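
(As an illustration of that plan only, not the exact data I'll use: a minimal sketch that keeps 8 clusters of 4096 rows each but jitters every element so there are no exact duplicates, reusing the dimension-3 table from above; the noise scale is arbitrary.)

psql << EOF
-- rebuild the 32768-row test set as 8 clusters of 4096 near-duplicates each,
-- with small per-element noise so no two embeddings are exactly equal
TRUNCATE test_embedding;
INSERT INTO test_embedding (metadata, contents, embedding)
SELECT format('{"name": "test%s"}', c)::jsonb,
       format('hello world%s', c),
       format('[%s,%s,%s]',
              x + (random() - 0.5) / 1000.0,
              y + (random() - 0.5) / 1000.0,
              z + (random() - 0.5) / 1000.0)::vector
FROM (VALUES (1, 0.1,  0.3,  0.6 ), (2, 0.2,  0.4,  0.4 ),
             (3, 0.01, 0.2,  0.79), (4, 0.98, 0.01, 0.01),
             (5, 0.15, 0.35, 0.5 ), (6, 0.25, 0.45, 0.3 ),
             (7, 0.01, 0.25, 0.74), (8, 0.88, 0.05, 0.05)) AS centers(c, x, y, z)
CROSS JOIN generate_series(1, 4096);
-- then drop and recreate the diskann index as above before re-running the queries
EOF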

side note: I wouldn't expect to see this issue with the benchmark data sets I've found so far, but I also haven't found any that I feel do a good job of representing the relevance-correlated filtering weirdness of enterprise data at scale. I was specifically worried about the ability to filter past a large number of rows in a diskann index scan when necessary, having been burned with HNSW.

mgrosso commented 1 month ago

@cevian I've replicated the issue with a dimension of 128, but keeping the data in 8 clusters to ensure we can replicate the main issue where the query has to find rows from the second most relevant cluster, requiring a significant number of rows to be returned by the index scan.

The SELECT COUNT(*) FROM (...) query plan stopped using the index at this dimension, but the query that matters, which has to find and return rows from the second most relevant cluster, still uses the diskann index and fails to return rows with the default num_neighbors.

To prevent any distances of 0, I've added jitter to every element. I'll add download links and detailed SQL steps and output shortly, as well as code to regenerate the data set.

mgrosso commented 1 month ago

@cevian I've provided steps to replicate, test data, and code to regenerate the test data all here: https://github.com/mgrosso/issue110-pgvectorscale since it was too big for a gist.

cevian commented 1 month ago

@mgrosso thanks for that thorough replication. This line

Index Scan using test_embedding_dim128_diskann on test_embedding_dim128  (cost=82.46..27361.82 rows=32768 width=573) (actual time=0.162..0.337 rows=51 loops=1)

actually gives me a good clue as to what is happening, because it returns only 51 rows. I think the first node in the graph has a cluster that's cut off from the rest of the graph. While the algorithm in general has no guarantees of connectedness, I think there should be something I can do to fix this issue (I think it's an edge case somewhere). Let me take a look and think about this a bit.

cevian commented 1 month ago

Just a note: I am still working on this

cevian commented 1 month ago

@mgrosso let me know if you can test PR #111. I tested with the dataset in the link above, and it seems to fix the problem, but a confirmation from you would be nice.

raulcarlomagno commented 1 month ago

The 51-rows-returned issue happened to me as well. I thought the index creation was interrupted somehow, so I indexed again, but it happened again.

cevian commented 1 month ago

@raulcarlomagno is that on the release or after PR #111?

raulcarlomagno commented 1 month ago

In 0.2.0.

raulcarlomagno commented 1 month ago

I can add more information: it happened with a 30-million-record table having 14 partitions. The biggest partition has 20 million records, and this one is returning 51 records:

Index Scan using table_office_cn_embedding_idx on table_office_cn table_3 (cost=205900.77..77927506.77 rows=20436960 width=29) (actual time=0.118..0.118 rows=1 loops=1)

mgrosso commented 1 month ago

@cevian I'll definitely give this a test this weekend just to be sure, but it sounds like you were able to replicate the problem with more or less the same test data and repair it, so I'm sure I'll get the same positive results. @raulcarlomagno, with the test data I used to replicate this issue, rebuilding the index with num_neighbors set to 109 or more fixed the problem. If you need a workaround before PR #111 is merged and can't try that branch, then I recommend giving the index rebuild with a larger num_neighbors a try; it might tide you over until the next release. A sketch of such a rebuild follows.
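
(For reference, a sketch of what that rebuild might look like on the partitioned table reported above, using the index and table names from the EXPLAIN output and assuming the vector column is named embedding; the exact steps depend on how the partitioned index was created.)

psql << EOF
-- drop the existing diskann index and rebuild it with a larger num_neighbors
DROP INDEX IF EXISTS table_office_cn_embedding_idx;
CREATE INDEX table_office_cn_embedding_idx
    ON table_office_cn USING diskann (embedding)
    WITH (num_neighbors = 109);
EOF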

mgrosso commented 3 weeks ago

@cevian: PR #111 fixed the problem for me. Thank you.

mgrosso commented 3 weeks ago

Please feel free to close this when that gets merged, or whenever it makes sense for your current process.

cevian commented 1 week ago

The fix for this has been released in 0.3.0.