Closed mgrosso closed 1 week ago
@mgrosso sorry for the delay in responding. My initial impression is that this is a very weird, non-representative dataset: first because the number of dimensions is only 3, and second because the rows repeat a lot. The latter especially is problematic because there will be a lot of rows with a distance of 0. The way the graph algorithm works is that all nodes store their closest neighboring nodes first, and then the further nodes after that. Given that there would be 4096 nodes with a distance of 0 and a `num_neighbors` of 50, you can see how that may be a problem. What I am guessing is happening here is actually that the graph is becoming disconnected.
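(For reference, the duplication is easy to confirm with a grouped count; the table and column names here are assumptions based on the rest of the thread:)

```sql
-- each distinct vector appears thousands of times, so each node has
-- thousands of candidate neighbors at distance 0
SELECT embedding, count(*) AS copies
FROM test_embedding
GROUP BY embedding
ORDER BY copies DESC
LIMIT 8;
```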
Before we debug further may I ask if this is the actual dataset you plan to use and if so, why? I am guessing you simply won't see the issues if you use a real embedding dataset.
Hi @cevian,
This is definitely a test dataset intended to prove that we can filter past a large number of results to get to the best results (approximately) even when a large number of closer results exist that don't match the filter, since we do expect that scenario to happen sometimes for our use case.
I'll reconstruct test data using some random noise so we don't have large numbers of dupes but we still challenge the ability to filter past a sizable percentage of closer vectors. If that reproduces, then I can see if the vector dimension makes a difference since I expect to use more typical embedding dimensions. Using a dimension of three just made the test cases easier to construct.
side note: I wouldn't expect to see this issue with the benchmark data sets I've found so far, but I also haven't found any that I feel do a good job of representing the relevance correlated filtering weirdness of enterprise data at scale, and I was specifically worried about the ability to filter past a large number of rows in a diskann index scan when necessary, having been burnt with HNSW.
@cevian I've replicated the issue with a dimension of 128, but keeping the data in 8 clusters to ensure we can replicate the main issue where the query has to find rows from the second most relevant cluster, requiring a significant number of rows to be returned by the index scan.
The `select count(*) from (...)` query plan stopped using the index at this dimension, but the query that matters, the one that has to find and return rows from the second most relevant cluster, still uses the diskann index and fails to return rows with the default `num_neighbors`.
To prevent any distances of 0 I've added jitter to every element. I'll add download links and detailed SQL steps and output shortly, as well as code to regenerate the data set.
@cevian I've provided steps to replicate, test data, and code to regenerate the test data all here: https://github.com/mgrosso/issue110-pgvectorscale since it was too big for a gist.
@mgrosso thanks for that thorough replication. This line
Index Scan using test_embedding_dim128_diskann on test_embedding_dim128 (cost=82.46..27361.82 rows=32768 width=573) (actual time=0.162..0.337 rows=51 loops=1)
actually gives me a good clue as to what is happening because it returns only 51 rows. I think the first node in the graph has a cluster that's cut off from the rest of the graph. While the algorithm in general has no guarantees of connectedness, I think there should be something I can do to fix this issue (I think it's an edge case somewhere). Let me take a look and think about this a bit.
Just a note: I am still working on this
@mgrosso let me know if you can test PR #111 . I tested with the dataset in the link above, and it seems to fix the problem but a confirmation from you would be nice
The 51-rows-returned issue happened to me as well. I thought that index creation was interrupted somehow, so I indexed again, but it happened again.
@raulcarlomagno is that on the release or after pr #111 ?
> @raulcarlomagno is that on the release or after pr #111 ?

in 0.2.0
I can add more information: it happened with a 30-million-record table having 14 partitions; the biggest partition has 20 million records, and this one is returning 51 records.
Index Scan using table_office_cn_embedding_idx on table_office_cn table_3 (cost=205900.77..77927506.77 rows=20436960 width=29) (actual time=0.118..0.118 rows=1 loops=1)
@cevian I'll definitely give this a test this weekend just to be sure, but it sounds like you were able to replicate the problem with more or less the same test data and repair it so I'm sure I'll get the same positive results.
@raulcarlomagno, with the test data I used to replicate this issue, rebuilding the index with `num_neighbors` set to 109 or more fixed the problem. If you need a workaround before PR #111 is merged and can't try that branch, then I recommend giving the index rebuild with a larger `num_neighbors` a try. It might tide you over until the next release.
@cevian : PR #111 fixed the problem for me. thank you.
please feel free to close this when that gets merged or whenever it makes sense for your current process.
The fix to this has been released in 0.3.0
(Note: This bug was discovered while documenting https://github.com/timescale/pgvectorscale/issues/109 and most of my first 13 comments there relate to this.)
index scan terminates early
summary
With the default `num_neighbors`, an index scan of a diskann index will terminate after 51 rows; a query in which that index scan is a child of a filter that filters out the 51 most relevant rows will not return any rows at all. With `num_neighbors` build settings from 102 through 108, the maximum rows returned by the index scan increases in a non-linear and non-monotonic fashion detailed below, at least with the replication steps I followed. Setting `num_neighbors` to 109 fixes the issue for the replication steps I went through, even when the number of rows in the table was increased to 2^20.

replication steps
tl;dr: we insert 8 distinct rows, then double them until we have 32768, giving us 4k relevant rows to filter before we get to the less relevant rows that match our filter.
test data set
Those initial rows will look like this:
I created them more or less like this:
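The exact seed vectors and generation code are in the linked repository; as a sketch (the seed values and table name here are illustrative assumptions, using pgvector's `vector` type):

```sql
-- 8 distinct seed vectors, one per cluster, then repeated doubling
-- until the table holds 32768 rows (4096 copies of each seed)
CREATE TABLE test_embedding (id bigserial PRIMARY KEY, embedding vector(3));

INSERT INTO test_embedding (embedding) VALUES
  ('[1,0,0]'), ('[2,0,0]'), ('[3,0,0]'), ('[4,0,0]'),
  ('[5,0,0]'), ('[6,0,0]'), ('[7,0,0]'), ('[8,0,0]');

-- double the row count 12 times: 8 * 2^12 = 32768 rows
DO $$
BEGIN
  FOR i IN 1..12 LOOP
    INSERT INTO test_embedding (embedding)
      SELECT embedding FROM test_embedding;
  END LOOP;
END $$;
```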
replicate the problem
here we create the index, and then run two different queries, each of those two queries with and without `EXPLAIN ANALYZE`, to confirm the diskann index is being used. The `EXPLAIN ANALYZE` output also lets us confirm that when we get 0 rows back instead of 10, or when the count is wrong, it's because of the number of rows returned by the index scan.

create the index.
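A default-settings build might look like this (table and column names are assumptions based on the plan output earlier in the thread):

```sql
-- build the diskann index with default options,
-- where the bug reproduces
CREATE INDEX test_embedding_diskann ON test_embedding
  USING diskann (embedding);
```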
replicate the problem using a simple outer count(*) query with no filters:
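The actual query is in the linked repository; a sketch of its shape (the query vector and limit are illustrative assumptions):

```sql
-- without the fix, the inner index scan stops after 51 rows,
-- so the count comes back far below the table's 32768 rows
SELECT count(*) FROM (
    SELECT id
    FROM test_embedding
    ORDER BY embedding <=> '[1,0,0]'
    LIMIT 40000
) AS sub;
```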
replicate the problem using an outer query that checks filters, intended to duplicate the enterprise use case in which a filter is highly correlated with relevance but the most relevant documents are filtered for permissions or other reasons:
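Again, the real query is in the linked repository; a sketch with a hypothetical permission column standing in for the relevance-correlated filter:

```sql
-- the filter excludes the most relevant cluster, so the scan must
-- walk past roughly 4096 closer rows to reach matching ones;
-- without the fix this returns 0 rows instead of 10
SELECT id
FROM test_embedding
WHERE allowed = true            -- hypothetical permission filter
ORDER BY embedding <=> '[1,0,0]'
LIMIT 10;
```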
exploring the parameter space
if you repeatedly drop the index and rebuild it using `WITH (num_neighbors = $X)` for different values of X, here is what you should get, using the dataset and steps above: see https://github.com/timescale/pgvectorscale/issues/109#issuecomment-2227620018
mitigation
setting `num_neighbors` to 109 fixed the issue for my test datasets so far, e.g.:

```sql
CREATE INDEX test_embedding_diskann_109 ON test_embedding
  USING diskann (embedding) WITH (num_neighbors=109);
```