Open msk-apk opened 1 month ago
Any reason for not getting the expected recall when doing parallel indexing? We can't rebuild the index periodically, right?
@msk-apk I think the issue here is actually with the SBQ compression algorithm. It learns about the distribution of the data when the index is first built. So building the index on the empty table will produce much worse results than building it on a table that has data already in it.
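If SBQ learns its parameters at build time, one workaround is to seed the table with a representative sample before creating the index, so the compression sees the real distribution. A rough sketch, assuming a hypothetical sample file path and the column layout from the example below:

```sql
-- Load a representative sample first so SBQ learns the real distribution
-- ('/tmp/sample.csv' is a hypothetical path).
COPY document_embedding (id, embedding) FROM '/tmp/sample.csv' WITH (FORMAT csv);

-- Build the index on the seeded table, then run the remaining bulk COPYs
-- in parallel afterwards.
CREATE INDEX document_embedding_idx ON document_embedding USING diskann (embedding);
```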
There are many solutions we could try to implement but I think rebuilding the index periodically is the best approach for now.
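A periodic rebuild can be done without blocking reads or writes using PostgreSQL's `REINDEX ... CONCURRENTLY` (available since PostgreSQL 12), which would let SBQ relearn from the now-populated table; index name taken from the example below:

```sql
-- Rebuild the index in the background; readers and writers are not blocked.
REINDEX INDEX CONCURRENTLY document_embedding_idx;
```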
Yeah, it seems that is the case here. Changing the storage type to plain resolves the issue.
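For reference, plain storage skips SBQ compression entirely, so there is no learned distribution to go stale. A sketch of the index definition; the exact parameter name (`storage_layout` here) may differ between pgvectorscale versions:

```sql
-- Plain storage: no SBQ compression, at the cost of a larger index.
CREATE INDEX document_embedding_idx ON document_embedding
USING diskann (embedding)
WITH (storage_layout = 'plain');
```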
I executed the following two test cases to replicate the scale test as documented in pgvectorscale.
CREATE INDEX document_embedding_idx ON document_embedding USING diskann (embedding);
After creating the index, I opened 40 connections, each inserting 500,000 entries into the search_item table with the COPY command. All COPY commands completed in 90 minutes. In this approach, 40 clients insert data concurrently and the diskann index is updated in parallel by 40 processes. In total, search_item has 20M entries.
I then queried the search_item table with QUERY_RESULT_SIZE set to 100. I get results quickly, within 50 ms, but the recall is very low, around 40%.
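For context, recall here is typically measured by comparing the approximate top-k against an exact top-k. A minimal sketch of getting the exact ground truth, assuming pgvector's `<=>` cosine-distance operator and a hypothetical `:query` parameter:

```sql
-- Exact ground truth: disable index scans so ORDER BY falls back to a
-- full scan with exact distances.
BEGIN;
SET LOCAL enable_indexscan = off;
SELECT id FROM search_item ORDER BY embedding <=> :query LIMIT 100;
COMMIT;
-- recall@100 = |approx_top100 ∩ exact_top100| / 100
```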
CREATE INDEX document_embedding_idx ON document_embedding USING diskann (embedding);
This time it took a long time to build the index:
DEBUG: Writing took 82451.193284036s or 0.0041225596642018s/tuple. Avg neighbors: 50
DEBUG: When pruned for cleanup: avg neighbors before/after 56/0 of 18461082 prunes
WARNING: Indexed 20000000 tuples
DEBUG: EventTriggerInvoke 16788
But when I searched with the same QUERY_RESULT_SIZE, I got around 99% recall with a 90 ms mean query time.
The question is why the recall is so low in the first test. Is it possible to improve the recall percentage in the first scenario?