parallel indexing is not giving the same recall percentage as serial indexing

msk-apk commented 1 month ago

I executed the two following test cases to replicate the scale test as documented in pgvectorscale

created a table search_item with a vector column of dimension 768. Created the diskann index with the following command.

CREATE INDEX document_embedding_idx ON document_embedding USING diskann (embedding);

After creating the index, connected 40 connections each inserting 500000 entries in the search_item table with copy command. All copy commands completed in 90 minutes. In this approach, 40 clients are concurrently inserting data and diskann index is updated parallally by 40 processes. Totally search_item has 20 M entries.

Tried querying search_item table with QUERY_RESULT_SIZE as 100. I am able to get the result quickly with in 50 ms but the recall percentage is very low its around 40%

created a table search_item with a vector of dimension 768. Connected 40 clients to insert 20 M data into this table. Now created the index (i.e after updating the data to the table search_item) with the same command.

CREATE INDEX document_embedding_idx ON document_embedding USING diskann (embedding);

This time it took long time to build the index

DEBUG: Writing took 82451.193284036s or 0.0041225596642018s/tuple. Avg neighbors: 50 DEBUG: When pruned for cleanup: avg neighbors before/after 56/0 of 18461082 prunes WARNING: Indexed 20000000 tuples DEBUG: EventTriggerInvoke 16788

But when I search with the same QUERY_RESULT_SIZE, i got around 99% recall with 90 ms mean query time.

The question is why the recall is very low in the first test, is it possible to improve the recall percentage in the first scenario?

msk-apk commented 1 month ago

Any reasons for not getting the expected recall on doing parallel indexing? We cant build the index periodically right?

cevian commented 1 month ago

@msk-apk I think the issue here is actually with the SBQ compression algorithm. It learns about the distribution of the data when the index is first built. So building the index on the empty table will produce much worse results than building it on a table that has data already in it.

There are many solutions we could try to implement but I think rebuilding the index periodically is the best approach for now.

msk-apk commented 1 month ago

yeah seems that is the case here. changing storage_type as plain, resolves the issue.

timescale / pgvectorscale

parallel indexing is not giving the same recall percentage as serial indexing #108