timescale / pgvectorscale

A complement to pgvector for high performance, cost efficient vector search on large workloads.

Recommendations for tuning with large dataset #99

Closed · markwillowtree closed this issue 3 weeks ago

markwillowtree commented 3 weeks ago

I have a table with 170 million rows and a vector(512) column.

I started building the index with the default settings, but after 25 minutes it appears to be only 1% of the way through building the graph!

```sql
SELECT phase,
       round(100.0 * blocks_done / nullif(blocks_total, 0), 3) AS "%"
FROM pg_stat_progress_create_index;
```

What index settings would be suitable for such a large amount of data?

jgpruitt commented 3 weeks ago

At the moment, indexing millions of vectors is going to be a resource- and time-intensive operation regardless of the vector database technology you choose.

As of version 0.2.0, a 50 million vector dataset of 768 dimensions took roughly 2 days to index on an unpartitioned table with the default index build parameters, on a machine with 64 CPUs, 128 GB of RAM, and the data directory placed on a local SSD.

For comparison, a considerably more expensive Pinecone setup took roughly 1 day to index the same dataset. However, it performed worse than pgvectorscale at query time.

See our blog post for more details.

You can experiment with the index build parameters, which may speed things up, but you will likely trade off some accuracy at query time. This may or may not be acceptable to you.
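For example, lowering num_neighbors and search_list_size produces a sparser graph that builds faster, at some cost to recall. A minimal sketch, assuming a hypothetical items table with an embedding vector(512) column; the values shown are illustrative starting points, not recommendations (the documented defaults are num_neighbors = 50 and search_list_size = 100):

```sql
-- Illustrative only: a diskann index with smaller-than-default graph
-- parameters. Lower values speed up the build but will likely reduce
-- recall at query time.
CREATE INDEX items_embedding_idx ON items
USING diskann (embedding vector_cosine_ops)
WITH (num_neighbors = 30, search_list_size = 50);
```

Depending on your version, you may also be able to claw back some accuracy after the build with query-time settings such as diskann.query_search_list_size, at the cost of query latency.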

You can also try partitioning the table and building indexes on the partitions in parallel. This will also likely reduce accuracy, but may be acceptable for your use case. Smaller partitions index more quickly, but the smaller each partition, the larger the hit to accuracy.
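A sketch of the parallel-partition approach, assuming hash partitioning on a hypothetical id column, with illustrative names and a partition count of 4; each index build is issued from its own connection so they run concurrently:

```sql
-- Hash-partition the table so each partition holds a slice of the vectors.
CREATE TABLE items (
    id        bigint,
    embedding vector(512)
) PARTITION BY HASH (id);

CREATE TABLE items_p0 PARTITION OF items FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE items_p1 PARTITION OF items FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE items_p2 PARTITION OF items FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE items_p3 PARTITION OF items FOR VALUES WITH (MODULUS 4, REMAINDER 3);

-- Run each of these from a separate session so the builds proceed in parallel:
CREATE INDEX ON items_p0 USING diskann (embedding vector_cosine_ops);
CREATE INDEX ON items_p1 USING diskann (embedding vector_cosine_ops);
CREATE INDEX ON items_p2 USING diskann (embedding vector_cosine_ops);
CREATE INDEX ON items_p3 USING diskann (embedding vector_cosine_ops);
```

Queries against the parent table will search each partition's index and merge the results. More partitions mean faster individual builds, but each partition's graph only sees its own slice of the data, which is the accuracy trade-off described above.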

markwillowtree commented 3 weeks ago

That's extremely helpful, thanks John.