timescale / pgvectorscale

A complement to pgvector for high performance, cost efficient vector search on large workloads.
PostgreSQL License

Question about memory_optimized storage layout #122

Open agandra30 opened 3 weeks ago

agandra30 commented 3 weeks ago

I need some input. I hit an error even after setting storage_layout='plain'. My understanding is that with plain the index should not use SBQ or set num_bits_per_dimension=2.

My dataset is Cohere with 768 dimensions. Python 3.11, Postgres 16.

    Name     | Version |   Schema   | Description
-------------+---------+------------+----------------------------------------------------------------------------------------------
 plpgsql     | 1.0     | pg_catalog | PL/pgSQL procedural language
 vector      | 0.7.4   | public     | vector data type and ivfflat and hnsw access methods
 vectors     | 0.3.0   | vectors    | vectors: Vector database plugin for Postgres, written in Rust, specifically designed for LLM
 vectorscale | 0.3.0   | public     | pgvectorscale: Advanced indexing for vector data

psycopg.errors.InternalError_: SBQ with more than 1 bit per dimension is only supported with the memory_optimized storage layout.

Is it required to set num_bits_per_dimension and also to use only storage_layout='memory_optimized'?

Thank you in advance

cevian commented 3 weeks ago

@agandra30 this is indeed a bug. You can get around it by setting num_bits_per_dimension=1 explicitly when using storage_layout='plain'. I'll submit a PR to fix this soon.
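As a rough illustration of that workaround, the sketch below builds a CREATE INDEX statement that sets num_bits_per_dimension=1 explicitly alongside storage_layout='plain'. The helper name diskann_index_sql is hypothetical, the table, column, and index names are taken from this thread, and the operator class is left at its default. The helper does no escaping, so it is only suitable for trusted, hard-coded names.

```python
# Sketch of the workaround: pin num_bits_per_dimension to 1 when using
# storage_layout='plain'. diskann_index_sql is a hypothetical helper.

def diskann_index_sql(table, column, index_name, **options):
    """Build a CREATE INDEX statement for the diskann access method."""
    # Render string values single-quoted and numeric values bare.
    rendered = ", ".join(
        f"{key} = '{value}'" if isinstance(value, str) else f"{key} = {value}"
        for key, value in options.items()
    )
    return (
        f'CREATE INDEX IF NOT EXISTS "{index_name}" '
        f'ON "{table}" USING diskann ("{column}") '
        f"WITH ({rendered});"
    )


sql = diskann_index_sql(
    "pg_vectorscale_collection",
    "embedding",
    "pgvectorscale_index",
    storage_layout="plain",
    num_bits_per_dimension=1,  # explicit, to sidestep the bug described above
)
print(sql)
```

The resulting string can be run from psql or passed to a psycopg cursor's execute().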

agandra30 commented 3 weeks ago

@cevian, thanks for addressing the problem.

My observation is that even after setting num_bits_per_dimension=1 there is not much progress: the index creation just gets stuck and shows no progress for hours.

I even tried storage_layout='memory_optimized', and it also fails to create the index and hangs for a really long time.

I'm not sure whether any optimisations need to be set on the DB side. I'm concerned this could be a hindrance in production use cases.

pgrustscale=# SELECT COUNT(*) FROM pg_vectorscale_collection;
  count  
---------
 1000000
(1 row)

pgrustscale=# CREATE INDEX IF NOT EXISTS  "pgvectorscale_index"  ON public. "pg_vectorscale_collection"  
            USING  "diskann"  (embedding  "vector_cosine_ops" )
             WITH ( "storage_layout" = "memory_optimized", "num_neighbors" = "50", "search_list_size" = "100", "max_alpha" = "1.2", "num_bits_per_dimension" = "1"   );
NOTICE:  Starting index build. num_neighbors=50 search_list_size=100, max_alpha=1.2, storage_layout=SbqCompression

cevian commented 3 weeks ago

A million vectors can take a while to index.

If you do SET client_min_messages = DEBUG1; before the create index statement you should see progress information.

Also, I would replace USING "diskann" (embedding "vector_cosine_ops") with USING "diskann" (embedding). I never use the ops in the statement, and I don't know if it messes anything up.
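Since the original error came from psycopg, here is a minimal sketch of both suggestions applied from Python. It assumes conn is a psycopg 3 connection; the function name build_index_with_progress is hypothetical, and the index options are copied from the statement earlier in this thread. The notice handler prints whatever messages the server sends, which should include the DEBUG1 progress output once client_min_messages is lowered.

```python
# Sketch: enable DEBUG1 messages, then build the index, logging server
# notices as they arrive. `conn` is assumed to be a psycopg 3 connection.

CREATE_INDEX = (
    'CREATE INDEX IF NOT EXISTS "pgvectorscale_index" '
    'ON public."pg_vectorscale_collection" '
    "USING diskann (embedding) "  # operator class omitted, as suggested
    "WITH (storage_layout = 'memory_optimized', num_neighbors = 50, "
    "search_list_size = 100, max_alpha = 1.2, num_bits_per_dimension = 1)"
)


def build_index_with_progress(conn, log=print):
    # Server notices (including DEBUG-level ones) are passed to this handler.
    conn.add_notice_handler(
        lambda diag: log(f"{diag.severity}: {diag.message_primary}")
    )
    conn.execute("SET client_min_messages = DEBUG1")
    conn.execute(CREATE_INDEX)
    conn.commit()
```

A possible call site, assuming the database name shown in the psql prompt above: open the connection with psycopg.connect("dbname=pgrustscale") and pass it to build_index_with_progress(conn).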