tensorchord / pgvecto.rs

Scalable, Low-latency and Hybrid-enabled Vector Search in Postgres. Revolutionize Vector Search, not Database.
https://docs.pgvecto.rs/getting-started/overview.html
Apache License 2.0
1.75k stars 71 forks

Why is the IVF index so big? #601

Open huaimin016 opened 1 month ago

huaimin016 commented 1 month ago

I ran select * from pg_vector_index_stat; to check the index size. Here is the result:

19176892    19177456    pgvecto_rs_90_ivf   pgvecto_rs_90_ivf_vector_idx    NORMAL  false   2781540 {2781540}   {}  0   1580896555  {"vector":{"dimensions":512,"vector":"Veci8","distance":"Cos"},"segment":{"max_growing_segment_size":20000,"max_sealed_segment_size":30000000},"indexing":{"ivf":{"least_iterations":16,"iterations":500,"nlist":1000,"nsample":65536,"quantization":{"trivial":{}}}}}

On disk:

postgres@dddddd:~/16/main/pg_vectors/indexes/0000000000000000000000000000000066d137b44b7ae9220000000501249ff0$ du . -h
21M     ./segments/51f8e269-845f-4154-8558-629302f1fa7f/indexing/quantization
1.5G    ./segments/51f8e269-845f-4154-8558-629302f1fa7f/indexing/raw
1.5G    ./segments/51f8e269-845f-4154-8558-629302f1fa7f/indexing
1.5G    ./segments/51f8e269-845f-4154-8558-629302f1fa7f
1.5G    ./segments
8.0K    ./startup
1.5G    .

Why is the raw index data needed? It increases memory usage a lot. Is it necessary to store the raw index data, and will this data be loaded into memory?
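
For context, here is a rough back-of-the-envelope estimate of how much space the raw vectors alone would take. It is only a sketch under stated assumptions: that 2781540 in the stat row is the number of indexed vectors, that 1580896555 is the index size in bytes, and that Veci8 stores one 8-bit integer per dimension. None of these are confirmed by the output itself.

```python
# Rough estimate of raw vector storage for the index shown above.
num_vectors = 2_781_540               # assumed: indexed tuple count from the stat row
dimensions = 512                      # "dimensions" in the index options
bytes_per_element = 1                 # ASSUMPTION: Veci8 uses one 8-bit integer per dimension
reported_index_bytes = 1_580_896_555  # assumed: index size in bytes from the stat row

# Raw storage is simply count x dimensions x element width.
raw_bytes = num_vectors * dimensions * bytes_per_element
raw_gib = raw_bytes / 2**30
share = raw_bytes / reported_index_bytes

print(f"raw vectors: {raw_bytes} bytes ({raw_gib:.2f} GiB)")
print(f"share of reported index size: {share:.0%}")
```

Under these assumptions the raw vectors come to about 1.33 GiB, roughly 90% of the reported index size, which would be in line with the 1.5G that du shows for the raw directory.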