milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

Relation between create_index and training of centroids and PQ encoding for IVFPQ #2214

Closed AyushP123 closed 4 years ago

AyushP123 commented 4 years ago

Hi,

Thank you for your good work. I have a few questions about create_index, specific to IVF_PQ. From what I understand, if I have inserted 10 million vectors into a collection (all 256-dimensional) and then run:

index_param = {
        "m": 16,
        "nlist": 1024
}

status = client.create_index(collection_name, IndexType.IVF_PQ, index_param)

an IVFPQ index will be trained on the 10 million vectors.

  1. If index_file_size for the collection is set to 4096 GB and I add an additional 4 million vectors (so index_file_size will be reached), is create_index called on the 14 million vectors, i.e. are the centroids and PQ encodings retrained on all 14 million vectors?
  2. If the answer to the first question is no: say I call create_index after inserting 10 million vectors into a collection with nlist = 1000. Do the centroids and the encodings remain the same no matter how many vectors I insert into the collection afterwards?
  3. Is there any way to extract the encodings and the centroids of the IVF_PQ index from Milvus?
yhmo commented 4 years ago

Hi Ayush,

  1. Assume you set index_file_size=4096MB and insert 10 million vectors. Before you invoke create_index, the 10 million vectors are written to a raw-data file 'a' (even though the file size has not reached 4096MB). After create_index completes, Milvus trains the index and generates a separate index file 'b'. If you then insert an additional 4 million vectors and invoke create_index again, files 'a' and 'b' are not affected: the 4 million vectors are written into another raw-data file 'c', and create_index trains a new index on just those 4 million vectors, generating another index file 'd'.
  2. Milvus uses the same index parameters to train each raw-data file, generating one index file per raw-data file. If you invoke create_index with nlist=1000 the first time and then invoke create_index again with nlist=2000, Milvus will rebuild the index for all raw-data files.
  3. Currently there is no API to extract the encodings and centroids of an IVF_PQ index.
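The per-file behavior described in answer 1 can be sketched as a toy model (plain Python, not the Milvus API; the class and field names are made up for illustration, and the file names follow the example above):

```python
# Toy model of Milvus raw-data files and per-file indexes: create_index
# only trains an index for raw-data files that do not have one yet;
# previously indexed files are left untouched.

class Collection:
    def __init__(self):
        self.raw_files = []   # each entry: {"name", "rows", "indexed"}

    def insert(self, rows, name):
        # Simplified: each insert batch lands in its own raw-data file.
        self.raw_files.append({"name": name, "rows": rows, "indexed": False})

    def create_index(self):
        # Train an index for every raw-data file that lacks one.
        built = []
        for f in self.raw_files:
            if not f["indexed"]:
                f["indexed"] = True
                built.append("index_of_" + f["name"])
        return built

coll = Collection()
coll.insert(10_000_000, "a")
print(coll.create_index())   # -> ['index_of_a']  (file 'a' gets its index)
coll.insert(4_000_000, "c")
print(coll.create_index())   # -> ['index_of_c']  (only the new file is indexed)
```

The second create_index call touches only file 'c', mirroring how files 'a' and 'b' stay intact in the example above.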
AyushP123 commented 4 years ago

Hi @yhmo, thank you for your response. I am currently working on making sure that single-query searches are implemented efficiently for streaming-data use cases.

Let me demonstrate my understanding of Milvus with the following example (each vector is 1 KB):

Based on my understanding, I wanted to ask a few more questions:

  1. Since index_file_size = 1024 and nlist = 1000, does this mean that 1000 centroids are trained on every 1 million vectors? That is, say I sequentially insert 10 million vectors into the collection; since the collection creates 10 segments from these 10 million vectors, are there 1000 x 10 centroids trained on the dataset?
  2. Since a small index_file_size reduces search speed and a large index_file_size leaves many entries without an index, wouldn't it be better to have one global index file, i.e. a single IVF_PQ index for searching all the entries in the collection, which the user could update manually?
yhmo commented 4 years ago

Your understanding of insert/create_index is basically correct. I want to explain more about insert: the insert action only copies vector data into a buffer; a background thread writes the buffer to disk every second. If a huge amount of data is inserted in one action, Milvus splits it into several data chunks, each at most 128MB. Each chunk is written to its own file, and these files are then merged with previous files. For example, if you first insert 800MB of data, it is split into 7 files: 128+128+128+128+128+128+32. If you then insert 500MB, it is split into 4 files: 128+128+128+116, and the background merge thread merges these files into two files: 1024+276. The reason for splitting data into chunks is to avoid uneven data distribution: for example, if a user first inserts 1000MB and then inserts 900MB, both are less than 1024MB, but merging them directly would produce a 1900MB file, which is too large.
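The chunking and merging above can be sketched as a small simulation (plain Python; the real merge policy lives in the Milvus background thread and may differ in detail, so the first-fit-decreasing packing below is an illustrative assumption that happens to reproduce the numbers in the example):

```python
CHUNK_MB = 128
INDEX_FILE_SIZE_MB = 1024

def split_into_chunks(total_mb):
    """Split one insert action into chunks of at most 128MB."""
    chunks = [CHUNK_MB] * (total_mb // CHUNK_MB)
    if total_mb % CHUNK_MB:
        chunks.append(total_mb % CHUNK_MB)
    return chunks

def merge(chunks, cap=INDEX_FILE_SIZE_MB):
    """Pack chunk files into merged files no larger than `cap`
    (first-fit decreasing -- an assumed stand-in for the real merge logic)."""
    files = []
    for c in sorted(chunks, reverse=True):
        for i, size in enumerate(files):
            if size + c <= cap:
                files[i] += c
                break
        else:
            files.append(c)   # no existing file has room: start a new one
    return files

first = split_into_chunks(800)    # [128, 128, 128, 128, 128, 128, 32]
second = split_into_chunks(500)   # [128, 128, 128, 116]
print(merge(first + second))      # -> [1024, 276]
```

The merged sizes match the example: one full 1024MB file plus a 276MB remainder that can still grow.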

For the 2 extra questions:

  1. Yes, 1000 centroids are trained on every 1 million vectors. The 10 million vectors are split into 10 raw-data files, and Milvus builds 10 index files for them; each index file has 1000 centroids. When the user performs a search, Milvus searches each index file, obtains 10 result sets, and reduces them to a final result set returned to the user.
  2. We plan to allow users to define a training model for the whole collection: the user creates a collection, specifies an index type, and defines a training model for that index type, and all inserted vectors then use that model to generate index files. This feature is in the design phase.
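The per-segment search plus reduce step in answer 1 can be sketched as follows (plain Python with brute-force toy distances; `heapq.nsmallest` stands in for Milvus's internal reduce, and the segment contents are made up for illustration):

```python
import heapq

def search_segment(segment, query, topk):
    """Toy per-segment search: brute-force squared-L2 over (id, vector) pairs."""
    scored = [(sum((a - b) ** 2 for a, b in zip(vec, query)), vid)
              for vid, vec in segment]
    return heapq.nsmallest(topk, scored)

def search_collection(segments, query, topk):
    """Search every segment independently, then reduce the per-segment
    result sets into one global top-k result set."""
    partial = []
    for seg in segments:
        partial.extend(search_segment(seg, query, topk))
    return heapq.nsmallest(topk, partial)

# Two tiny "segments" of 2-d vectors.
segments = [
    [(1, (0.0, 0.0)), (2, (1.0, 1.0))],
    [(3, (0.1, 0.0)), (4, (5.0, 5.0))],
]
print(search_collection(segments, (0.0, 0.0), topk=2))  # ids 1 and 3 win
```

Note that each segment contributes its own top-k candidates, so the final reduce never needs more than segments x topk entries.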
AyushP123 commented 4 years ago

Hi @yhmo, thank you so much for your response. I wanted to ask a few more questions:

  1. Suppose I insert 400 MB of data into an empty collection where index_file_size is 1024 MB. If I call create_index now, I understand that an index will be created on the vectors corresponding to that 400 MB; let's call this file 'a'. If I insert another 400 MB of data and call create_index, will the old file 'a' be destroyed and a new index created on all 800 MB of data, or will there be a new file 'b' for the new 400 MB while the old file 'a' is retained? Additionally, if I insert another 224 MB of data, i.e. reach index_file_size, will all the previously created index files get deleted and a new index created?
  2. Is there any way I can inspect the index files through SQL and see the centroids and encodings?
yhmo commented 4 years ago

@AyushP123 Sorry for late reply.

  1. For file 'a', its index has already been created, so it becomes 'immutable'. When you insert another 400MB of data and call create_index again, a new index file 'b' is created for the new 400MB of data; now you have two 'immutable' files, 'a' and 'b'. If you then insert another 224MB of data, this new file is 'mutable': it has no index, and it can keep being merged with newly inserted data until its size exceeds index_file_size, or you can call create_index again to force index creation for it.

  2. Currently there is no API to get the index centroids and encodings from the files. The SQL metadata only records per-file information such as 'create date', 'file_size', 'row_count', 'index_type', etc.
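The mutable/immutable lifecycle in answer 1 can be modeled in a few lines (toy Python, an assumption about observable behavior rather than Milvus internals; the class and method names are invented):

```python
INDEX_FILE_SIZE_MB = 1024

class SegmentFile:
    """A raw-data file: immutable once indexed or once it reaches
    index_file_size; only mutable files accept merged-in data."""
    def __init__(self, size_mb):
        self.size_mb = size_mb
        self.indexed = False

    @property
    def immutable(self):
        return self.indexed or self.size_mb >= INDEX_FILE_SIZE_MB

    def merge_new_data(self, size_mb):
        if self.immutable:
            raise ValueError("cannot merge into an immutable file")
        self.size_mb += size_mb

a = SegmentFile(400)
a.indexed = True          # create_index forced on the 400MB file 'a'
b = SegmentFile(400)
b.indexed = True          # second create_index: file 'b' also frozen
c = SegmentFile(224)      # no index yet: mutable, can still grow
c.merge_new_data(100)     # new inserts keep merging into 'c'
print(a.immutable, b.immutable, c.immutable)   # -> True True False
```

Files 'a' and 'b' stay frozen while 'c' keeps absorbing new data until it is either indexed explicitly or reaches index_file_size.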

AyushP123 commented 4 years ago

Hi @yhmo, thank you so much for your response; your answers have helped me understand the operation of Milvus to a great extent. I wanted to ask one last question before closing this thread.

Let's say I create a collection with index_file_size set to 1024 MB = 1 million vectors. I insert 10 million vectors and Milvus creates 10 segments out of this data. If I call create_index now, will there be 1 index file for all 10 million vectors, or 10 index files, one for each segment?

yhmo commented 4 years ago

Assume you create a collection named "my_data" with dimension=256 and index_file_size=1024MB. Milvus creates a folder: /milvus/db/tables/my_data. Then you insert 10M vectors, but not in one batch, since the size limit for each insert action is 256MB. Assume you insert 100000 vectors each time, about 102.4MB of data (at 1KB per 256-dimensional vector), and use the flush() API to flush the new data to disk:

time_1: insert(100000 vectors), flush(), file_1 created under /milvus/db/tables/my_data/xxxx; we call it a 'segment'
time_2: insert(100000 vectors), flush(), file_2 created
time_3: insert(100000 vectors), flush(), file_3 created
time_4: insert(100000 vectors), flush(), file_4 created
.....

After time_2, a background thread merges file_1 (102.4MB) and file_2 (102.4MB); since their sizes are less than index_file_size, a new file merged from file_1 and file_2 is created, call it file_1_2 (204.8MB), and file_1 and file_2 are deleted after a short time. After time_3, the background thread merges file_1_2 and file_3 into a new file, call it file_1_2_3 (307.2MB), and file_3 is deleted after a short time. ...... After time_10, file_1_2_..._10 (1024MB) is created; its size is greater than or equal to index_file_size, so it becomes 'immutable' and the background merge thread ignores it. ...... After all 10M vectors are inserted, there are 10 segments under the folder /milvus/db/tables/my_data, each segment folder containing a 1024MB raw-data file. When you then invoke create_index, Milvus creates an index for each segment respectively; after indexing finishes, each segment folder contains two files (the raw-data file and the index file). Each segment has 1M vectors, and each index is built from those 1M vectors.
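The flush-and-merge timeline can be simulated end to end (plain Python; assumes 1 KB per vector, the figure AyushP123 uses earlier in the thread, so 1024MB corresponds to 1 million vectors per segment):

```python
FLUSH_VECTORS = 100_000       # vectors per insert()+flush()
SEGMENT_VECTORS = 1_000_000   # ~1024MB at 1KB per vector (assumed)

def simulate(total_vectors):
    """Each flush writes a new file; the background merge thread folds it
    into the current mutable file; once a file holds a full segment's worth
    of vectors it becomes 'immutable' and the merge thread ignores it."""
    immutable = []
    mutable = 0
    for _ in range(total_vectors // FLUSH_VECTORS):
        mutable += FLUSH_VECTORS          # merge the freshly flushed file
        if mutable >= SEGMENT_VECTORS:    # reached index_file_size
            immutable.append(mutable)
            mutable = 0
    return immutable, mutable

segments, leftover = simulate(10_000_000)
print(len(segments), leftover)   # -> 10 0
```

Ten full 1M-vector segments and no mutable remainder, matching the count in the answer above; create_index would then build one index per segment.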

AyushP123 commented 4 years ago

Thank you for your response @yhmo.