shenfalong opened 2 years ago
We are thinking of doing some offline calculation but
Is there an existing issue for this?
- [x] I have searched the existing issues
Is your feature request related to a problem? Please describe.
In FAISS we can train a model on a custom amount of data to get the centroids and encoding table. For example, we have 1 billion vectors in total and train IVFPQ with 10M of them. From #1992 I understand that Milvus currently trains a different model for every chunk of data of fixed index_file_size. Is it possible to train one model and fix it for all future incoming data? Or has this been implemented already?
''' The "all data" mentioned here means the current dataset. Once the inserted data size reaches index_file_size, Milvus will use this part of the data to create an index. Meanwhile the client keeps inserting data, and if the data size reaches a second index_file_size, Milvus will create a new index whose data range is the newly added data. To your question: each call to insert() only affects the current dataset, and once the current dataset has been indexed, it will not be reused again. The next create_index stage will not use the data inserted in the previous round. '''
Describe the solution you'd like.
For example, given 1B vectors, each with 128 dimensions:
Step 1: create_index(), insert(10M data), train()
Step 2: insert(1B data)
Step 3: search(query)
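Roughly, in FAISS terms, the requested workflow would look like the sketch below. Sizes are scaled down so it runs quickly, and the nlist/m/nprobe values are illustrative assumptions, not anything Milvus uses internally:

```python
import numpy as np
import faiss

d = 128                          # vector dimension from the example above
nlist, m, nbits = 1024, 16, 8    # assumed IVF/PQ parameters

# Step 1: train the coarse centroids and PQ encoding table once, on a sample.
train_sample = np.random.rand(100_000, d).astype("float32")  # stands in for the 10M sample
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
index.train(train_sample)        # after this, the "model" is fixed

# Step 2: insert the full dataset in chunks, reusing the same trained model.
for _ in range(10):              # stands in for streaming the 1B vectors
    chunk = np.random.rand(100_000, d).astype("float32")
    index.add(chunk)

# Step 3: search everything with the single shared model.
queries = np.random.rand(5, d).astype("float32")
index.nprobe = 32
distances, ids = index.search(queries, 10)
print(ids.shape)
```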
Describe an alternate solution.
No response
Anything else? (Additional Context)
No response
Very impressive suggestion!
Actually I'm thinking about how we can train on a smaller dataset (maybe another collection) to find the data distribution and then use the trained result for further calculation, for instance:
We should look at similar systems and see how they handle the preprocessing stage. @czs007 @soothing-rain @wayblink
@shenfalong Things that remain to be discussed:
- How much data can we use for preprocessing, and should the user trigger it or should it happen automatically?
- Where should we locate the preprocessing data? In another collection, or, like bulkload, as files in a specified format on S3?
- Should the querynode, the indexnode, or some other node work on this process?
- How do collections load these features?
- Does this help HNSW and other vector indexes?
I think leaving a parameter for the user to trigger it is a good choice. The system may also set some default value. The preprocessing data could be put in a temporary collection and deleted after model training. The cluster centroids and encoding table could be saved on disk as a 'model'. The trained model could then be loaded to process the full, large-scale data.
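For the "save as a model on disk" part, a minimal FAISS-flavored sketch could look like the following; the file names and parameters are just placeholders, not a proposed Milvus format:

```python
import numpy as np
import faiss

d, nlist, m, nbits = 128, 1024, 16, 8

# Train on the temporary/preprocessing dataset, then persist only the trained
# (still empty) index, i.e. the centroids and PQ encoding table.
sample = np.random.rand(100_000, d).astype("float32")
quantizer = faiss.IndexFlatL2(d)
trained = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
trained.train(sample)
faiss.write_index(trained, "ivfpq.trained")   # the saved "model"

# Later (e.g. on an indexnode), load the model and encode a new segment
# without retraining, so every segment shares the same centroids/codebooks.
model = faiss.read_index("ivfpq.trained")
segment = np.random.rand(50_000, d).astype("float32")
model.add(segment)
faiss.write_index(model, "segment_0.index")
```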
That seems like a plan~ If anyone wants to take the design and implementation, please let me know.
Seems like a quite useful feature, I'd like to take part in it.
Let's probably take a glance at other systems (for example Spark) and see what the best plan is~
/assign
Since 1 billion vectors cannot be loaded into memory at once, how would this be done?
Create 10 indexes on 10 small datasets, each with 10M vectors?
And then merge the 10 indexes?
Hi, I guess your question has no relation to this issue. You can create a new one or discuss it in the user group. For your question: currently Milvus has to load all the data into memory to query, so it needs a lot of memory if you have a large dataset. You can split it into multiple datasets and load one at a time, but that will obviously cost a lot of time. You can also use a memory-friendly index type such as DISKANN, which can save about 80% of the memory cost but needs an NVMe SSD. If you provide more info about your scenario, we can find the best solution.
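For reference, a rough pymilvus 2.x sketch of switching a collection to DISKANN; the collection and field names here are made up, and it assumes a deployment where DISKANN is enabled:

```python
from pymilvus import Collection, connections

# Hypothetical collection/field names; DISKANN requires NVMe SSD on the nodes
# that build and serve the index.
connections.connect(host="localhost", port="19530")
collection = Collection("my_collection")

collection.create_index(
    field_name="embedding",
    index_params={
        "index_type": "DISKANN",
        "metric_type": "L2",
        "params": {},
    },
)
collection.load()
```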