jina-ai / annlite

⚡ A fast embedded library for approximate nearest neighbor search
Apache License 2.0

acquire train data from lmdb #148

Open jemmyshin opened 2 years ago

jemmyshin commented 2 years ago

Sometimes we need to train the PCA model after the indexer has already been created. For example, memory may become an issue after we have indexed thousands or even millions of items, and we need PCA to reduce it.

To do that we need to fetch the training data from LMDB, but this gets tricky once we move to jcloud, since the data must then be fetched from the server instead of the local machine.

One way to solve this is to add a new endpoint called /fetch that the client can call:

data = client.post('/fetch', params={'batch_size': 1024})

For training, we can use partial_train():

annlite.partial_train(data)
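
To make the batch-wise idea concrete, here is a minimal sketch in plain Python. Everything in it is hypothetical and only for illustration: `fetch_batches` stands in for the proposed /fetch endpoint (a dict plays the role of the LMDB store), and `RunningCovariance.partial_train` is not annlite's API. It only accumulates the mean and covariance, the statistics a PCA would be derived from; the eigendecomposition step is left out.

```python
from typing import Dict, Iterator, List

Vector = List[float]


def fetch_batches(store: Dict[str, Vector], batch_size: int) -> Iterator[List[Vector]]:
    """Yield stored vectors in batches (hypothetical stand-in for /fetch)."""
    batch: List[Vector] = []
    for vec in store.values():
        batch.append(vec)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller batch
        yield batch


class RunningCovariance:
    """Accumulates sums so mean/covariance are available after any number of batches."""

    def __init__(self, dim: int) -> None:
        self.n = 0
        self.sum = [0.0] * dim
        self.outer = [[0.0] * dim for _ in range(dim)]

    def partial_train(self, batch: List[Vector]) -> None:
        # Update the running sum and the sum of outer products, one vector at a time.
        for x in batch:
            self.n += 1
            for i, xi in enumerate(x):
                self.sum[i] += xi
                for j, xj in enumerate(x):
                    self.outer[i][j] += xi * xj

    def mean(self) -> Vector:
        return [s / self.n for s in self.sum]

    def covariance(self) -> List[Vector]:
        # Population covariance: E[x x^T] - mean mean^T.
        m = self.mean()
        d = len(m)
        return [[self.outer[i][j] / self.n - m[i] * m[j] for j in range(d)]
                for i in range(d)]


# Toy data in place of the indexed vectors.
store = {"a": [1.0, 2.0], "b": [3.0, 6.0], "c": [5.0, 10.0]}
stats = RunningCovariance(dim=2)
for batch in fetch_batches(store, batch_size=2):
    stats.partial_train(batch)
```

The point of the sketch is that the trainer never needs the whole dataset in memory at once, so the batches could just as well arrive from a remote server as from a local store.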

JoanFM commented 2 years ago

Let's make sure we are not overcomplicating the indexer and its usage.

JoanFM commented 2 years ago

If this LMDB approach is not going to work on jcloud, please do not proceed with this idea. If less memory is needed, for now I would just expect proper configuration from the beginning.

JoanFM commented 2 years ago

What will be the difference between this training data and the indexed data?