jina-ai / executors

internal-only
Apache License 2.0

feat(faiss): load a trained index and add a training endpoint #127

Closed numb3r3 closed 3 years ago

numb3r3 commented 3 years ago

One additional training endpoint is introduced in this PR.

To use a trainable faiss indexer (e.g., IVF-based):

1) First, run a training flow to train the faiss indexer (alternatively, users can write a local training script using the native faiss API):

from jina import Flow
import numpy as np

# generate 10240 random 256-dimensional float32 vectors as training data
train_filepath = 'train.npy'
train_data = np.random.random((10240, 256)).astype(np.float32)
np.save(train_filepath, train_data)

f = Flow().add(
    uses="jinahub://FaissSearcher",
    timeout_ready=-1,
    uses_with={
        'index_key': 'IVF10_HNSW32,PQ64',
        'trained_index_file': 'faiss.index',
        'on_gpu': False,
    },
)

with f:
    # the trained index will be dumped to "faiss.index"
    f.post(on='/train', parameters={'train_filepath': train_filepath})

2) Then, in the query runtime, we can use FaissSearcher by providing the pre-trained index file produced in step 1, e.g.,

f = Flow().add(
    uses="jinahub://FaissSearcher",
    timeout_ready=-1,
    uses_with={
        'index_key': 'IVF10_HNSW32,PQ64',
        'trained_index_file': 'faiss.index',  # index file trained in step 1
        'on_gpu': False,
        'dump_path': '/path/to/dump_file'
    },
)

One limitation to mention:

Concurrent search/add or add/add operations are not supported (see https://github.com/facebookresearch/faiss/issues/367).

Hence the index operation cannot use multiple CPU cores to speed up.
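If one index were shared between threads in a single process, one way to respect this limitation would be to serialize every call with a lock. The sketch below is only an illustration: `ToyIndex` is a hypothetical in-memory stand-in rather than a real faiss index, and `SerializedIndex` and `add_from_threads` are made-up names that do not exist in the executor.

```python
import threading


class ToyIndex:
    """Hypothetical stand-in for a faiss index; keeps vectors in a list."""

    def __init__(self):
        self.vectors = []

    def add(self, batch):
        self.vectors.extend(batch)

    def search(self, query, k):
        # Brute-force ranking by squared L2 distance; returns positions
        def distance(i):
            return sum((a - b) ** 2 for a, b in zip(self.vectors[i], query))

        return sorted(range(len(self.vectors)), key=distance)[:k]


class SerializedIndex:
    """Wraps an index so that add and search never overlap in time."""

    def __init__(self, index):
        self._index = index
        self._lock = threading.Lock()

    def add(self, batch):
        with self._lock:
            self._index.add(batch)

    def search(self, query, k):
        # Stricter than needed: this also serializes searches with each other
        with self._lock:
            return self._index.search(query, k)


def add_from_threads(index, batches):
    """Add each batch from its own thread and wait for all of them."""
    threads = [threading.Thread(target=index.add, args=(b,)) for b in batches]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

The lock keeps the index consistent but gives no speed-up, since only one operation runs at a time.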

JoanFM commented 3 years ago

@numb3r3 Concurrent operations should not be a problem: our parallelization model is based on multiprocessing, so each instance should have its own index.

cristianmtr commented 3 years ago

Can you update the README with some notes on this change?