lightonai / pylate

Late Interaction Models Training & Retrieval
https://lightonai.github.io/pylate/
MIT License
158 stars, 7 forks

Add indexing based on Weaviate #15

Closed: NohTow closed this 4 months ago

NohTow commented 4 months ago

This PR adds indexing logic to the library. It implements an index backed by the Weaviate vector database to store the embeddings. The index can be used for retrieval (candidates are generated using an HNSW index) and for reranking. It relies on a beta version of weaviate-client so that queries can be sent asynchronously and batched; without this, querying is too slow.
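For readers unfamiliar with late interaction, here is a minimal sketch of the reranking step, assuming the standard ColBERT MaxSim score (for each query token, take the best-matching document token, then sum). It is plain NumPy and is not the code in this PR: `maxsim_score`, `rerank`, and the `candidates` dictionary are hypothetical names, and the candidates would in practice come from the HNSW search rather than being passed in by hand.

```python
import numpy as np


def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late-interaction score between one query and one document.

    query_emb has shape (num_query_tokens, dim),
    doc_emb has shape (num_doc_tokens, dim).
    """
    # Similarity of every query token with every document token.
    sim = query_emb @ doc_emb.T
    # Keep the best document token per query token, then sum over query tokens.
    return float(sim.max(axis=1).sum())


def rerank(query_emb: np.ndarray, candidates: dict) -> list:
    """Sort candidate documents by decreasing MaxSim score.

    `candidates` maps a document id to its token embeddings
    (hypothetical structure, used here for illustration only).
    """
    scored = [
        (doc_id, maxsim_score(query_emb, doc_emb))
        for doc_id, doc_emb in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    query = np.array([[1.0, 0.0], [0.0, 1.0]])
    candidates = {
        "b": np.array([[1.0, 0.0]]),
        "a": np.array([[1.0, 0.0], [0.0, 1.0]]),
    }
    print(rerank(query, candidates))
```

Splitting the work this way keeps the expensive token-level comparison limited to the small candidate set returned by the approximate search.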

It also introduces some changes to make inference match the original ColBERT model as closely as possible (max query/doc length, skiplist, ...). These changes also affect training.
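As a rough illustration of what the length limits and the skiplist do, the sketch below truncates a token list and builds a mask that drops punctuation tokens, which is how the original ColBERT treats document tokens. It is not the implementation in this PR: the function names are hypothetical, the two length constants are assumed values chosen for the example, and real code would operate on tokenizer ids rather than strings.

```python
import string

# Assumed limits for the example only; the values used in the PR may differ.
QUERY_MAX_LEN = 32
DOC_MAX_LEN = 180

# Hypothetical skiplist: every ASCII punctuation character.
SKIPLIST = set(string.punctuation)


def truncate(tokens: list, max_len: int) -> list:
    """Keep at most `max_len` tokens."""
    return tokens[:max_len]


def skiplist_mask(tokens: list, skiplist: set = SKIPLIST) -> list:
    """Return True for tokens to keep and False for tokens in the skiplist."""
    return [token not in skiplist for token in tokens]


if __name__ == "__main__":
    doc_tokens = ["late", "interaction", ",", "explained", "."]
    doc_tokens = truncate(doc_tokens, DOC_MAX_LEN)
    mask = skiplist_mask(doc_tokens)
    # Only the tokens with a True mask would contribute embeddings.
    kept = [token for token, keep in zip(doc_tokens, mask) if keep]
    print(mask, kept)
```

Because the mask changes which token embeddings exist at all, it has to be applied consistently at training and inference time, which is why these changes touch both.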

Finally, it adds a BEIR evaluation script used to benchmark the results, which are consistent with those reported by Benjamin.
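The description does not say which metric the script reports. nDCG@10 is the headline metric commonly used with BEIR, so as an illustration only, here is a small sketch of how it can be computed for one query. The function names are hypothetical, and linear gain is an assumption about the variant; the actual script presumably relies on an existing evaluation library.

```python
import math


def dcg_at_k(relevances: list, k: int) -> float:
    """Discounted cumulative gain over the first k relevance values (linear gain)."""
    return sum(
        rel / math.log2(rank + 2)  # rank is 0-based, so the first hit is divided by log2(2)
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(ranked_doc_ids: list, qrels: dict, k: int = 10) -> float:
    """nDCG@k for one query.

    `ranked_doc_ids` is the system ranking, best first.
    `qrels` maps a document id to its graded relevance (missing means 0).
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal_gains = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(gains, k) / ideal_dcg


if __name__ == "__main__":
    qrels = {"d1": 1}
    print(ndcg_at_k(["d1", "d2"], qrels))  # relevant document ranked first
    print(ndcg_at_k(["d2", "d1"], qrels))  # relevant document ranked second
```

Averaging this value over all queries of a dataset gives the per-dataset number that can be compared against reference results.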