jina-ai / GSoC

Google Summer of Code

Expand ANNLite capabilities with BM25 to build Hybrid Search #19

Open Nick17t opened 1 year ago

Nick17t commented 1 year ago

Project idea 4: Expand ANNLite capabilities with BM25 to build Hybrid Search

Skills needed: Python, C++, Lucene, ANN, Inverted Index
Project size: 350 hours
Difficulty level: Hard
Mentors: @Felix Wang, @Joan Martínez, @Girish Chandrashekar

Project Description

Resources:

Expected outcomes

JoanFM commented 1 year ago

ANNLite is a vector search library developed by Jina that uses HNSW as its search algorithm. On top of this, it supports filtering on Documents.

However, search systems often perform better when they combine vector search algorithms with traditional text-search ones, such as BM25, to get the best of both worlds.

This project is about evaluating and trying to apply Hybrid Search approaches on top of ANNLite.
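To make the idea concrete, here is a minimal sketch of one possible hybrid approach: score documents with BM25, score them again by vector similarity, and merge the two rankings with reciprocal rank fusion (RRF). It does not use the ANNLite API. A brute-force cosine scan stands in for HNSW, and every function name, parameter default, and the choice of RRF are assumptions made for illustration, not a design chosen for this project.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine_scores(query_vec, doc_vecs):
    """Brute-force cosine similarity (a stand-in for an HNSW lookup)."""
    def cos(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        nu = math.sqrt(sum(x * x for x in u))
        nv = math.sqrt(sum(y * y for y in v))
        return dot / (nu * nv) if nu and nv else 0.0
    return [cos(query_vec, v) for v in doc_vecs]


def ranking(scores):
    """Return document indices sorted by descending score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several rankings: each adds 1 / (k + rank) per document."""
    fused = Counter()
    for r in rankings:
        for rank, doc_id in enumerate(r, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=lambda i: fused[i], reverse=True)


def hybrid_search(query_tokens, query_vec, docs_tokens, doc_vecs):
    """Rank documents by fusing the BM25 ranking and the vector ranking."""
    text_rank = ranking(bm25_scores(query_tokens, docs_tokens))
    vec_rank = ranking(cosine_scores(query_vec, doc_vecs))
    return reciprocal_rank_fusion([text_rank, vec_rank])


# Toy corpus with made-up 2-dimensional embeddings.
docs = [
    "hnsw graph index for vector search".split(),
    "bm25 ranks text by term frequency".split(),
    "hybrid search combines bm25 and vector search".split(),
]
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
result = hybrid_search("bm25 vector search".split(), [0.6, 0.8], docs, doc_vecs)
```

In this toy example the third document ranks first in both the BM25 ranking and the vector ranking, so it also comes first after fusion. RRF only needs ranks, which sidesteps the problem that BM25 scores and cosine similarities live on different scales.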

Resources:

matchyc commented 1 year ago

Hi, Michael here. I'm familiar with ANNS, including various graph-based indexes (NSG, HNSW, Vamana, etc.), and I'm a contributor to the Milvus community (an advanced vector database project). I'm trying to understand the details of ANNLite. I will deliver my proposal draft as soon as possible.

Just one concern: does this project need to integrate models to generate sparse vectors? My understanding is that we only need to focus on hybrid search (and maybe hybrid index construction), not on how the input vectors (dense or sparse) are produced. Am I correct?

JoanFM commented 1 year ago

That is correct. The project should not care about how the vectors are created, at least at the beginning.
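As a rough illustration of that scope, the sketch below assumes both representations arrive precomputed: a dense vector as a list of floats and a sparse vector as a dict mapping dimension index to weight. The function names, the `alpha` parameter, and the weighted-sum scoring are all hypothetical. This is just one simple way to combine the two signals, not a decision made for this project.

```python
def dense_dot(u, v):
    """Inner product of two dense vectors of equal length."""
    return sum(x * y for x, y in zip(u, v))


def sparse_dot(u, v):
    """Inner product of two sparse vectors stored as {dimension: weight}."""
    # Iterate over the smaller dict so the cost follows the shorter vector.
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v[i] for i, w in u.items() if i in v)


def hybrid_score(q_dense, q_sparse, d_dense, d_sparse, alpha=0.5):
    """Weighted sum of dense and sparse similarity; alpha weights the dense part."""
    dense_part = dense_dot(q_dense, d_dense)
    sparse_part = sparse_dot(q_sparse, d_sparse)
    return alpha * dense_part + (1 - alpha) * sparse_part


# Precomputed inputs: how they were produced is out of scope here.
score = hybrid_score([1.0, 0.0], {3: 2.0}, [1.0, 0.0], {3: 1.0, 7: 4.0}, alpha=0.5)
```

Setting `alpha=1.0` reduces this to pure dense search and `alpha=0.0` to pure sparse search, which makes it easy to compare the hybrid against each baseline.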

matchyc commented 1 year ago

I'm going to submit the proposal with a detailed framework design, but should I talk to mentors before submitting it?

Nasafato commented 1 year ago

I think you can submit first. I just submitted for this as well.