Nick17t commented 1 year ago

Project idea 5: Make ANNLite the go-to Vector Search library to be scaled by Jina using the StatefulExecutor feature

info	details
Skills needed	ANN, C++, Python, Databases
Project size	350 hours
Difficulty level	Hard
Mentors	@Felix Wang @Joan Martínez

Project Description

Jina is developing a stateful executor feature that enables Deployments with a state to be replicated and scaled. This opens the door to having a Vector Database in our ecosystem effectively and robustly. Iterating on ANNLite to act as the "Lucene" for Jina would be a great opportunity.

Expected outcomes

Prove out and publish an Executor in our Hub that uses ANNLite, or DocArray with ANNLite as a backend, to serve as the default vector database for all our examples with mid-sized data requirements.

More Info

In DocArray (v2) Document Stores (soon to be renamed to "Document Index"), we want to support multiple vector DBs and ANN libraries to give more options to the user
You can read more about DocArray here
And about Document Stores
But note that we are currently working on v2 of DocArray, which will be quite different. You can read more here
And for Document Stores in v2, see this PR: https://github.com/docarray/docarray/pull/1124

JoanFM commented 1 year ago

ANNLite is a vector search library developed by Jina that uses HNSW as its search algorithm. On top of this, it supports filtering on Documents.
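To make the "search plus filtering" idea concrete, here is a minimal sketch in plain Python. It is only an illustration: it scans every Document exhaustively instead of using HNSW, and the names used (`filtered_search`, the `price` tag) are made up for this example and are not ANNLite's API.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def filtered_search(docs, query, predicate, limit=2):
    """Return up to `limit` docs whose tags pass `predicate`, nearest first."""
    # Apply the filter first, then rank the survivors by distance to the query.
    candidates = [d for d in docs if predicate(d["tags"])]
    return sorted(candidates, key=lambda d: euclidean(d["embedding"], query))[:limit]


# Hypothetical toy data: three Documents with an embedding and a `price` tag.
docs = [
    {"id": "a", "embedding": [1.0, 3.0], "tags": {"price": 5}},
    {"id": "b", "embedding": [1.0, 1.0], "tags": {"price": 50}},
    {"id": "c", "embedding": [2.0, 1.0], "tags": {"price": 8}},
]

hits = filtered_search(docs, query=[1.0, 1.0], predicate=lambda t: t["price"] < 10)
hit_ids = [d["id"] for d in hits]
print(hit_ids)
```

Document `b` is the closest vector but is excluded by the filter, so the result contains only Documents that satisfy the condition, ordered by distance.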

However, as a simple library, it has limited scalability. By wrapping it in a Jina Executor, one may be able to add a replication and sharding layer easily. The scalability and performance of this solution remain to be seen. The aim of this project is to make sure ANNLite can be used with an Executor in this way.
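As a rough sketch of what a sharding layer does (not how Jina implements it), the plain-Python example below splits Documents across shards, sends each query to every shard, and merges the partial results into a global top-k. The classes `Shard` and `ShardedIndex` are hypothetical, and each shard uses an exhaustive scan where a real one would hold its own ANN index.

```python
import heapq
import math
import zlib


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class Shard:
    """One partition of the index (exhaustive scan stands in for an ANN index)."""

    def __init__(self):
        self.docs = {}

    def index(self, doc_id, embedding):
        self.docs[doc_id] = embedding

    def search(self, query, limit):
        scored = ((euclidean(emb, query), doc_id) for doc_id, emb in self.docs.items())
        return heapq.nsmallest(limit, scored)


class ShardedIndex:
    """Routes writes to one shard and fans reads out to all shards."""

    def __init__(self, num_shards):
        self.shards = [Shard() for _ in range(num_shards)]

    def index(self, doc_id, embedding):
        # crc32 gives a hash that is stable across processes, unlike str hash().
        shard_no = zlib.crc32(doc_id.encode()) % len(self.shards)
        self.shards[shard_no].index(doc_id, embedding)

    def search(self, query, limit):
        # Each shard returns its local top-k; the global top-k is among them.
        partial = [hit for shard in self.shards for hit in shard.search(query, limit)]
        return heapq.nsmallest(limit, partial)


sharded = ShardedIndex(num_shards=2)
for doc_id, emb in {"a": [1, 3], "b": [1, 1], "c": [2, 1], "d": [5, 5]}.items():
    sharded.index(doc_id, emb)

results = sharded.search([1, 1], limit=2)
result_ids = [doc_id for _, doc_id in results]
print(result_ids)
```

Asking every shard for `limit` results is what makes the merge exact here: whichever shard a Document lands on, the global nearest neighbours are always among the per-shard candidates.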

Relevant documentation to follow:

ANNLite github [https://github.com/jina-ai/annlite]
RAFT algorithm (to provide consensus) [https://github.com/hashicorp/raft]
PR in Jina to add Stateful Executor capacity [https://github.com/jina-ai/jina/pull/5564]
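For readers new to consensus, the sketch below illustrates the core idea a replicated index relies on: if every replica applies the same ordered log of operations, every replica ends up with the same state. This is a simplification under stated assumptions, not an implementation of RAFT. It skips leader election, terms, and failure handling, and the names `apply_log` and `is_committed` are made up for illustration.

```python
def apply_log(log):
    """Deterministically rebuild index state from an ordered operation log."""
    state = {}
    for op, doc_id, embedding in log:
        if op == "index":
            state[doc_id] = embedding
        elif op == "delete":
            state.pop(doc_id, None)
    return state


def is_committed(acks, cluster_size):
    """An entry counts as committed once a strict majority has stored it."""
    return acks > cluster_size // 2


# The ordered log that the replicas have agreed on.
log = [
    ("index", "a", [1, 3]),
    ("index", "b", [1, 1]),
    ("delete", "a", None),
    ("index", "c", [2, 1]),
]

# Three replicas replay the same log independently.
replica_states = [apply_log(log) for _ in range(3)]
print(replica_states[0])
print(is_committed(acks=2, cluster_size=3))
```

Agreeing on the order of the log is the hard part that a consensus algorithm solves; applying it is the easy part shown here.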

kronsbein commented 1 year ago

@JoanFm @numb3r3 @Nick17t

Hey everyone, I'm a CS student from Berlin and interested in contributing to this project. I briefly went through the provided references and have a couple of general questions first:

I read about HNSW here on arxiv and was wondering if ANNLite's similarity search logic is based on this paper?
I am trying to understand the stateful executor from the linked PR. Is a comprehensive performance analysis of a potential approach also within the scope of this project?
Lastly, out of curiosity, are there any plans to add other open source vector similarity searches like Qdrant, Elastic?

Thank you and I'm happy to discuss further!

Best, Marvin

JoanFM commented 1 year ago

Hello @kronsbein ,

Answers to your questions:

Yes, it is based on this paper
This will be a beta feature, and we will potentially analyze the performance and see the scale at which the solution works.
No, it is out of scope. The point is that with Jina we want to be able to handle Stateful payloads with our Executor abstraction, without the need for external services that require their own orchestration. We want to evaluate to what extent and scale this can be achieved with StatefulExecutor plus a vector search library such as ANNLite.

Thanks,

Joan

Ahmed-Emad10 commented 1 year ago

@JoanFM @numb3r3 Hello, I'm Ahmed from Egypt, a student in the Computer Engineering Department of the Faculty of Engineering at Cairo University. I'm interested in this project and want to participate. I just wanted to know what to do and whether there is anything I have to learn to join. Also, I don't know a lot about this project, so are the links provided above sufficient? I appreciate your thoughts and time. I hope to hear from you soon. Best regards

Hansolo1103 commented 1 year ago

@JoanFM @numb3r3

Hello! I am Sohan Mishra, a student at National Institute of Technology (NIT). I have read the DocArray docs and would like to work on this project. I have experience with Python and C++ and have been learning a little bit about ANNLite for the past few days.

From what I understand from reading the first link under "More Info": the doc discusses using a document store (e.g., SQLite or Redis) as a storage backend for DocumentArrays to provide longer persistence and faster retrieval. A DocumentArray with a document store looks and feels almost the same as a regular in-memory DocumentArray, allowing easy switching between backends. The section explains how to initialize a DocumentArray with an external storage backend and how to create, retrieve, update, and delete Documents. It also introduces the concept of subindices for multimodal or nested data, and it summarizes the key functionalities of document stores, including vector search, vector search + filter, and filter.

amangupta201 commented 1 year ago

With storage='annlite', AnnLiteIndexer indexes Documents into a DocumentArray. Here, the DocumentArray makes effective use of AnnLite to store and search Documents. The following shows the code snippet for the vector search:

from jina import Flow
from docarray import Document
import numpy as np

f = Flow().add(
    uses='jinahub://AnnLiteIndexer',
    uses_with={'n_dim': 2},
)

with f:
    f.post(
        on='/index',
        inputs=[
            Document(id='a', embedding=np.array([1, 3])),
            Document(id='b', embedding=np.array([1, 1])),
        ],
    )

    docs = f.post(
        on='/search',
        inputs=[Document(embedding=np.array([1, 1]))],
    )

    # will print "The ID of the best match of [1,1] is: b"
    print('The ID of the best match of [1,1] is: ', docs[0].matches[0].id)
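As a quick sanity check of the expected result, the arithmetic can be reproduced without Jina. The sketch below assumes nothing about which metric the indexer is configured with; it simply computes both Euclidean distance and cosine similarity for the two indexed vectors and shows that `b` wins under either.

```python
import math

query = [1, 1]
candidates = {"a": [1, 3], "b": [1, 1]}


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Smallest distance wins for Euclidean, largest similarity wins for cosine.
nearest = min(candidates, key=lambda k: euclidean(candidates[k], query))
most_similar = max(candidates, key=lambda k: cosine_similarity(candidates[k], query))

print(nearest, most_similar)
```

The vector for `b` is identical to the query, so its distance is 0, while `a` is at distance 2 with a cosine similarity of roughly 0.89.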

Nick17t commented 1 year ago

Hi @kronsbein @Ahmed-Emad10 @Hansolo1103

I am delighted to hear that you are interested in contributing to the Jina AI community! 🎉

To get started, please take a moment to fill out our survey so that we can learn more about you and your skills.

Also, don't forget to mark your calendars for the GSoC x Jina AI webinar on March 23rd at 2 pm (CET). This is an excellent opportunity to learn more about the projects and ask any questions you have about the requirements and expectations.

Our mentors will provide an in-depth overview of the projects and answer any questions you may have. So please don't hesitate to ask any questions or seek clarification on any aspect of the project.

Is there anything specific you would like to learn from the webinar? Do you have any questions about the Make ANNLite the go-to Vector Search library to be scaled by Jina using the StatefulExecutor feature project that you would like to see clarified during the Q&A session? Let me know, and I'll be happy to help!

Looking forward to seeing you at the webinar, and thank you for your interest in the Jina AI community! 😊

jina-ai / GSoC

Make ANNLite the go-to Vector Search library to be scaled by Jina using the StatefulExecutor feature #20
