jina-ai / GSoC

Google Summer of Code

Make ANNLite the go-to Vector Search library to be scaled by Jina using the StatefulExecutor feature #20

Open Nick17t opened 1 year ago

Nick17t commented 1 year ago

Project idea 5: Make ANNLite the go-to Vector Search library to be scaled by Jina using the StatefulExecutor feature

| Info | Details |
| --- | --- |
| Skills needed | ANN, C++, Python, Databases |
| Project size | 350 hours |
| Difficulty level | Hard |
| Mentors | @Felix Wang @Joan Martínez |

Project Description

Expected outcomes

More Info

JoanFM commented 1 year ago

ANNLite is a vector search library developed by Jina that uses HNSW as its search algorithm. On top of this, it supports filtering on Documents.
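To make "vector search plus filtering" concrete, here is a minimal sketch in plain NumPy. It is not ANNLite's API: it swaps HNSW for exact brute-force search, and the names `filtered_search`, `predicate`, and the `price` field are hypothetical, chosen only for illustration.

```python
import numpy as np

def filtered_search(embeddings, metadata, query, predicate, limit=2):
    """Return (row index, distance) pairs of the nearest rows that pass `predicate`."""
    # Keep only the candidates whose metadata satisfies the filter.
    candidates = [i for i, meta in enumerate(metadata) if predicate(meta)]
    if not candidates:
        return []
    # Exact Euclidean distances; an HNSW index would approximate this step.
    dists = np.linalg.norm(embeddings[candidates] - query, axis=1)
    order = np.argsort(dists)[:limit]
    return [(candidates[j], float(dists[j])) for j in order]

embeddings = np.array([[1.0, 3.0], [1.0, 1.0], [4.0, 4.0]])
metadata = [{"price": 10}, {"price": 80}, {"price": 20}]  # hypothetical fields
query = np.array([1.0, 1.0])

# Row 1 is the closest vector, but it is excluded by the price filter.
hits = filtered_search(embeddings, metadata, query, lambda m: m["price"] <= 50)
print(hits)
```

The sketch filters before ranking, so the closest vector overall can be dropped if it fails the filter. How a real library combines the filter with the graph traversal is a separate design question.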

As a simple library, however, it has limited scalability. By wrapping it in a Jina Executor, one may be able to add a replication and sharding layer easily. The scalability and performance of this solution remain to be seen. The aim of this project is to make sure ANNLite can be used with an Executor in this way.
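The sharding idea can be sketched without Jina at all: each shard searches only its own slice of the data, and a merge step combines the partial results into a global top-k. This is a toy illustration under stated assumptions, not how Jina implements it. `search_shard` and `sharded_search` are hypothetical names, and the per-shard search is exact rather than HNSW-based.

```python
import heapq
import numpy as np

def search_shard(shard, query, limit):
    """Exact top-`limit` search inside one shard; returns (distance, doc_id) pairs."""
    ids, vectors = shard
    dists = np.linalg.norm(vectors - query, axis=1)
    order = np.argsort(dists)[:limit]
    return [(float(dists[i]), ids[i]) for i in order]

def sharded_search(shards, query, limit=2):
    """Query every shard, then merge the partial results into a global top-`limit`."""
    partial = [hit for shard in shards for hit in search_shard(shard, query, limit)]
    return heapq.nsmallest(limit, partial)

# Two shards, each holding a disjoint subset of the documents.
shards = [
    (["a", "c"], np.array([[1.0, 3.0], [5.0, 5.0]])),
    (["b", "d"], np.array([[1.0, 1.0], [2.0, 1.0]])),
]
results = sharded_search(shards, np.array([1.0, 1.0]))
print(results)
```

Each shard has to return its own top `limit` hits, because the global best matches may all live in a single shard, as they do here.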

Relevant documentation to follow:

kronsbein commented 1 year ago

@JoanFm @numb3r3 @Nick17t

Hey everyone, I'm a CS student from Berlin and interested in contributing to this project. I briefly went through the provided references and have a couple of general questions first:

Thank you and I'm happy to discuss further!

Best, Marvin

JoanFM commented 1 year ago

Hello @kronsbein,

Answers to your questions:

  1. Yes, it is based on this paper

  2. This will be a beta feature. We will likely analyze its performance and see at what scale the solution works.

  3. No, it is out of scope. The point is that with Jina we want to be able to handle stateful payloads with our Executor abstraction, without the need for external services that require their own orchestration. We want to evaluate to what extent and at what scale this can be achieved with StatefulExecutor plus a vector search library such as ANNLite.

Thanks,

Joan
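As a rough illustration of what "stateful without external services" can mean, the sketch below keeps an index in process memory and persists it to local disk so that a restarted or newly added replica can load the same state. Everything here is hypothetical: `TinyStatefulIndex`, `snapshot`, and `restore` are illustrative names, not Jina's StatefulExecutor API, and a real replication layer would also need to keep replicas consistent while writes arrive.

```python
import json
import tempfile
from pathlib import Path

class TinyStatefulIndex:
    """Hypothetical in-process index whose state survives a restart."""

    def __init__(self):
        self.vectors = {}  # doc id -> embedding as a list of floats

    def index(self, doc_id, embedding):
        self.vectors[doc_id] = list(embedding)

    def snapshot(self, path):
        # Persist the full state to local disk; no external database involved.
        Path(path).write_text(json.dumps(self.vectors))

    @classmethod
    def restore(cls, path):
        # Build a fresh instance from a previously written snapshot.
        restored = cls()
        restored.vectors = json.loads(Path(path).read_text())
        return restored

with tempfile.TemporaryDirectory() as tmp:
    snapshot_file = Path(tmp) / "index.json"
    original = TinyStatefulIndex()
    original.index("a", [1.0, 3.0])
    original.index("b", [1.0, 1.0])
    original.snapshot(snapshot_file)
    # A second instance stands in for a replica started later.
    replica = TinyStatefulIndex.restore(snapshot_file)

print(replica.vectors == original.vectors)
```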

Ahmed-Emad10 commented 1 year ago

@JoanFM @numb3r3 Hello, I'm Ahmed from Egypt, a student in the Computer Engineering Department of the Faculty of Engineering at Cairo University. I'm interested in this project and would like to participate. What should I do to get started, and is there anything I need to learn before joining? I don't know much about this project yet, so are the links provided above sufficient? I appreciate your thoughts and time, and I hope to hear from you soon. Best regards

Hansolo1103 commented 1 year ago

@JoanFM @numb3r3

Hello! I am Sohan Mishra, a student at the National Institute of Technology (NIT). I have read the docarray docs and would like to work on this project. I have experience with Python and C++ and have been learning a bit about ANNLite for the past few days.

From what I understand from reading the first link under "More Info": the doc discusses using a document store (e.g., SQLite or Redis) as a storage backend for DocumentArrays to provide longer persistence and faster retrieval. A DocumentArray backed by a document store looks and feels almost the same as a regular in-memory DocumentArray, which makes it easy to switch between backends. The section explains how to initialize a DocumentArray with an external storage backend and how to create, retrieve, update, and delete Documents. It also introduces the concept of subindices for multimodal or nested data, and it summarizes the key functionalities of document stores: vector search, vector search + filter, and filter.

amangupta201 commented 1 year ago

With storage='annlite', AnnLiteIndexer indexes Documents into a DocumentArray. Here, the DocumentArray makes effective use of AnnLite to store and search Documents. The following shows the code snippet for the vector search:

from jina import Flow
from docarray import Document
import numpy as np

f = Flow().add(
    uses='jinahub://AnnLiteIndexer',
    uses_with={'n_dim': 2},
)

with f:
    f.post(
        on='/index',
        inputs=[
            Document(id='a', embedding=np.array([1, 3])),
            Document(id='b', embedding=np.array([1, 1])),
        ],
    )

    docs = f.post(
        on='/search',
        inputs=[Document(embedding=np.array([1, 1]))],
    )

    # will print "The ID of the best match of [1,1] is: b"
    print('The ID of the best match of [1,1] is: ', docs[0].matches[0].id)

Nick17t commented 1 year ago

Hi @kronsbein @Ahmed-Emad10 @Hansolo1103

I am delighted to hear that you are interested in contributing to the Jina AI community! 🎉

To get started, please take a moment to fill out our survey so that we can learn more about you and your skills.

Also, don't forget to mark your calendars for the GSoC x Jina AI webinar on March 23rd at 2 pm (CET). This is an excellent opportunity to learn more about the projects and ask any questions you have about the requirements and expectations.

Our mentors will provide an in-depth overview of the projects and answer any questions you may have. So please don't hesitate to ask any questions or seek clarification on any aspect of the project.

Is there anything specific you would like to learn from the webinar? Do you have any questions about the "Make ANNLite the go-to Vector Search library to be scaled by Jina using the StatefulExecutor feature" project that you would like to see clarified during the Q&A session? Let me know, and I'll be happy to help!

Looking forward to seeing you at the webinar, and thank you for your interest in the Jina AI community! 😊