jina-ai / GSoC

Google Summer of Code

DocArray wrap ANN libraries #17

Open Nick17t opened 1 year ago

Nick17t commented 1 year ago

Project idea 2: DocArray wrap ANN libraries

Info | details
-- | --
Skills needed | Python, ANN Search experience
Project size | 175 hours
Difficulty level | Medium
Mentors | @Johannes Messner, @Sami Jaghouar, @Philip Vollet

Project Description

Expected outcomes

ranjan2829 commented 1 year ago


Project idea 2: DocArray wrap ANN libraries

Info | details
-- | --
Skills needed | Python, ANN Search experience
Project size | 175 hours
Difficulty level | Medium
Mentors | @[Johannes](https://github.com/JohannesMessner), @[Sami Jaghouar](https://github.com/samsja), @[Philip Vollet](https://www.linkedin.com/in/philipvollet)

Project Description

In DocArray, we have been concentrating on developing production-ready Vector DBs for large-scale searches. However, there are many ANN libraries without scalability layers that can be integrated into DocArray, making it accessible to academia and production teams with small-to-medium amounts of data, without the need for external services.

Expected outcomes

Jina's DocArray is a data structure that represents a list of documents with additional metadata. It exposes the documents' embeddings as arrays, which makes it straightforward to pair with popular ANN libraries like FAISS, Annoy, and Hnswlib. To wrap an ANN library around Jina's DocArray, you can follow these general steps:

Convert DocArray to a compatible format: Most ANN libraries require a specific format for the data, like a numpy array or a list of lists. Assuming the pre-0.30 `docarray` package, you can use the `embeddings` attribute of a `DocumentArray` to get the vectors as a single array. For example:

```python
import faiss
import numpy as np
from docarray import Document, DocumentArray

# Each document needs an embedding; random vectors stand in for a real encoder
doc_array = DocumentArray([
    Document(text='hello world', embedding=np.random.rand(128)),
    Document(text='foo bar', embedding=np.random.rand(128)),
])

# Convert to a float32 numpy array, the dtype FAISS expects
data = np.asarray(doc_array.embeddings, dtype='float32')
```

This code creates a DocumentArray with two documents and uses `embeddings` to convert the data to a numpy array.

Create an index: Next, you need to create an index using the ANN library. For example, you can use FAISS to create an index:

```python
# Create an index
index = faiss.IndexFlatL2(data.shape[1])
index.add(data)
```

This code creates a FAISS index and adds the data to the index.

Query the index: Finally, you can use the ANN library to query the index with a new document. For example, you can use FAISS to find the nearest neighbors of a new document:

```python
# Query the index
query_vec = np.random.rand(1, data.shape[1]).astype('float32')
k = min(10, len(doc_array))  # FAISS pads missing neighbors with -1
distances, indices = index.search(query_vec, k)

# Get the documents for the nearest neighbors
nearest_neighbors = [doc_array[int(i)] for i in indices[0]]
```

This code creates a random query vector and uses FAISS to find up to 10 nearest neighbors in the index. Then, it retrieves the documents for the nearest neighbors using the indices array.

By following these steps, you should be able to wrap an ANN library around Jina's DocArray and use it to perform nearest neighbor search or other ANN tasks. Of course, you will need to add more code to handle things like data preprocessing, index optimization, and query filtering, but this should give you a good starting point.

JohannesMessner commented 1 year ago

Hey everyone! Thanks for your interest in this project. Let me give you some more details about what we are trying to achieve here:

DocArray v2 will have a concept called Document Index. This is an abstraction that lets a user store their Documents (on disk or in a database), and retrieve them using ANN search.

As such, there can be multiple Document Indexes backed by different backends (Elastic, Qdrant, Weaviate, ...), all following the same basic API.

The idea behind this project is to take an ANN library and use it to implement a Document Index. There is already an implementation using HNSWLib that you can find here: https://github.com/docarray/docarray/pull/1124

But there is space to create similar backends using other libraries: Annoy, Faiss, ... The goal is to provide user choice.
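To make the "same basic API, different backends" idea concrete, here is a minimal sketch. The names `Doc`, `DocIndexSketch`, `ExactIndex`, `index` and `find` are illustrative assumptions, not DocArray's actual interface (the design doc is the reference for that), and the exact search in pure Python only stands in for a real ANN library such as Annoy or Faiss.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
import math


@dataclass
class Doc:
    """Hypothetical stand-in for a DocArray Document."""
    id: str
    embedding: list[float]


class DocIndexSketch(ABC):
    """Hypothetical common API that every backend would implement."""

    @abstractmethod
    def index(self, docs: list[Doc]) -> None:
        """Store documents so they can be searched later."""

    @abstractmethod
    def find(self, query: list[float], limit: int) -> list[tuple[Doc, float]]:
        """Return up to `limit` (document, distance) pairs, closest first."""


class ExactIndex(DocIndexSketch):
    """In-memory backend using exact search; an ANN library would go here."""

    def __init__(self) -> None:
        self._docs: list[Doc] = []

    def index(self, docs: list[Doc]) -> None:
        self._docs.extend(docs)

    def find(self, query: list[float], limit: int) -> list[tuple[Doc, float]]:
        # Score every stored document by Euclidean distance to the query
        scored = [(doc, math.dist(query, doc.embedding)) for doc in self._docs]
        scored.sort(key=lambda pair: pair[1])
        return scored[:limit]


# Calling code only depends on the abstract API, not on the backend
store: DocIndexSketch = ExactIndex()
store.index([Doc("a", [0.0, 0.0]), Doc("b", [1.0, 1.0]), Doc("c", [5.0, 5.0])])
hits = store.find([0.9, 0.9], limit=2)
print([doc.id for doc, _ in hits])
```

Because the calling code is typed against the abstract class, a backend built on a different library could be swapped in without changing how documents are indexed or queried.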

If there is interest, someone could also implement a backend using a vector database. We already have Qdrant, Weaviate and Elastic covered, but Milvus, Redis and some others could also be interesting.

You can find a design doc for Document Index here: https://lightning-scent-57a.notion.site/Document-Stores-v2-design-doc-f11d6fe6ecee43f49ef88e0f1bf80b7f

If you have any questions please reach out!

johannes commented 1 year ago

Please mind that "Mentors | @Johannes" does not refer to the GitHub user @johannes, but probably @JohannesMessner. I can encourage you to do great things, but not help with the project. Have fun!

JohannesMessner commented 1 year ago

Ah, so we meet again @johannes! Snatching a common user name is a blessing and a curse I see. Thanks for the encouragement, and who knows, maybe if we keep randomly tagging you here and there one day you will be compelled to contribute as well ;)

ranjan2829 commented 1 year ago

Hello! @JohannesMessner

Thank you for sharing the details of your project and providing a clear explanation of what you are trying to achieve.

It sounds like you are working on creating a flexible and scalable document indexing solution using different ANN libraries and vector databases. Providing user choice and flexibility is always a great approach when it comes to open-source projects.

I appreciate that you have shared the design doc as well, which will help potential contributors understand the project's scope and requirements.

If I have any questions or would like to contribute to the project, I'll make sure to reach out.

Thanks again for sharing this project with us.

Best regards, Ranjan

arijitghosal03 commented 1 year ago

Hello @JohannesMessner, I am interested in contributing to GSoC 2023, and my prior experience with machine learning algorithms and ANN search in Python drew me to this project. I would love to work on it under your mentorship. Below is my idea and proposed implementation, based on my experience working with various ANN libraries and the Jina architecture.

PROJECT IDEA : DocArray wrap ANN libraries

Project Description

The DocArray library makes it easy to store, process and search multi-modal data, and can act as a store for vector data. Its Document Index currently performs ANN search using HNSWLib, leaving out other available libraries that could potentially perform better. This project aims to wrap the Annoy library, which is widely used in industry, as a Document Index backend.

Importance of this project

Jina already provides an hnswlib wrapper for DocArray, but other options, including the FAISS, Annoy and NGT libraries, need to be explored for faster execution and to let users choose between frameworks.

Expected Results

Storing data in the form of Documents, indexing them, and retrieving documents using ANN search with the Annoy library.

Project breakdown

  1. Document creation
  2. Annoy library indexing
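
The two steps above could be wired together roughly as follows. This is only a sketch: `Doc`, `AnnoyBackedIndex` and `ExactStandIn` are hypothetical names, and `ExactStandIn` is an exact-search placeholder that mimics the three `AnnoyIndex` methods used here (`add_item`, `build`, `get_nns_by_vector`) so the example runs without Annoy installed. Passing `annoy.AnnoyIndex(dim, 'euclidean')` as the backend should work the same way, though that is an assumption not verified here.

```python
import math
from dataclasses import dataclass


@dataclass
class Doc:
    """Step 1: a document carrying an id, its text and an embedding."""
    id: str
    text: str
    embedding: list[float]


class ExactStandIn:
    """Placeholder exposing the AnnoyIndex methods used below (exact search)."""

    def __init__(self) -> None:
        self._items: dict[int, list[float]] = {}

    def add_item(self, i: int, vector: list[float]) -> None:
        self._items[i] = list(vector)

    def build(self, n_trees: int) -> None:
        pass  # nothing to build for exact search

    def get_nns_by_vector(self, vector, n, include_distances=False):
        # Rank stored integer ids by Euclidean distance to the query vector
        ranked = sorted(
            self._items, key=lambda i: math.dist(vector, self._items[i])
        )[:n]
        if include_distances:
            return ranked, [math.dist(vector, self._items[i]) for i in ranked]
        return ranked


class AnnoyBackedIndex:
    """Step 2: maps documents to the integer item ids the backend expects."""

    def __init__(self, backend, n_trees: int = 10) -> None:
        self._backend = backend
        self._n_trees = n_trees
        self._docs: list[Doc] = []  # position in this list == integer item id

    def index(self, docs: list[Doc]) -> None:
        for doc in docs:
            self._backend.add_item(len(self._docs), doc.embedding)
            self._docs.append(doc)
        # Annoy does not accept new items after build(), so build once here
        self._backend.build(self._n_trees)

    def find(self, query: list[float], limit: int) -> list[Doc]:
        ids = self._backend.get_nns_by_vector(query, limit)
        return [self._docs[i] for i in ids]


idx = AnnoyBackedIndex(ExactStandIn())
idx.index([
    Doc("1", "hello world", [0.0, 1.0]),
    Doc("2", "foo bar", [1.0, 0.0]),
    Doc("3", "baz qux", [0.9, 0.1]),
])
results = idx.find([1.0, 0.0], limit=2)
print([doc.text for doc in results])
```

The list position doubles as the integer item id, which keeps the mapping from search results back to documents trivial; a persistent backend would need to store that mapping alongside the index file.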

Required Technicalities

Jina's DocumentArray, Python, and the numpy, sklearn and annoy libraries

Additional Area of development

In the project idea you hinted at implementing a vector database backend as part of this project. Jina already has this for Qdrant, but Milvus could be a step ahead because of its scalability and efficiency. I propose to integrate the pymilvus library alongside the ANN search work to provide a visual representation of the idea and increase the overall impact of the project.

We can carry forward the discussion after your feedback. Thanks and Regards Arijit Ghosal

Anirbanbhk88 commented 1 year ago

Hi @JohannesMessner @philipvollet, I am a Master's student studying AI at the University of Hamburg, Germany. I have knowledge of topics like statistical ML, NLP and computer vision, and I have worked on multiple projects in Python, PyTorch and Keras. Apart from the data science stack, I also have experience working in Java, PHP and Swift. I came across this topic and became interested in working on it. I am a bit late to apply, but I would like to contribute and gain experience from this project. Could you please help me get started with the project, and could a call be set up for a discussion?

Nick17t commented 1 year ago

Hi @Anirbanbhk88 @ranjan2829 @arijitghosal03

Thanks for your interest in contributing to the project. The application period has just started. To ensure fairness, we do not offer 1:1 calls during the application season from March 20 to April 4.

📅 But we do have a webinar: mark your calendars for the GSoC x Jina AI webinar on March 23rd at 2 pm (CET). This is an excellent opportunity to learn more about the projects and ask any questions you have about the requirements and expectations.

Our mentors will provide an in-depth overview of the projects and answer any questions you may have. So please don't hesitate to ask any questions or seek clarification on any aspect of the project.

Is there anything specific you would like to learn from the webinar? Do you have any questions about the DocArray wrap ANN libraries project that you would like to see clarified during the Q&A session? Let me know, and I'll be happy to help!

Looking forward to seeing you at the webinar, and thank you for your interest in the Jina AI community! 😊