Open Nick17t opened 1 year ago
Hey everyone! Thanks for your interest in this project. Let me give you some more details about what we are trying to achieve here:
DocArray v2 will have a concept called Document Index. This is an abstraction that lets a user store their Documents (on disk or in a database) and retrieve them using ANN search.
As such, there can be multiple Document Indexes backed by different backends (Elastic, Qdrant, Weaviate, ...), but all following the same basic API.
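To make the "same basic API, different backends" idea concrete, here is a minimal sketch. It is not DocArray's actual interface: the names `Doc`, `BaseDocIndex`, `ExactInMemoryIndex`, `index` and `find` are hypothetical, and the backend shown does exact (brute-force) search rather than ANN, simply to keep the example dependency-free.

```python
import math
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass
class Doc:
    """Hypothetical stand-in for a Document: an id plus an embedding."""
    id: str
    embedding: List[float]


class BaseDocIndex(ABC):
    """Hypothetical common API that every backend would implement."""

    @abstractmethod
    def index(self, docs: List[Doc]) -> None:
        """Store documents in the backend."""

    @abstractmethod
    def find(self, query: List[float], limit: int = 1) -> List[Doc]:
        """Return the `limit` documents closest to `query`."""


class ExactInMemoryIndex(BaseDocIndex):
    """Toy backend: keeps documents in a list and scans all of them."""

    def __init__(self) -> None:
        self._docs: List[Doc] = []

    def index(self, docs: List[Doc]) -> None:
        self._docs.extend(docs)

    def find(self, query: List[float], limit: int = 1) -> List[Doc]:
        # Sort by Euclidean distance to the query and keep the closest ones.
        ranked = sorted(self._docs, key=lambda d: math.dist(d.embedding, query))
        return ranked[:limit]


idx = ExactInMemoryIndex()
idx.index([Doc("a", [0.0, 0.0]), Doc("b", [1.0, 1.0])])
closest = idx.find([0.9, 0.8], limit=1)
```

A backend built on an ANN library or a vector database would subclass the same base class and swap the linear scan for calls into that library, so user code written against `index` and `find` would not need to change.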
The idea behind this project is to take an ANN library and use it to implement a Document Index. There is already an implementation using HNSWLib that you can find here: https://github.com/docarray/docarray/pull/1124
But there is space to create similar backends using other libraries: Annoy, Faiss, ... The goal is to provide user choice.
If there is interest, someone could also implement a backend using a vector database. We already have Qdrant, Weaviate and Elastic covered, but Milvus, Redis and some others could also be interesting.
You can find a design doc for Document Index here: https://lightning-scent-57a.notion.site/Document-Stores-v2-design-doc-f11d6fe6ecee43f49ef88e0f1bf80b7f
If you have any questions please reach out!
Please mind that "Mentors | @Johannes" does not refer to the GitHub user @johannes, but probably @JohannesMessner. I can encourage you to do great things, but not help with the project. Have fun!
Ah, so we meet again @johannes! Snatching a common user name is a blessing and a curse I see. Thanks for the encouragement, and who knows, maybe if we keep randomly tagging you here and there one day you will be compelled to contribute as well ;)
Hello! @JohannesMessner
Thank you for sharing the details of your project and providing a clear explanation of what you are trying to achieve.
It sounds like you are working on creating a flexible and scalable document indexing solution using different ANN libraries and vector databases. Providing user choice and flexibility is always a great approach when it comes to open-source projects.
I appreciate that you have shared the design doc as well, which will help potential contributors understand the project's scope and requirements.
If I have any questions or would like to contribute to the project, I'll make sure to reach out.
Thanks again for sharing this project with us.
Best regards, Ranjan
Hello @JohannesMessner, I am interested in contributing to GSoC 2023, and my prior experience with machine learning algorithms and ANN search in Python drew me to this project. I would love to work on this under your mentorship. Below is my idea and proposed implementation, based on my experience working with various ANN libraries and the Jina architecture.
Project Description
The DocArray library makes it easy to store, process or search multi-modal data, creating a huge database for vector data. It currently performs ANN search through HNSWLib for document indexing, leaving out other available packages that could potentially perform better than the existing framework. This project aims to wrap the Annoy library, which is commonly used in industry, for DocArray.
Importance of this project
Jina already provides an hnswlib wrapper for DocArray, but other options, including the Faiss, Annoy and NGT libraries, still need to be explored, both for faster execution and to let users choose between frameworks.
Expected Results
Storing data in the form of Documents in an index, and retrieving Documents using ANN search with the Annoy library.
Project breakdown
Creating a collection of documents
Using Jina's Document Array library to create objects from the set of Documents.
Vectorizing the Document Array using a Jina Flow object with a pre-trained embedding model. As an alternative, I can also use numpy and sklearn to create an array of Document Array objects.
The Annoy index takes a dimension and a metric, which define the dimensionality of the document vectors and how the distance between these vectors is measured, respectively.
Adding each vector and its unique index ID to the index.
I will query the index object with the query vector, the number of neighbours to return, and the number of nodes to inspect, in order to search through the forest of trees built over the vectors.
The vectors obtained can be filtered, and the vector with the least distance from the query vector is returned to the user as the result of their query.
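The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the proposed Document Index: the helper name `nearest_ids` is hypothetical, ids are simply list positions, and an exact cosine-similarity fallback is included (my addition) so the snippet still runs when the third-party `annoy` package is not installed.

```python
import math

try:
    from annoy import AnnoyIndex  # third-party: pip install annoy
except ImportError:  # keep the sketch runnable without Annoy
    AnnoyIndex = None


def nearest_ids(vectors, query, k=1, n_trees=10):
    """Return ids of the k stored vectors closest to `query` by angular distance.

    Ids are the positions of the vectors in `vectors` (hypothetical helper;
    all vectors are assumed to be non-zero).
    """
    if AnnoyIndex is not None:
        # Create the index with the vector dimension and the metric.
        index = AnnoyIndex(len(query), "angular")
        # Add each vector together with its unique integer id.
        for item_id, vector in enumerate(vectors):
            index.add_item(item_id, vector)
        # Build the forest of trees; the index is read-only afterwards.
        index.build(n_trees)
        # Query for the k approximate nearest neighbours.
        return index.get_nns_by_vector(query, k)

    # Fallback: exact search, ranking by cosine similarity (same ordering
    # as angular distance).
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    ranked = sorted(range(len(vectors)),
                    key=lambda i: cosine(vectors[i], query),
                    reverse=True)
    return ranked[:k]


vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
best = nearest_ids(vectors, [0.9, 0.1, 0.0], k=1)
```

In Annoy, the "number of nodes to inspect" corresponds to the optional `search_k` argument of `get_nns_by_vector`, and passing `include_distances=True` also returns the distances, which could be used for the filtering step.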
Required Technicalities
Jina's Document Array, Python, and the numpy, sklearn and annoy libraries
Additional Area of development
In the project idea you hinted at implementing a backend for a vector database as part of this project. Jina has already achieved this with Qdrant, but Milvus could be a step ahead because of its scalability and efficiency. I propose to integrate the pymilvus library along with the ANN search to provide a visual representation of the idea and increase the overall impact of the project.
We can carry the discussion forward after your feedback. Thanks and regards, Arijit Ghosal
Hi @JohannesMessner @philipvollet, I am a Master's student studying AI at the University of Hamburg, Germany. I have knowledge of topics like statistical ML, NLP and computer vision. I have worked on multiple projects in Python, PyTorch and Keras. Apart from the data science stack, I also have experience working in Java, PHP and Swift. I came across this topic and got interested in working on it. I am a bit late to apply, but I am interested in contributing to and gaining experience from this project. Could you please help me get started with the project, and could a call be set up for a discussion?
Hi @Anirbanbhk88 @ranjan2829 @arijitghosal03
Thanks for your interest in contributing to the project. The application period has just started, and to ensure fairness we do not hold 1:1 calls during the application season, from March 20 to April 4.
📅 But we do have a webinar. Mark your calendars for the GSoC x Jina AI webinar on March 23rd at 2 pm (CET). This is an excellent opportunity to learn more about the projects and ask any questions you have about the requirements and expectations.
Our mentors will provide an in-depth overview of the projects and answer any questions you may have. So please don't hesitate to ask any questions or seek clarification on any aspect of the project.
Is there anything specific you would like to learn from the webinar? Do you have any questions about the DocArray wrap ANN libraries project that you would like to see clarified during the Q&A session? Let me know, and I'll be happy to help!
Looking forward to seeing you at the webinar, and thank you for your interest in the Jina AI community! 😊
Project idea 2: DocArray wrap ANN libraries
Project Description
In DocArray, we have been concentrating on developing production-ready Vector DBs for large-scale searches. However, there are many ANN libraries without scalability layers that can be integrated into DocArray, making it accessible to academia and production teams with small-to-medium amounts of data, without the need for external services.
DocArray v2 will have a concept called Document Index. This is an abstraction that lets a user store their Documents (on disk or in a database), and retrieve them using ANN search. As such, there can be multiple Document Indexes backed by different backends (Elastic, Qdrant, Weaviate, ...), but all following the same basic API.
The idea behind this project is to take an ANN library and use it to implement a Document Index. There is already an implementation using HNSWLib that you can find here: https://github.com/docarray/docarray/pull/1124. But there is space to create similar backends using other libraries: Annoy, Faiss, ... The goal is to provide user choice.
If there is interest, someone could also implement a backend using a vector database. We already have Qdrant, Weaviate, and Elastic covered, but Milvus, Redis, and some others could also be interesting. You can find a design doc for Document Index here.
Expected outcomes