superlinked / VectorHub

VectorHub is a free, open-source learning website for people (from software developers to senior ML architects) interested in adding vector retrieval to their ML stack.
https://superlinked.com/vectorhub/

Added vector indexes blog md file #422

Closed: haziqa5122 closed this 1 week ago

haziqa5122 commented 1 month ago

In this blog, we discuss vector indexing, its practical use cases, and implementation.

haziqa5122 commented 1 month ago

Thanks for the suggestions, Mór Kapronczay. Let me review and get back to you!

On Wed, Jul 17, 2024, 3:32 PM Mór Kapronczay wrote:

Mór Kapronczay commented on this pull request.

Thank you for your upgraded submission! It was an interesting read. Left some comments for minor nitpicks and smaller clarifications.

Also, I think there is something wrong with the code snippet rendering - I am not sure what, though, because the ```s seem to be in order. Please check.

In docs/articles/Vector-Indexes.md https://github.com/superlinked/VectorHub/pull/422#discussion_r1680804741 :

+A major benefit of vector data is that it allows for similarity search using vector indexes. The similarity search opens avenues for various new applications like Retrieval Augmented Generation (RAG). This blog will discuss vector indexes in detail. We will also look at popular types of indexing and implement vector indexing in Python using LlamaIndex.
+
+## Overview of Vector Indexes
+Vector data is often called ‘Embeddings’, and these are generated via complex mathematical operations. The embeddings are a dense data representation, capturing key attributes describing the information. These attributes allow vectors to be arranged so that similar data points are closer. This operation, called vector indexing, has various benefits, including faster search operation, similarity matching, and pattern identification. The indexing operation requires specialized algorithms that understand the data and create groups of matching elements.
+
+Vector indexing also benefits Retrieval-Augmented Generation (RAG), allowing LLMs to sift extensive data and find relevant vectors efficiently. This is important since efficiency is key to modern AI application development. Moreover, it also allows LLMs to filter information similar to the user’s query so that the model can present all relevant information. Imagine an LLM is asked about the benefits of apples. The model will retrieve information from a vector database, and since it recognizes an apple as a fruit, it can also query information regarding bananas and oranges and construct a more information-rich response.
+
+The similarity search uses an Approximate Nearest Neighbour (ANN) algorithm to find matching data vectors. It calculates how close or distant vectors are using distance metrics like Euclidean distance or Jacobian distance. Closer points are grouped together and referenced in relevant queries.
+
+### Types of Indexing
+Let’s discuss some popular indexes involved in vector indexing.
+
+#### Flat Indexing
+Flat indexing is the most basic type of indexing, as it stores vectors in the database as they are. No special operation is performed for modification or optimization. When a query vector arrives, it is compared with every other vector within the database to generate a similarity score. Due to this, flat indexing is also considered a brute-force approach. Once the similarity scores are calculated, the top k closest matches are retrieved.
+
+Flat Indexing is a very precise algorithm, as it retrieves vectors with great accuracy. However, it is computationally expensive and not ideal in cases where the database consists of millions of records or more.

⬇️ Suggested change

-Flat Indexing is a very precise algorithm, as it retrieves vectors with great accuracy. However, it is computationally expensive and not ideal in cases where the database consists of millions of records or more.
+Flat Indexing is a very precise algorithm, as it retrieves vectors with perfect accuracy. However, it is computationally expensive and not ideal in cases where the database consists of millions of records or more.
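For reference, the flat indexing described in this passage amounts to a brute-force top-k scan over every stored vector. A minimal Python sketch, assuming NumPy; the function name and sizes are illustrative, not from the article:

```python
import numpy as np

def flat_search(database: np.ndarray, query: np.ndarray, k: int = 3) -> np.ndarray:
    """Brute force: compare the query against every stored vector, return top-k indices."""
    distances = np.linalg.norm(database - query, axis=1)  # Euclidean distance to each row
    return np.argsort(distances)[:k]                      # indices of the k closest vectors

database = np.random.rand(1000, 3)  # 1,000 toy 3-dimensional vectors
query = np.random.rand(3)
print(flat_search(database, query))
```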


In docs/articles/Vector-Indexes.md https://github.com/superlinked/VectorHub/pull/422#discussion_r1680806111 :

+
+Vector indexing also benefits Retrieval-Augmented Generation (RAG), allowing LLMs to sift extensive data and find relevant vectors efficiently. This is important since efficiency is key to modern AI application development. Moreover, it also allows LLMs to filter information similar to the user’s query so that the model can present all relevant information. Imagine an LLM is asked about the benefits of apples. The model will retrieve information from a vector database, and since it recognizes an apple as a fruit, it can also query information regarding bananas and oranges and construct a more information-rich response.
+
+The similarity search uses an Approximate Nearest Neighbour (ANN) algorithm to find matching data vectors. It calculates how close or distant vectors are using distance metrics like Euclidean distance or Jacobian distance. Closer points are grouped together and referenced in relevant queries.
+
+### Types of Indexing
+Let’s discuss some popular indexes involved in vector indexing.
+
+#### Flat Indexing
+Flat indexing is the most basic type of indexing, as it stores vectors in the database as they are. No special operation is performed for modification or optimization. When a query vector arrives, it is compared with every other vector within the database to generate a similarity score. Due to this, flat indexing is also considered a brute-force approach. Once the similarity scores are calculated, the top k closest matches are retrieved.
+
+Flat Indexing is a very precise algorithm, as it retrieves vectors with great accuracy. However, it is computationally expensive and not ideal in cases where the database consists of millions of records or more.
+
+#### Locality Sensitive Hashing (LSH)
+LSH optimizes the vector search operation by dividing the database elements into buckets. Here, we first use a hashing function to compute hashes for each vector element. Then, based on the hash’s similarity, the vectors are grouped together into buckets. Intuitively, each bucket contains a matching vector.
+Now, when a query vector appears, it is first hashed using the same function. Then, based on a hash, it is assigned a bucket containing all its similar vectors. The query vector now only needs to be compared with the bucket vectors, reducing the search space and dramatically improving the efficiency.

⬇️ Suggested change

-Now, when a query vector appears, it is first hashed using the same function. Then, based on a hash, it is assigned a bucket containing all its similar vectors. The query vector now only needs to be compared with the bucket vectors, reducing the search space and dramatically improving the efficiency.
+Now, when a query vector appears, it is first hashed using the same function. Then, based on a hash, it is assigned a bucket containing all its similar vectors. The query vector now only needs to be compared with the vectors in the assigned bucket, reducing the search space and dramatically improving the efficiency.
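For reference, the hash-then-bucket flow described here can be sketched with random-hyperplane hashing, one common LSH scheme. A minimal Python sketch; the hashing function and all sizes are illustrative assumptions:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(42)
dim, n_planes = 3, 4
planes = rng.normal(size=(n_planes, dim))  # random hyperplanes define the hash

def lsh_hash(vector: np.ndarray) -> str:
    """One bit per hyperplane: which side of the plane the vector lies on."""
    return ''.join('1' if d > 0 else '0' for d in planes @ vector)

# Indexing: group database vectors into buckets by their hash.
database = rng.random((1000, dim))
buckets = defaultdict(list)
for i, vec in enumerate(database):
    buckets[lsh_hash(vec)].append(i)

# Querying: hash with the same function, then compare only within that bucket.
query = rng.random(dim)
candidates = buckets[lsh_hash(query)]
print(f"search space reduced from {len(database)} to {len(candidates)} vectors")
```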


In docs/articles/Vector-Indexes.md https://github.com/superlinked/VectorHub/pull/422#discussion_r1680809302 :

+Flat indexing is the most basic type of indexing, as it stores vectors in the database as they are. No special operation is performed for modification or optimization. When a query vector arrives, it is compared with every other vector within the database to generate a similarity score. Due to this, flat indexing is also considered a brute-force approach. Once the similarity scores are calculated, the top k closest matches are retrieved.
+
+Flat Indexing is a very precise algorithm, as it retrieves vectors with great accuracy. However, it is computationally expensive and not ideal in cases where the database consists of millions of records or more.
+
+#### Locality Sensitive Hashing (LSH)
+LSH optimizes the vector search operation by dividing the database elements into buckets. Here, we first use a hashing function to compute hashes for each vector element. Then, based on the hash’s similarity, the vectors are grouped together into buckets. Intuitively, each bucket contains a matching vector.
+Now, when a query vector appears, it is first hashed using the same function. Then, based on a hash, it is assigned a bucket containing all its similar vectors. The query vector now only needs to be compared with the bucket vectors, reducing the search space and dramatically improving the efficiency.
+
+#### Inverted File (IVF)
+IVF works similarly to LSH and creates groups of data. But rather than hashing it, it uses clustering techniques. The techniques can vary depending on the implementation, but simpler techniques may use K-means clustering and then use the cluster centroids as a reference for query vectors. The query vector is then compared with only its associated data cluster to improve efficiency. IVF has a few variations, each improving the storage and retrieval efficiency.
+
+##### IVFFLAT
+The flat variation clusters the vectors and creates centroids but stores each vector as it is. It performs no additional processing beyond the clustering, and a brute-force approach is required to search through any given cluster.
+
+##### IVFPQ
+Inverted File Product Quantization (IVFPQ) improves the vector storage cost by quantizing the data. It begins by clustering the data points, but before storing the vectors, it quantizes the data. Each vector in a cluster is divided into sub-vectors, and each sub-vector is encoded into bits using product quantization.

I think the last sentence needs to be clarified - I don't understand it. How are sub-vectors created?
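For context on this question: in product quantization, sub-vectors are typically created by splitting each d-dimensional vector into m equal-length chunks, and each chunk is then encoded against its own codebook. A minimal Python sketch, with illustrative sizes and random codebooks standing in for ones learned via k-means:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 8, 4                    # full dimension, number of sub-vectors
sub_dim = d // m               # each sub-vector gets d/m dimensions

vector = rng.random(d)
sub_vectors = vector.reshape(m, sub_dim)  # split into m contiguous chunks

# Hypothetical codebooks: k centroids per sub-space (learned with k-means in practice).
k = 256
codebooks = rng.random((m, k, sub_dim))

# Encoding: replace each sub-vector with the index of its nearest centroid.
codes = np.array([
    np.argmin(np.linalg.norm(codebooks[j] - sub_vectors[j], axis=1))
    for j in range(m)
])
print(codes)  # m one-byte codes instead of d floats
```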

In docs/articles/Vector-Indexes.md https://github.com/superlinked/VectorHub/pull/422#discussion_r1680809818 :

+Flat Indexing is a very precise algorithm, as it retrieves vectors with great accuracy. However, it is computationally expensive and not ideal in cases where the database consists of millions of records or more.
+
+#### Locality Sensitive Hashing (LSH)
+LSH optimizes the vector search operation by dividing the database elements into buckets. Here, we first use a hashing function to compute hashes for each vector element. Then, based on the hash’s similarity, the vectors are grouped together into buckets. Intuitively, each bucket contains a matching vector.
+Now, when a query vector appears, it is first hashed using the same function. Then, based on a hash, it is assigned a bucket containing all its similar vectors. The query vector now only needs to be compared with the bucket vectors, reducing the search space and dramatically improving the efficiency.
+
+#### Inverted File (IVF)
+IVF works similarly to LSH and creates groups of data. But rather than hashing it, it uses clustering techniques. The techniques can vary depending on the implementation, but simpler techniques may use K-means clustering and then use the cluster centroids as a reference for query vectors. The query vector is then compared with only its associated data cluster to improve efficiency. IVF has a few variations, each improving the storage and retrieval efficiency.
+
+##### IVFFLAT
+The flat variation clusters the vectors and creates centroids but stores each vector as it is. It performs no additional processing beyond the clustering, and a brute-force approach is required to search through any given cluster.
+
+##### IVFPQ
+Inverted File Product Quantization (IVFPQ) improves the vector storage cost by quantizing the data. It begins by clustering the data points, but before storing the vectors, it quantizes the data. Each vector in a cluster is divided into sub-vectors, and each sub-vector is encoded into bits using product quantization.
+
+At search time, the query vector is first associated with a cluster. Then, it is quantified, and the encoded sub-vectors are compared with the encoded data within the cluster. This storage method has various benefits, such as it saves on space, and since we’re comparing smaller vectors, it improves vector search times.

⬇️ Suggested change

-At search time, the query vector is first associated with a cluster. Then, it is quantified, and the encoded sub-vectors are compared with the encoded data within the cluster. This storage method has various benefits, such as it saves on space, and since we’re comparing smaller vectors, it improves vector search times.
+At search time, the query vector is first associated with a cluster. Then, it is quantized, and the encoded sub-vectors are compared with the encoded data within the cluster. This storage method has various benefits, such as it saves on space, and since we’re comparing smaller vectors, it improves vector search times.
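For reference, the cluster-then-search flow described in this section (IVFFLAT-style) can be sketched with k-means. This assumes scikit-learn is available; cluster counts and sizes are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
database = rng.random((1000, 3))

# Index phase: cluster the database; centroids act as routing references.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(database)

# Query phase: route the query to its nearest centroid's cluster only.
query = rng.random((1, 3))
cluster_id = kmeans.predict(query)[0]
member_ids = np.where(kmeans.labels_ == cluster_id)[0]

# IVFFLAT behaviour: brute-force search, but only inside the assigned cluster.
distances = np.linalg.norm(database[member_ids] - query, axis=1)
print(member_ids[np.argsort(distances)[:3]])  # top-3 matches within the cluster
```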


In docs/articles/Vector-Indexes.md https://github.com/superlinked/VectorHub/pull/422#discussion_r1680816015 :

+
+SPANN improves upon DiskANN by introducing a memory-disk hybrid approach designed for billion-scale datasets. The approach clusters the data and stores the centroids in memory, but data clusters (or pointing lists) on disk.
+
+Since it involves read-write operations with the disk, these can also become a bottleneck. The authors use a few techniques to reduce disk read operations and storage costs. They use a balanced hierarchical clustering algorithm to ensure that posting lists are balanced in size. They also implement dynamic pruning to select clusters that match data points smartly. It first selects the top k closest clusters based on the distance between the centroids and the query. From the selected vectors, a relevance score between the pointing lists and the query is calculated to filter the search further and only retrieve the most relevant data points.
+
+#### Hierarchical Navigable Small Worlds (HNSW)
+HNSW is a complicated but one of the most popular and efficient techniques for vector indexing. It combines proximity graphs and skip lists to divide the vector search space into layers. The lowest layer of the graph represents every vector as a vertex. All the vertices are connected based on their distance. Vertices are also connected across different layers depending on the position of other vectors in that layer. As we move up the multi-layer graph, the data points are grouped together, leaving fewer vertices at each level (similar to a skip list).
+
+During the search, the query vector is placed on the highest layer, which is matched with the closest vector using an ANN algorithm. We then move to the next layer based on the earlier selected closest neighbor and repeat the same process. This is continued until the final layer is reached.
+
+## Vector Indexes in Practical Use Cases
+The similarity-matching capabilities of vector databases are used in various interesting applications. Some of these include:
+
+- Retrieval Augmented Generation (RAG): RAG uses vector indexing techniques to query relevant documents from an external database. These allow the LLM to construct a well-thought-out, accurate, and informative response for the user.
+
+- Image search using text queries: Vector indexing powers semantic search in image databases such as those in modern smartphones. This allows the user to input text queries and return images described by natural language.

⬇️ Suggested change

-- Image search using text queries: Vector indexing powers semantic search in image databases such as those in modern smartphones. This allows the user to input text queries and return images described by natural language.
+- Image search using text queries: Vector indexing powers semantic search in image databases such as those in modern smartphones. This allows the user to input text queries and return images described by natural language.
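For reference, the HNSW index discussed earlier in this quoted section is available off the shelf. A usage sketch with the hnswlib package, assuming it is installed, with untuned, illustrative parameters:

```python
import numpy as np
import hnswlib  # pip install hnswlib

dim, num_elements = 3, 1000
data = np.random.rand(num_elements, dim).astype(np.float32)

# Build the multi-layer graph index; M and ef_construction control graph density.
index = hnswlib.Index(space='l2', dim=dim)  # 'l2' = Euclidean distance
index.init_index(max_elements=num_elements, ef_construction=200, M=16)
index.add_items(data)

# Search: ef trades recall for speed at query time.
index.set_ef(50)
labels, distances = index.knn_query(np.random.rand(1, dim).astype(np.float32), k=3)
print(labels, distances)
```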


In docs/articles/Vector-Indexes.md https://github.com/superlinked/VectorHub/pull/422#discussion_r1680817456 :

+
+```bash
+Vector 1: [0.80981775 0.1963886 0.02965684]
+Vector 2: [0.56029493 0.62403894 0.56696611]
+Vector 3: [0.47052236 0.1559688 0.87332513]
+Vector 4: [0.80995196 0.00508334 0.10529516]
+Vector 5: [0.55410133 0.96722797 0.96957061]
+Vector 6: [0.20098567 0.0527692 0.65003235]
+Vector 7: [0.34715575 0.41244063 0.72056698]
+Vector 8: [0.30936325 0.47881762 0.75795186]
+Vector 9: [0.67403853 0.23895253 0.87895722]
+Vector 10: [0.8207672 0.21424442 0.20621931]
+```
+
+#### Distance Calculation
+Now we need to function to calculate the distance between the different vectors. We will implement the Euclidean Distance to approximate the relation between the vectors.

⬇️ Suggested change

-Now we need to function to calculate the distance between the different vectors. We will implement the Euclidean Distance to approximate the relation between the vectors.
+Now we need a function to calculate the distance between the different vectors. We will implement the Euclidean Distance to approximate the relation between the vectors.
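For reference, the Euclidean distance helper the article goes on to implement might look like this; a minimal sketch reusing two of the sample vectors printed above (the function name is illustrative):

```python
import numpy as np

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Straight-line distance: square root of the summed squared differences."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

vector_1 = np.array([0.80981775, 0.1963886, 0.02965684])
vector_2 = np.array([0.56029493, 0.62403894, 0.56696611])
print(euclidean_distance(vector_1, vector_2))  # ~0.73
```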


In docs/articles/Vector-Indexes.md https://github.com/superlinked/VectorHub/pull/422#discussion_r1680810312 :

+Flat Indexing is a very precise algorithm, as it retrieves vectors with great accuracy. However, it is computationally expensive and not ideal in cases where the database consists of millions of records or more.
+
+#### Locality Sensitive Hashing (LSH)
+LSH optimizes the vector search operation by dividing the database elements into buckets. Here, we first use a hashing function to compute hashes for each vector element. Then, based on the hash’s similarity, the vectors are grouped together into buckets. Intuitively, each bucket contains a matching vector.
+Now, when a query vector appears, it is first hashed using the same function. Then, based on a hash, it is assigned a bucket containing all its similar vectors. The query vector now only needs to be compared with the bucket vectors, reducing the search space and dramatically improving the efficiency.
+
+#### Inverted File (IVF)
+IVF works similarly to LSH and creates groups of data. But rather than hashing it, it uses clustering techniques. The techniques can vary depending on the implementation, but simpler techniques may use K-means clustering and then use the cluster centroids as a reference for query vectors. The query vector is then compared with only its associated data cluster to improve efficiency. IVF has a few variations, each improving the storage and retrieval efficiency.
+
+##### IVFFLAT
+The flat variation clusters the vectors and creates centroids but stores each vector as it is. It performs no additional processing beyond the clustering, and a brute-force approach is required to search through any given cluster.
+
+##### IVFPQ
+Inverted File Product Quantization (IVFPQ) improves the vector storage cost by quantizing the data. It begins by clustering the data points, but before storing the vectors, it quantizes the data. Each vector in a cluster is divided into sub-vectors, and each sub-vector is encoded into bits using product quantization.
+
+At search time, the query vector is first associated with a cluster. Then, it is quantified, and the encoded sub-vectors are compared with the encoded data within the cluster. This storage method has various benefits, such as it saves on space, and since we’re comparing smaller vectors, it improves vector search times.

maybe I am wrong here, but is that what you meant?

if not, please clarify


haziqa5122 commented 1 month ago

Hello Mor,

I was about to email you to say that I have resolved the comments left on the PR, but I don't see any code formatting issues. Could you point out where the code preview is misaligned?

Haziqa

On Sun, Jul 21, 2024, 6:30 PM Mór Kapronczay wrote:

Mór Kapronczay approved this pull request.

Approved with the changes, handing it over to Robert for style review.


robertdhayanturner commented 2 weeks ago

@haziqa5122 Basically done editing your article - nice work, btw. One pending question for you, line 96. Pls take a look thru the whole article and let me know if everything looks okay. If you have changes you want to make, pls just comment them, and I'll make the changes in the text myself (it's faster to do it this way for everyone). Thanks! :)

robertdhayanturner commented 1 week ago

@haziqa5122 okayed the whole article via LinkedIn to @robertdhayanturner. Awaiting the table for line 96 from @haziqa5122; happy to do this myself in a couple of days if necessary.