
Vector Database #22

Open sangheee opened 8 months ago

sangheee commented 8 months ago

What is a Vector Database?

Vector databases are a vital tool in the machine learning and AI landscape.

A vector database is a database that stores information as vectors, which are numerical representations of data objects, also known as vector embeddings.

A vector database organizes data through high-dimensional vectors. High-dimensional vectors contain hundreds of dimensions, and each dimension corresponds to a specific feature or property of the data object it represents.

Vector databases are designed to handle data where each entry is represented as a vector in a multi-dimensional space. The vectors can represent a wide range of information, such as numerical features, embeddings from text or images, and even complex data like molecular structures.
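As a toy illustration (with made-up numbers, not the output of any real embedding model), each entry is just a fixed-length array of floats, and semantically similar objects land near each other in that space:

```python
import numpy as np

# Hypothetical 4-dimensional embeddings (real models use hundreds of dimensions)
cat = np.array([0.9, 0.1, 0.8, 0.2])
kitten = np.array([0.85, 0.15, 0.75, 0.25])
car = np.array([0.1, 0.9, 0.05, 0.95])

def euclidean(a, b):
    # L2 distance between two vectors
    return float(np.linalg.norm(a - b))

# "cat" sits closer to "kitten" than to "car" in the embedding space
print(euclidean(cat, kitten) < euclidean(cat, car))  # True
```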


  1. Use an embedding model to create vector embeddings for the content you want to index.
  2. Insert these vector embeddings into the vector database, along with some reference to the original content.
  3. When the application issues a query, use the same embedding model to create an embedding for the query, then use that embedding to query the database for similar vector embeddings (each associated with its original content).
    • Cosine similarity is one of the most common similarity measures.
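The three steps above can be sketched end to end. Here a stand-in `embed` function (a hypothetical character hash, not a real embedding model) plays the role of the model, and cosine similarity reduces to a dot product because the vectors are normalized:

```python
import numpy as np

def embed(text):
    # Stand-in for a real embedding model: hash characters into a fixed-size vector,
    # then normalize so cosine similarity is a plain dot product
    vec = np.zeros(8)
    for i, ch in enumerate(text.lower()):
        vec[i % 8] += ord(ch)
    return vec / np.linalg.norm(vec)

# 1. Create vector embeddings for the content to index
docs = ["a film about space travel", "a romantic comedy", "a deep sea documentary"]
# 2. Insert each embedding together with a reference to the original content
index = [(embed(doc), doc) for doc in docs]

# 3. Embed the query with the same model, then rank by cosine similarity
def search(query, top_k=2):
    q = embed(query)
    scored = sorted(index, key=lambda pair: float(np.dot(q, pair[0])), reverse=True)
    return [doc for _, doc in scored[:top_k]]

print(search("space movie"))
```

A real vector database replaces the linear scan in `search` with an approximate nearest-neighbor index so it scales beyond a handful of documents.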

Vector Database Capabilities


Picking a vector database: a comparison and guide for 2023

| | Pinecone | Weaviate | Milvus | Qdrant | Chroma | Elasticsearch | PGvector |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Is open source | | | | | | | |
| Self-host | | | | | | | |
| Cloud management | | | | | | | (✔️) |
| Purpose-built for Vectors | | | | | | | |
| Developer experience | 👍👍👍 | 👍👍 | 👍👍 | 👍👍 | 👍👍 | 👍 | 👍 |
| Community | Community page & events | 8k☆ github, 4k slack | 23k☆ github, 4k slack | 13k☆ github, 3k discord | 9k☆ github, 6k discord | 23k slack | 6k☆ github |
| Queries per second (using text nytimes-256-angular) | 150 *for p2, but more pods can be added | 791 | 2406 | 326 | ? | 700-100 *from various reports | 141 |
| Latency, ms (Recall/Percentile 95 (millis), nytimes-256-angular) | 1 *batched search, 0.99 recall, 200k SBERT | 2 | 1 | 4 | ? | ? | 8 |
| Supported index types | ? | HNSW | Multiple (11 total) | HNSW | HNSW | HNSW | HNSW/IVFFlat |
| Hybrid Search (i.e. scalar filtering) | | | | | | | |
| Disk index support | | | | | | | |
| Role-based access control | | | | | | | |
| Dynamic segment placement vs. static data sharding | ? | Static sharding | Dynamic segment placement | Static sharding | Dynamic segment placement | Static sharding | - |
| Free hosted tier | | | | (free self-hosted) | (free self-hosted) | (free self-hosted) | (varies) |
| Pricing (50k vectors @1536) | $70 | fr. $25 | fr. $65 | est. $9 | Varies | $95 | Varies |
| Pricing (20M vectors, 20M req. @768) | $227 ($2074 for high performance) | $1536 | fr. $309 ($2291 for high performance) | fr. $281 ($820 for high performance) | Varies | est. $1225 | Varies |
sangheee commented 8 months ago

Milvus


Why Milvus?

sangheee commented 6 months ago

create-a-movie-recommendation-engine-with-milvus-and-python

Here's our record schema:

```python
from pymilvus import (
    Collection,
    CollectionSchema,
    DataType,
    FieldSchema,
    utility,
)

COLLECTION_NAME = 'film_vectors'
PARTITION_NAME = 'Movie'

# Record schema:
# "title": film title, "overview": description, "release_date": film release date,
# "genres": film genres, "embedding": embedding

id = FieldSchema(name='title', dtype=DataType.VARCHAR, max_length=500, is_primary=True)
field = FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, dim=384)
schema = CollectionSchema(fields=[id, field], description="movie recommender: film vectors", enable_dynamic_field=True)

# drop the same collection if it was created before
if utility.has_collection(COLLECTION_NAME):
    collection = Collection(COLLECTION_NAME)
    collection.drop()

collection = Collection(name=COLLECTION_NAME, schema=schema)
print("Collection created.")

index_params = {
    "index_type": "IVF_FLAT",
    "metric_type": "L2",
    "params": {"nlist": 128},
}
collection.create_index(field_name="embedding", index_params=index_params)
collection.load()
print("Collection indexed!")
```
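An IVF_FLAT index partitions the vectors into `nlist` clusters; at query time only the `nprobe` nearest clusters are scanned. A minimal sketch of that idea with numpy (a toy illustration of the concept, not Milvus internals):

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 8))

# toy "training": pick nlist data points as cluster centroids
nlist = 16
centroids = vectors[rng.choice(len(vectors), nlist, replace=False)]

# assign every vector to its nearest centroid (the "inverted lists")
assignments = np.argmin(((vectors[:, None] - centroids[None]) ** 2).sum(-1), axis=1)

def ivf_search(query, nprobe=4):
    # scan only the nprobe closest clusters instead of all vectors
    dist_to_centroids = ((centroids - query) ** 2).sum(-1)
    probe = np.argsort(dist_to_centroids)[:nprobe]
    candidates = np.where(np.isin(assignments, probe))[0]
    best = candidates[np.argmin(((vectors[candidates] - query) ** 2).sum(-1))]
    return int(best)

q = rng.normal(size=8)
print(ivf_search(q))
```

Raising `nprobe` trades speed for recall; probing all `nlist` clusters degenerates to an exhaustive search.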


Create a function that generates the embeddings, using the _SentenceTransformer_ for the text.
```python
from sentence_transformers import SentenceTransformer
import ast 

# function to extract the text from the genres column:
# parse the stringified list and join the genre names
def build_genres(data):
    entries = ast.literal_eval(data['genres'])
    return ", ".join(entry["name"] for entry in entries)

# create an object of SentenceTransformer
transformer = SentenceTransformer('all-MiniLM-L6-v2') 

# function to generate embeddings
def embed_movie(data):
    embed = "{} Released on {}. Genres are {}.".format(data["overview"], data["release_date"], build_genres(data))
    # the encode() method to generate the embeddings using the overview, release_date and genre features.
    embeddings = transformer.encode(embed)
    return embeddings
```

The dataset is too large to send to Milvus in a single insert statement, but sending rows one at a time would create unnecessary network traffic and add too much time. Instead, create batches of data to send to Milvus. (You can tune the batch size to suit your individual needs and preferences.)

```python
from pprint import pprint

# Loop counter for batching and showing progress
j = 0
batch = []

for movie_dict in movies_dict:
    try:
        movie_dict["embedding"] = embed_movie(movie_dict)
        batch.append(movie_dict)
        j += 1
        if j % 5 == 0:
            print("Embedded {} records".format(j))
            collection.insert(batch)
            print("Batch insert completed")
            batch = []
    except Exception as e:
        print("Error inserting record {}".format(e))
        pprint(batch)
        break

# insert the remaining partial batch
collection.insert(batch)
print("Final batch completed")
print("Finished with {} embeddings".format(j))
```

Recommend New Movies Using Milvus

```python
# load the collection into memory before search
collection.load()

# Set search parameters
topK = 5
# metric_type L2 (squared Euclidean) calculates the distance between two vectors;
# nprobe indicates the number of cluster units to search
SEARCH_PARAM = {
    "metric_type": "L2",
    "params": {"nprobe": 20},
}

# convert search string to embeddings
def embed_search(search_string):
    return transformer.encode(search_string)

# search similar embeddings for the user's query
def search_for_movies(search_string):
    user_vector = embed_search(search_string)
    return collection.search(
        [user_vector],
        "embedding",
        param=SEARCH_PARAM,
        limit=topK,
        expr=None,
        output_fields=['title', 'overview'],
    )
```
```python
search_string = "A comedy from the 1990s set in a hospital. The main characters are in their 20s and are trying to stop a vampire."
results = search_for_movies(search_string)

# check results
for hits in iter(results):
    for hit in hits:
        print(hit.entity.get('title'))
        print(hit.entity.get('overview'))
        print("-------------------------------")
```