
Vector Database #22

Open sangheee opened 8 months ago

sangheee commented 8 months ago

What is a Vector Database?

Vector databases are a vital tool in the machine learning and AI landscape.

A vector database is a database that stores information as vectors, which are numerical representations of data objects, also known as vector embeddings.

A vector database organizes data through high-dimensional vectors. High-dimensional vectors contain hundreds of dimensions, and each dimension corresponds to a specific feature or property of the data object it represents.

Vector databases are designed to handle data where each entry is represented as a vector in a multi-dimensional space. The vectors can represent a wide range of information, such as numerical features, embeddings from text or images, and even complex data like molecular structures.
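As a toy illustration (with made-up numbers, not the output of any real embedding model), each entry is just a fixed-length array of floats, and semantically similar objects land near each other in that space:

```python
import numpy as np

# Hypothetical 4-dimensional embeddings (real models use hundreds of dimensions)
cat = np.array([0.9, 0.1, 0.8, 0.2])
kitten = np.array([0.85, 0.15, 0.75, 0.25])
car = np.array([0.1, 0.9, 0.05, 0.95])

def euclidean(a, b):
    # L2 distance between two vectors
    return float(np.linalg.norm(a - b))

# "cat" sits closer to "kitten" than to "car" in the embedding space
print(euclidean(cat, kitten) < euclidean(cat, car))  # True
```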


  1. Use an embedding model to create vector embeddings for the content you want to index.
  2. Insert these vector embeddings into the vector database, along with some reference to the original content.
  3. When the application issues a query, use the same embedding model to create an embedding for the query, then use that embedding to query the database for similar vector embeddings (each associated with its original content).
    • Cosine similarity is one of the most common similarity measures.
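The three steps above can be sketched end to end. Here a stand-in `embed` function (a hypothetical character hash, not a real embedding model) plays the role of the model, and cosine similarity reduces to a dot product because the vectors are normalized:

```python
import numpy as np

def embed(text):
    # Stand-in for a real embedding model: hash characters into a fixed-size vector,
    # then normalize so cosine similarity is a plain dot product
    vec = np.zeros(8)
    for i, ch in enumerate(text.lower()):
        vec[i % 8] += ord(ch)
    return vec / np.linalg.norm(vec)

# 1. Create vector embeddings for the content to index
docs = ["a film about space travel", "a romantic comedy", "a deep sea documentary"]
# 2. Insert each embedding together with a reference to the original content
index = [(embed(doc), doc) for doc in docs]

# 3. Embed the query with the same model, then rank by cosine similarity
def search(query, top_k=2):
    q = embed(query)
    scored = sorted(index, key=lambda pair: float(np.dot(q, pair[0])), reverse=True)
    return [doc for _, doc in scored[:top_k]]

print(search("space movie"))
```

A real vector database replaces the linear scan in `search` with an approximate nearest-neighbor index so it scales beyond a handful of documents.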

Vector Database Capabilities


Picking a vector database: a comparison and guide for 2023

| | Pinecone | Weaviate | Milvus | Qdrant | Chroma | Elasticsearch | PGvector |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Is open source | | | | | | | |
| Self-host | | | | | | | |
| Cloud management | | | | | | | (✔️) |
| Purpose-built for Vectors | | | | | | | |
| Developer experience | 👍👍👍 | 👍👍 | 👍👍 | 👍👍 | 👍👍 | 👍 | 👍 |
| Community | Community page & events | 8k☆ github, 4k slack | 23k☆ github, 4k slack | 13k☆ github, 3k discord | 9k☆ github, 6k discord | 23k slack | 6k☆ github |
| Queries per second (using text nytimes-256-angular) | 150 *for p2, but more pods can be added | 791 | 2406 | 326 | ? | 700-100 *from various reports | 141 |
| Latency, ms (Recall/Percentile 95 (millis), nytimes-256-angular) | 1 *batched search, 0.99 recall, 200k SBERT | 2 | 1 | 4 | ? | ? | 8 |
| Supported index types | ? | HNSW | Multiple (11 total) | HNSW | HNSW | HNSW | HNSW/IVFFlat |
| Hybrid Search (i.e. scalar filtering) | | | | | | | |
| Disk index support | | | | | | | |
| Role-based access control | | | | | | | |
| Dynamic segment placement vs. static data sharding | ? | Static sharding | Dynamic segment placement | Static sharding | Dynamic segment placement | Static sharding | - |
| Free hosted tier | | | | (free self-hosted) | (free self-hosted) | (free self-hosted) | (varies) |
| Pricing (50k vectors @1536) | $70 | fr. $25 | fr. $65 | est. $9 | Varies | $95 | Varies |
| Pricing (20M vectors, 20M req. @768) | $227 ($2074 for high performance) | $1536 | fr. $309 ($2291 for high performance) | fr. $281 ($820 for high performance) | Varies | est. $1225 | Varies |
sangheee commented 8 months ago

Milvus


Why Milvus?

sangheee commented 6 months ago

create-a-movie-recommendation-engine-with-milvus-and-python

Here's our record schema:

```python
from pymilvus import (
    Collection,
    CollectionSchema,
    DataType,
    FieldSchema,
    utility,
)

COLLECTION_NAME = 'film_vectors'
PARTITION_NAME = 'Movie'

# Record schema:
# "title": film title, "overview": description, "release_date": film release date,
# "genres": film genres, "embedding": embedding

id = FieldSchema(name='title', dtype=DataType.VARCHAR, max_length=500, is_primary=True)
field = FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, dim=384)
schema = CollectionSchema(fields=[id, field], description="movie recommender: film vectors", enable_dynamic_field=True)

# drop the same collection if it was created before
if utility.has_collection(COLLECTION_NAME):
    collection = Collection(COLLECTION_NAME)
    collection.drop()

collection = Collection(name=COLLECTION_NAME, schema=schema)
print("Collection created.")

index_params = {
    "index_type": "IVF_FLAT",
    "metric_type": "L2",
    "params": {"nlist": 128},
}
collection.create_index(field_name="embedding", index_params=index_params)
collection.load()
print("Collection indexed!")
```
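An IVF_FLAT index partitions the vectors into `nlist` clusters; at query time only the `nprobe` nearest clusters are scanned. A minimal sketch of that idea with numpy (a toy illustration of the concept, not Milvus internals):

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 8))

# toy "training": pick nlist data points as cluster centroids
nlist = 16
centroids = vectors[rng.choice(len(vectors), nlist, replace=False)]

# assign every vector to its nearest centroid (the "inverted lists")
assignments = np.argmin(((vectors[:, None] - centroids[None]) ** 2).sum(-1), axis=1)

def ivf_search(query, nprobe=4):
    # scan only the nprobe closest clusters instead of all vectors
    dist_to_centroids = ((centroids - query) ** 2).sum(-1)
    probe = np.argsort(dist_to_centroids)[:nprobe]
    candidates = np.where(np.isin(assignments, probe))[0]
    best = candidates[np.argmin(((vectors[candidates] - query) ** 2).sum(-1))]
    return int(best)

q = rng.normal(size=8)
print(ivf_search(q))
```

Raising `nprobe` trades speed for recall; probing all `nlist` clusters degenerates to an exhaustive search.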


Create a function that generates the embeddings, using the _SentenceTransformer_ for the text.
```python
from sentence_transformers import SentenceTransformer
import ast 

# function to extract the text from the genres column:
# parse the stringified list and join the genre names
def build_genres(data):
    entries = ast.literal_eval(data['genres'])
    return ", ".join(entry["name"] for entry in entries)

# create an object of SentenceTransformer
transformer = SentenceTransformer('all-MiniLM-L6-v2') 

# function to generate embeddings
def embed_movie(data):
    embed = "{} Released on {}. Genres are {}.".format(data["overview"], data["release_date"], build_genres(data))
    # the encode() method to generate the embeddings using the overview, release_date and genre features.
    embeddings = transformer.encode(embed)
    return embeddings
```

The dataset is too large to send to Milvus in a single insert statement, but sending rows one at a time would create unnecessary network traffic and add too much time. Instead, create batches of data to send to Milvus. (You can tune the batch size to suit your individual needs and preferences.)

```python
from pprint import pprint

# Loop counter for batching and showing progress
j = 0
batch = []

for movie_dict in movies_dict:
    try:
        movie_dict["embedding"] = embed_movie(movie_dict)
        batch.append(movie_dict)
        j += 1
        if j % 5 == 0:
            print("Embedded {} records".format(j))
            collection.insert(batch)
            print("Batch insert completed")
            batch = []
    except Exception as e:
        print("Error inserting record {}".format(e))
        pprint(batch)
        break

# insert the remaining partial batch
collection.insert(batch)
print("Final batch completed")
print("Finished with {} embeddings".format(j))
```

Recommend New Movies Using Milvus

```python
# load the collection into memory before search
collection.load()

# Set search parameters
topK = 5
# metric_type L2 (squared Euclidean) calculates the distance between two vectors;
# nprobe indicates the number of cluster units to search
SEARCH_PARAM = {
    "metric_type": "L2",
    "params": {"nprobe": 20},
}

# convert search string to embeddings
def embed_search(search_string):
    return transformer.encode(search_string)

# search similar embeddings for the user's query
def search_for_movies(search_string):
    user_vector = embed_search(search_string)
    return collection.search(
        [user_vector],
        "embedding",
        param=SEARCH_PARAM,
        limit=topK,
        expr=None,
        output_fields=['title', 'overview'],
    )
```
```python
search_string = "A comedy from the 1990s set in a hospital. The main characters are in their 20s and are trying to stop a vampire."
results = search_for_movies(search_string)

# check results
for hits in iter(results):
    for hit in hits:
        print(hit.entity.get('title'))
        print(hit.entity.get('overview'))
        print("-------------------------------")
```