Open sangheee opened 8 months ago
Create a Movie Recommendation Engine with Milvus and Python
```mermaid
flowchart LR
A[Text] -->|Sentence Transformers| B(Vectors)
B -->|Store| C[(Milvus)]
```
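The flow above can be sketched end to end in a few lines. This is only an illustration: `toy_encode` is a made-up stand-in for the SentenceTransformer model, and a plain Python list stands in for the Milvus collection.

```python
# Pipeline sketch: text -> embedding -> store.
def toy_encode(text):
    # Stand-in for SentenceTransformer('all-MiniLM-L6-v2').encode(text);
    # maps text to a tiny 2-d vector instead of a 384-d one.
    return [len(text) / 100.0, text.count(" ") / 10.0]

vector_store = []  # stand-in for a Milvus collection
for overview in ["A heist goes wrong.", "Two robots fall in love."]:
    vector_store.append({"text": overview, "embedding": toy_encode(overview)})

print(len(vector_store))  # → 2
```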
Calculate the embeddings for the text data. First, create a collection object that will store the movie ID and the embeddings for the text data. Also create an index on the embedding field to make searches more efficient:
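The snippets below assume an open connection to a Milvus server. A minimal sketch of that setup step, assuming a local instance on the default port (the host and port here are assumptions; adjust them for your deployment):

```python
from pymilvus import connections

# Assumes a Milvus instance is running locally on the default port.
connections.connect(alias="default", host="localhost", port="19530")
```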
```python
from pymilvus import (
    Collection,
    CollectionSchema,
    DataType,
    FieldSchema,
    utility,
)

COLLECTION_NAME = 'film_vectors'
PARTITION_NAME = 'Movie'

"""
"title": film title
"overview": description
"release_date": film release date
"genres": film genres
"embedding": embedding
"""

id = FieldSchema(name='title', dtype=DataType.VARCHAR, max_length=500, is_primary=True)
field = FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, dim=384)
schema = CollectionSchema(
    fields=[id, field],
    description="movie recommender: film vectors",
    enable_dynamic_field=True,
)

# Drop any collection of the same name created before
if utility.has_collection(COLLECTION_NAME):
    collection = Collection(COLLECTION_NAME)
    collection.drop()

collection = Collection(name=COLLECTION_NAME, schema=schema)
print("Collection created.")

# Index the embedding field to make searches more efficient
index_params = {
    "index_type": "IVF_FLAT",
    "metric_type": "L2",
    "params": {"nlist": 128},
}
collection.create_index(field_name="embedding", index_params=index_params)
collection.load()
print("Collection indexed!")
```
Next, create a function that generates embeddings for the text using `SentenceTransformer`:
```python
from sentence_transformers import SentenceTransformer
import ast

# Clean the genre column and extract a comma-separated genre string from it.
def build_genres(data):
    entries = ast.literal_eval(data['genres'])
    return ", ".join(entry["name"] for entry in entries)

# create an object of SentenceTransformer
transformer = SentenceTransformer('all-MiniLM-L6-v2')

# Generate an embedding from the overview, release_date, and genre features.
def embed_movie(data):
    embed = "{} Released on {}. Genres are {}.".format(
        data["overview"], data["release_date"], build_genres(data))
    return transformer.encode(embed)
```
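To see what the text fed into the encoder looks like, here is a quick check with a made-up TMDB-style row (the row contents and the stringified genre list are assumptions); `build_genres` is re-declared so the snippet is self-contained:

```python
import ast

def build_genres(data):
    entries = ast.literal_eval(data['genres'])
    return ", ".join(entry["name"] for entry in entries)

# Hypothetical row in the shape the tutorial's dataset uses.
row = {
    "overview": "Two interns battle a vampire on the night shift.",
    "release_date": "1994-06-10",
    "genres": "[{'id': 35, 'name': 'Comedy'}, {'id': 27, 'name': 'Horror'}]",
}
text = "{} Released on {}. Genres are {}.".format(
    row["overview"], row["release_date"], build_genres(row))
print(text)
# → Two interns battle a vampire on the night shift. Released on 1994-06-10. Genres are Comedy, Horror.
```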
The dataset is too large to send to Milvus in a single insert statement, but sending rows one at a time would create unnecessary network traffic and take too long. Instead, create batches of rows (e.g., 5,000) to send to Milvus. (You can tune the batch size to suit your individual needs and preferences.)
```python
from pprint import pprint

# Loop counter for batching and showing progress
j = 0
batch = []

for movie_dict in movies_dict:
    try:
        movie_dict["embedding"] = embed_movie(movie_dict)
        batch.append(movie_dict)
        j += 1
        if j % 5 == 0:
            print("Embedded {} records".format(j))
            collection.insert(batch)
            print("Batch insert completed")
            batch = []
    except Exception as e:
        print("Error inserting record {}".format(e))
        pprint(batch)
        break

# Insert any records left over in the final partial batch
if batch:
    collection.insert(batch)
print("Final batch completed")
print("Finished with {} embeddings".format(j))
```
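The pattern above (flush every N rows, then flush the remainder) can be isolated into a small generator; `batched` is a hypothetical helper name, not part of the tutorial:

```python
def batched(items, batch_size):
    """Yield successive batches of at most batch_size items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

rows = list(range(12))
sizes = [len(b) for b in batched(rows, 5)]
print(sizes)  # → [5, 5, 2]
```

Each yielded batch could then go straight into `collection.insert(batch)`.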
Recommend New Movies Using Milvus
```python
# Load the collection into memory before searching
collection.load()

# Set search parameters
topK = 5

# metric_type L2 (squared Euclidean) measures the distance between two vectors;
# nprobe sets the number of cluster units to search.
SEARCH_PARAM = {
    "metric_type": "L2",
    "params": {"nprobe": 20},
}

# convert the search string to an embedding
def embed_search(search_string):
    return transformer.encode(search_string)

# search for embeddings similar to the user's query
def search_for_movies(search_string):
    user_vector = embed_search(search_string)
    return collection.search(
        [user_vector],
        "embedding",
        param=SEARCH_PARAM,
        limit=topK,
        expr=None,
        output_fields=['title', 'overview'],
    )
```
- `embed_search()`: You need a transformer to convert the user's search string to an embedding. This function takes the viewer's criteria and passes them to the same transformer you used to populate Milvus.
- `search_for_movies()`: This function performs the actual vector search, using `embed_search()` for support.
```python
search_string = "A comedy from the 1990s set in a hospital. The main characters are in their 20s and are trying to stop a vampire."
results = search_for_movies(search_string)

# check results
for hits in iter(results):
    for hit in hits:
        print(hit.entity.get('title'))
        print(hit.entity.get('overview'))
        print("-------------------------------")
```
What is a Vector Database?
A simplified grid can capture the essence of how a vector database might be represented visually, even though real vector spaces have many more dimensions and use sophisticated techniques for search and retrieval.
Applications for which data processing is critical, such as large language models, generative AI, and semantic search, depend on vector embeddings.
Vectors are designed so that similar objects have embeddings close together in vector space, while dissimilar objects have embeddings far apart.
When you query a vector database, it finds the objects whose embeddings are closest in distance to the query embedding, i.e., the most similar objects.
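A toy two-dimensional example (made-up labels and vectors) shows the idea: the query's nearest stored embedding is the most similar object.

```python
# Similar objects sit close together in vector space,
# so a query finds its nearest stored embedding.
vectors = {
    "comedy": [0.9, 0.1],
    "horror": [0.1, 0.9],
    "documentary": [0.5, 0.5],
}

def l2(a, b):
    # squared Euclidean distance between two vectors
    return sum((x - y) ** 2 for x, y in zip(a, b))

query = [0.8, 0.2]
nearest = min(vectors, key=lambda name: l2(vectors[name], query))
print(nearest)  # → comedy
```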
Vector Database Capabilities
Picking a vector database: a comparison and guide for 2023