irthomasthomas / undecidability

1 stars 0 forks source link

Fast Vector Similarity Library: Efficient Computation for Data Analysis and Machine Learning #801

Open ShellLM opened 1 month ago

ShellLM commented 1 month ago

Fast Vector Similarity Library: Efficient Computation for Data Analysis and Machine Learning

Introduction

The Fast Vector Similarity Library is designed to provide efficient computation of various similarity measures between vectors. It is suitable for tasks such as data analysis, machine learning, and statistics, where measuring the relationship between vectors is essential. Written in Rust, the library offers a high-performance solution and can be easily integrated with Python through provided bindings.

Features

Similarity Measures

The library implements several popular similarity measures, including:

Bootstrapping Technique

The library supports bootstrapping for robust similarity computation. This technique involves randomly resampling the dataset to estimate the distribution of the similarity measures, providing more confidence in the results.

Performance Optimizations

To achieve high efficiency, the library leverages:

Python Bindings

The library includes Python bindings to enable seamless integration with Python code. Functions py_compute_vector_similarity_stats and py_compute_bootstrapped_similarity_stats are exposed, allowing you to compute similarity statistics and bootstrapped similarity statistics, respectively.

Installation

Rust

Include the library in your Rust project by adding it to your Cargo.toml file.

Python

You can install the Python bindings for the Fast Vector Similarity Library directly from PyPI using the following command:

pip install fast_vector_similarity

This command will download and install the package, making it available for use in your Python projects.

Use with Text Embedding Vectors from LLMs

The Python demo code below introduces additional functionality that allows the Fast Vector Similarity Library to easily work with text embedding vectors from Language Models like LLMs (e.g., Llama2), such as those generated by my other library, the Llama2 Embeddings FastAPI Service (particularly, the output of the /get_all_embedding_vectors_for_document endpoint):

1. Converting Text Embeddings to Pandas DataFrame

The function convert_embedding_json_to_pandas_df reads text embeddings from a JSON file and converts them into a Pandas DataFrame. Each embedding is associated with a specific text (e.g., a sentence from Shakespeare's Sonnets), and the DataFrame structure facilitates manipulation and analysis of these embeddings.

2. Applying Fast Vector Similarity to Text Embeddings

The function apply_fvs_to_vector takes a row of embeddings and a query embedding, applies the chosen similarity measures, and returns the results as a JSON object. This function leverages the core similarity computation functionality provided by the Fast Vector Similarity Library, allowing it to be used directly on text embeddings.

3. Comparing Embeddings with Large Datasets

The main section of the code includes an example where text embeddings from Llama2 are loaded and analyzed. This part of the code demonstrates how to:

4. Compatibility with Higher-Dimensional Embeddings

The example works with 4096-dimensional vectors, showcasing the library's ability to handle high-dimensional data typical of modern language models. This compatibility ensures that the library can be used with a wide range of language models and text embedding techniques. In fact, the library can easily handle even larger dimensions.

Example Python Code

import time
import numpy as np
import json
import pandas as pd
from random import choice
import fast_vector_similarity as fvs

def convert_embedding_json_to_pandas_df(file_path):
    # Read the JSON file
    with open(file_path, 'r') as file:
        data = json.load(file)
    # Extract the text and embeddings
    texts = [item['text'] for item in data]
    embeddings = [item['embedding'] for item in data]
    # Determine the total number of vectors and the dimensions of each vector
    total_vectors = len(embeddings)
    vector_dimensions = len(embeddings[0]) if total_vectors > 0 else 0
    # Print the total number of vectors and dimensions
    print(f"Total number of vectors: {total_vectors}")
    print(f"Dimensions of each vector: {vector_dimensions}")
    # Convert the embeddings into a DataFrame
    df = pd.DataFrame(embeddings, index=texts)
    return df

def apply_fvs_to_vector(row_embedding, query_embedding):
    params = {
        "vector_1": query_embedding.tolist(),
        "vector_2": row_embedding.tolist(),
        "similarity_measure": "all"
    }
    similarity_stats_str = fvs.py_compute_vector_similarity_stats(json.dumps(params))
    return json.loads(similarity_stats_str)

def main():
    length_of_test_vectors = 15000
    print(f"Generating 2 test vectors of length {length_of_test_vectors}...")
    vector_1 = np.linspace(0., length_of_test_vectors - 1, length_of_test_vectors)
    vector_2 = vector_1 ** 0.2 + np.random.rand(length_of_test_vectors)
    print("Generated vector_1 using linear spacing and vector_2 using vector_1 with a power of 0.2 and some random noise.\n")

    similarity_measure = "all" # Or specify a particular measure
    params = {
        "vector_1": vector_1.tolist(),
        "vector_2": vector_2.tolist(),
        "similarity_measure": similarity_measure
    }

    # Time the exact similarity calculation
    print("Computing Exact Similarity Measures...")
    start_time_exact = time.time()
    similarity_stats_str = fvs.py_compute_vector_similarity_stats(json.dumps(params))
    similarity_stats_json = json.loads(similarity_stats_str)
    elapsed_time_exact = time.time() - start_time_exact
    print(f"Time taken for exact calculation: {elapsed_time_exact:.5f} seconds")

    # Print results
    print("_______________________________________________________________________________________________________________________________________________\n")
    print("Spearman's rho:", similarity_stats_json["spearman_rho"])
    print("Kendall's tau:", similarity_stats_json["kendall_tau"])
    print("Distance Correlation:", similarity_stats_json["approximate_distance_correlation"])
    print("Jensen-Shannon Similarity:", similarity_stats_json["jensen_shannon_similarity"])
    print("Hoeffding's D:", similarity_stats_json["hoeffding_d"])
    print("_______________________________________________________________________________________________________________________________________________\n")

    # Bootstrapped calculations
    number_of_bootstraps = 2000
    n = 15
    sample_size = int(length_of_test_vectors / n)

    print(f"Computing Bootstrapped Similarity Measures with {number_of_bootstraps} bootstraps and a sample size of {sample_size}...")
    start_time_bootstrapped = time.time()
    params_bootstrapped = {
        "x": vector_1.tolist(),
        "y": vector_2.tolist(),
        "sample_size": sample_size,
        "number_of_bootstraps": number_of_bootstraps,
        "similarity_measure": similarity_measure
    }
    bootstrapped_similarity_stats_str = fvs.py_compute_bootstrapped_similarity_stats(json.dumps(params_bootstrapped))
    bootstrapped_similarity_stats_json = json.loads(bootstrapped_similarity_stats_str)
    elapsed_time_bootstrapped = time.time() - start_time_bootstrapped
    print(f"Time taken for bootstrapped calculation: {elapsed_time_bootstrapped:.5f} seconds")

    time_difference = abs(elapsed_time_exact - elapsed_time_bootstrapped)
    print(f"Time difference between exact and robust bootstrapped calculations: {time_difference:.5f} seconds")

    # Print bootstrapped results
    print("_______________________________________________________________________________________________________________________________________________\n")
    print("Number of Bootstrap Iterations:", bootstrapped_similarity_stats_json["number_of_bootstraps"])
    print("Bootstrap Sample Size:", bootstrapped_similarity_stats_json["sample_size"])
    print("\nRobust Spearman's rho:", bootstrapped_similarity_stats_json["spearman_rho"])
    print("Robust Kendall's tau:", bootstrapped_similarity_stats_json["kendall_tau"])
    print("Robust Distance Correlation:", bootstrapped_similarity_stats_json["approximate_distance_correlation"])
    print("Robust Jensen-Shannon Similarity:", bootstrapped_similarity_stats_json["jensen_shannon_similarity"])
    print("Robust Hoeffding's D:", bootstrapped_similarity_stats_json["hoeffding_d"])
    print("_______________________________________________________________________________________________________________________________________________\n")

    # Compute the differences between exact and bootstrapped results
    measures = ["spearman_rho", "kendall_tau", "approximate_distance_correlation", "jensen_shannon_similarity", "hoeffding_d"]
    for measure in measures:
        exact_value = similarity_stats_json[measure]
        bootstrapped_value = bootstrapped_similarity_stats_json[measure]
        absolute_difference = abs(exact_value - bootstrapped_value)
        percentage_difference = (absolute_difference / exact_value) * 100

        print(f"\nDifference between exact and bootstrapped {measure}: {absolute_difference}")
        print(f"Difference as % of the exact value: {percentage_difference:.2f}%")

    print("Now testing with a larger dataset, using sentence embedddings from Llama2 (4096-dimensional vectors) on some Shakespeare Sonnets...")
    # Load the embeddings into a DataFrame
    input_file_path = "sample_input_files/Shakespeare_Sonnets_small.json"
    embeddings_df = convert_embedding_json_to_pandas_df(input_file_path)

    # Select a random row for the query embedding
    query_embedding_index = choice(embeddings_df.index)
    query_embedding = embeddings_df.loc[query_embedding_index]
    print(f"Selected query embedding for sentence: `{query_embedding_index}`")

    # Remove the selected row from the DataFrame
    embeddings_df = embeddings_df.drop(index=query_embedding_index)

    # Apply the function to each row of embeddings_df
    json_outputs = embeddings_df.apply(lambda row: apply_fvs_to_vector(row, query_embedding), axis=1)

    # Create a DataFrame from the list of JSON outputs
    vector_similarity_results_df = pd.DataFrame.from_records(json_outputs)
    vector_similarity_results_df.index = embeddings_df.index

    # Add the required columns to the DataFrame
    columns = ["spearman_rho", "kendall_tau", "approximate_distance_correlation", "jensen_shannon_similarity", "hoeffding_d"]
    vector_similarity_results_df = vector_similarity_results_df[columns]

    # Sort the DataFrame by the hoeffding_d column in descending order
    vector_similarity_results_df = vector_similarity_results_df.sort_values(by="hoeffding_d", ascending=False)

    print("\nTop 10 most similar embedding results by Hoeffding's D:")
    print(vector_similarity_results_df.head(10))

Usage

In Rust

You can utilize the core functionality in Rust by calling functions like compute_vector_similarity_stats and compute_bootstrapped_similarity_stats with appropriate parameters.

In Python

Install the provided Python package and use the functions as demonstrated in the example above.

About the Different Similarity Measures

Each of these measures has unique properties and applicability, providing a comprehensive toolkit for understanding the relationships between variables in different contexts. By including both classical and more specialized measures, the library offers flexibility and depth in analyzing vector similarities.

Bootstrapping Technique for Robust Estimators

Bootstrapping is a powerful statistical method that allows us to estimate the distribution of a statistic by repeatedly resampling with replacement from the observed data. In the context of the Fast Vector Similarity Library, bootstrapping is employed to obtain robust estimators of similarity measures between vectors. Here's how it works:

Process

Why It's Useful

The bootstrapping technique offers several advantages:

The inclusion of the bootstrapping technique in the Fast Vector Similarity Library adds a layer of robustness and flexibility to the computation of similarity measures. It offers users a way to obtain more reliable and insightful estimates, accommodating various data characteristics and analytical needs. Whether dealing with potential outliers or seeking a more nuanced understanding of vector relationships, bootstrapping provides a valuable tool within the library's comprehensive suite of features.

For more information and to access the library, visit Fast Vector Similarity Library on GitHub.

Suggested labels

None