gkamradt / SemanticDeduplicator

MIT License

Add Zero Vector Check in Cosine Similarity Calculation #29

Open ayush-vibrant opened 1 year ago

ayush-vibrant commented 1 year ago

While the cosine_similarity method works for most vectors, there is an edge case that can cause a division by zero: if either embedding is a zero vector, its norm is 0, and so is the denominator norm_1 * norm_2. (I agree it's a rare case, but a simple conditional handles it.)

Addressing this will enhance the stability of the cosine similarity calculation, especially for edge cases.

The updated code would look something like this:

def cosine_similarity(self, item, existing_item):
    # Calculate the dot product of the two vectors
    dot_product = np.dot(item.item_embedding, existing_item.item_embedding)

    # Calculate the norms of each vector
    norm_1 = np.linalg.norm(item.item_embedding)
    norm_2 = np.linalg.norm(existing_item.item_embedding)

    # Check if either norm is zero to prevent division by zero
    if norm_1 == 0 or norm_2 == 0:
        return 0  # Returning 0 as the similarity value when one of the vectors is a zero vector

    # Calculate the cosine similarity
    cosine_similarity = dot_product / (norm_1 * norm_2)

    return cosine_similarity

Happy to discuss what the return value should actually be when norm_1 == 0 or norm_2 == 0.
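To make the return-value question easier to discuss, here is a minimal standalone sketch of the guarded calculation. It is not the repository's code: the function name `safe_cosine_similarity` and the `zero_value` parameter are hypothetical, and it takes raw vectors rather than objects with an `item_embedding` attribute. Making the fallback a parameter lets callers pick 0.0, `float("nan")`, or something else without changing the function.

```python
import numpy as np


def safe_cosine_similarity(a, b, zero_value=0.0):
    """Cosine similarity that returns `zero_value` if either vector has zero norm.

    Hypothetical helper for illustration; `zero_value` is an assumed parameter.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)

    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)

    # Cosine similarity is undefined for a zero vector, so fall back
    # to the caller-chosen value instead of dividing by zero.
    if norm_a == 0 or norm_b == 0:
        return zero_value

    return float(np.dot(a, b) / (norm_a * norm_b))


same = safe_cosine_similarity([1.0, 2.0], [1.0, 2.0])        # parallel vectors
orthogonal = safe_cosine_similarity([1.0, 0.0], [0.0, 1.0])  # perpendicular vectors
with_zero = safe_cosine_similarity([0.0, 0.0], [1.0, 2.0])   # falls back to zero_value
```

Returning 0.0 treats a zero vector as "similar to nothing", so in a deduplication setting it would never be flagged as a duplicate. Whether that is the desired behavior is the open question.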

ayush-vibrant commented 1 year ago

This is the current implementation:

    def cosine_similarity(self, item, existing_item):
        # Calculate the dot product of the two vectors
        dot_product = np.dot(item.item_embedding, existing_item.item_embedding)

        # Calculate the norms of each vector
        norm_1 = np.linalg.norm(item.item_embedding)
        norm_2 = np.linalg.norm(existing_item.item_embedding)

        # Calculate the cosine similarity
        cosine_similarity = dot_product / (norm_1 * norm_2)

        return cosine_similarity
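For reference, a small sketch of what the unguarded formula does with a zero vector, assuming the embeddings are NumPy float arrays. Under that assumption the division typically does not raise an exception: NumPy emits a RuntimeWarning and returns nan, which can then slip silently through later comparisons. The function name `unguarded_cosine_similarity` is hypothetical and takes raw arrays rather than items.

```python
import math

import numpy as np


def unguarded_cosine_similarity(a, b):
    # Same arithmetic as the current implementation, on raw arrays.
    dot_product = np.dot(a, b)
    norm_1 = np.linalg.norm(a)
    norm_2 = np.linalg.norm(b)
    return dot_product / (norm_1 * norm_2)


zero = np.zeros(3)
other = np.array([1.0, 2.0, 3.0])

# Suppress the RuntimeWarning NumPy would otherwise emit for 0.0 / 0.0.
with np.errstate(invalid="ignore", divide="ignore"):
    result = unguarded_cosine_similarity(zero, other)

is_nan = math.isnan(result)

# Any comparison against nan is False, so a similarity-threshold check
# (0.9 here is an arbitrary example value) would quietly fail.
passes_threshold = bool(result > 0.9)
```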