Add Zero Vector Check in Cosine Similarity Calculation

The cosine_similarity method works well for most vectors, but there is an edge case that leads to a division by zero. (It's a rare case, but a simple conditional handles it.)

Issue: If either input vector is a zero vector, its norm is zero, so the product of the norms in the denominator is zero and the cosine similarity calculation divides by zero. With NumPy floats this typically yields nan and a RuntimeWarning rather than raising an exception.
Suggestion: Add a conditional check for the case where either vector is a zero vector, and return a defined value instead. This keeps nan values and runtime warnings from propagating to callers.
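To make the issue concrete, here is a minimal sketch that reproduces it outside the class. The function name unguarded_cosine is hypothetical and only mirrors the arithmetic of the method; it is not code from the repository.

```python
import numpy as np


def unguarded_cosine(a, b):
    """Cosine similarity with no zero-vector check (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


# Suppress the floating-point warning so the result can be inspected quietly.
with np.errstate(invalid="ignore", divide="ignore"):
    result = unguarded_cosine([0.0, 0.0], [1.0, 2.0])

# One zero vector is enough: the division is 0.0 / 0.0, which gives nan.
print(np.isnan(result))
```

Note that only one of the two vectors needs to be zero for the denominator to vanish.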

Addressing this will enhance the stability of the cosine similarity calculation, especially for edge cases.

The updated code would look something like this:

def cosine_similarity(self, item, existing_item):
    # Calculate the dot product of the two vectors
    dot_product = np.dot(item.item_embedding, existing_item.item_embedding)

    # Calculate the norms of each vector
    norm_1 = np.linalg.norm(item.item_embedding)
    norm_2 = np.linalg.norm(existing_item.item_embedding)

    # If either norm is zero, the denominator would be zero
    if norm_1 == 0 or norm_2 == 0:
        return 0.0  # Treat similarity with a zero vector as 0

    # Calculate the cosine similarity
    cosine_similarity = dot_product / (norm_1 * norm_2)

    return cosine_similarity

Happy to discuss what the return value should be when norm_1 == 0 or norm_2 == 0.
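Since the right return value is open for discussion, one option is to make it a parameter. The sketch below is a standalone function, not the repository's method; the name safe_cosine_similarity and the zero_value parameter are assumptions introduced for illustration.

```python
import numpy as np


def safe_cosine_similarity(a, b, zero_value=0.0):
    """Cosine similarity that returns zero_value if either vector has zero norm.

    zero_value is a hypothetical knob: 0.0 treats a zero vector as
    "not similar to anything", while float("nan") marks the result as undefined.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)

    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)

    # Guard against a zero denominator before dividing.
    if norm_a == 0 or norm_b == 0:
        return zero_value

    return float(np.dot(a, b) / (norm_a * norm_b))


print(safe_cosine_similarity([0, 0], [1, 2]))                           # zero vector, default
print(safe_cosine_similarity([0, 0], [1, 2], zero_value=float("nan")))  # zero vector, undefined
print(safe_cosine_similarity([1, 0], [0, 1]))                           # orthogonal vectors
```

Returning 0.0 is probably the safer default for deduplication, since a zero embedding would then never exceed a similarity threshold; returning nan is more explicit but pushes the handling onto every caller.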

gkamradt / SemanticDeduplicator #29