Cosine similarity is not a distance #78

nicrie commented 8 months ago

Perhaps I'm wrong, but shouldn't a distance matrix used for clustering have small values when samples are close and large values when they are very different? The current implementation uses the cosine similarity directly, which gives +1 for samples pointing in the same direction and -1 for samples pointing in opposite directions. So I think we want to change the following

https://github.com/omadson/fuzzy-c-means/blob/3e57aa2386908bef413525c28eaa88dde47d132e/fcmeans/main.py#L146-L150

to something like

def _cosine_similarity(A: NDArray, B: NDArray) -> NDArray:
    """Compute the cosine similarity between two matrices"""
    p1 = np.sqrt(np.sum(A**2, axis=1))[:, np.newaxis]
    p2 = np.sqrt(np.sum(B**2, axis=1))[np.newaxis, :]
    return np.dot(A, B.T) / (p1 * p2)


def _cosine(A: NDArray, B: NDArray) -> NDArray:
    """Compute the cosine distance between two matrices"""
    return np.abs(1 - _cosine_similarity(A, B))

and then use _cosine instead of _cosine_similarity for computing the distance matrix.
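
To illustrate the difference, here is a minimal standalone sketch (not the package's code). The names cosine_similarity, cosine_distance and the toy matrix X are made up for this example, and it assumes no row has zero norm. It shows that the similarity ranks identical samples highest, while 1 - similarity gives 0 for identical samples and 2 for opposite ones:

```python
import numpy as np


def cosine_similarity(A, B):
    """Pairwise cosine similarity between the rows of A and the rows of B."""
    # Row norms, shaped so that their product broadcasts to (len(A), len(B)).
    # Assumption: no row is all zeros, otherwise this divides by zero.
    norm_a = np.linalg.norm(A, axis=1)[:, np.newaxis]
    norm_b = np.linalg.norm(B, axis=1)[np.newaxis, :]
    return (A @ B.T) / (norm_a * norm_b)


def cosine_distance(A, B):
    """Pairwise cosine distance, defined here as |1 - cosine similarity|."""
    return np.abs(1 - cosine_similarity(A, B))


# Toy data: rows 0 and 2 point in opposite directions, row 1 is orthogonal to both.
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])

S = cosine_similarity(X, X)  # large values mean "close"
D = cosine_distance(X, X)    # small values mean "close"

print(S)
print(D)
```

Since the similarity lies in [-1, 1], 1 - similarity already lies in [0, 2]; the np.abs call only matters when floating-point rounding pushes a similarity slightly above 1.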

PS: I can open a PR if required

omadson commented 7 months ago

Hi @nicrie, you are right. I fixed this issue and released the fix in a new version. Thank you very much for the report. Please update the package to get the change.

nicrie commented 7 months ago

Awesome, thanks for the update :)
