"Similarity Measures for Text Document Clustering" by Anna Huang, 2008. http://nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf
I am trying to add another similarity function to `gensim.docsim`, but when I search for `cossim` in the source files, I get only two results: `cossim` in `matutils.py` and `matutils.cossim` in `test_lee.py`. So I am wondering how exactly gensim calculates the similarity of documents, and how I can add my own similarity function to gensim. Thanks, and looking forward to your advice!
The classes in `docsim` only support dot product + cossim.
You can add another similarity function by simply writing it and using it; or what sort of API do you need? Or did you expect the `Similarity` class to accept arbitrary sim functions as input?
Ahh, thanks, I got it. It is computed as a dot product: in `MatrixSimilarity`, `result = numpy.dot(self.index, query.T).T`, and in `SparseMatrixSimilarity`, `result = self.index * query.tocsc()`.
As you mentioned in #69, I think it would be better to let 'humans' choose when we need another method predefined in `gensim.matutils`.
Thanks again for your nice work!
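(For context: the plain dot product yields cosine similarity here because, as far as I understand, the indexed vectors and the query are normalized to unit length when the index is built. A minimal sketch of the equivalence, with made-up toy vectors:)

```python
import numpy

# toy vectors, purely illustrative
a = numpy.array([1.0, 2.0, 0.0])
b = numpy.array([2.0, 1.0, 1.0])

# explicit cosine similarity
cos = numpy.dot(a, b) / (numpy.linalg.norm(a) * numpy.linalg.norm(b))

# the same value via a plain dot product of unit-normalized vectors
cos_dot = numpy.dot(a / numpy.linalg.norm(a), b / numpy.linalg.norm(b))
assert numpy.isclose(cos, cos_dot)
```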
Well, we can support other ways too; it was an honest question.
I'm just not sure what kind of functionality people expect, or how to structure the API. I'm not a fan of engineering new functionality without clear use cases :)
Is it a reasonable approach to use cosine similarity for gensim LDA models? Or should Hellinger distance be strongly preferred? If so, I would love to see support for it. :)
@methodds looking forward to the PR!
The code for Hellinger distance is really simple; see for example here: http://stackoverflow.com/questions/22433884/python-gensim-how-to-calculate-document-similarity-using-the-lda-model#answer-22756647
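(For reference, the Hellinger distance between two discrete probability distributions $P$ and $Q$ is

$$H(P, Q) = \sqrt{\tfrac{1}{2} \sum_i \left(\sqrt{p_i} - \sqrt{q_i}\right)^2},$$

which ranges from 0 for identical distributions to 1 for distributions with disjoint support; that boundedness is part of what makes it a natural fit for comparing LDA topic distributions.)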
Hello, is there any particular reason the `cossim` method is not being used in `docsim`, and a simple dot product is used instead? Is it because the `cossim` method in `matutils` only works on sparse vectors?
I looked at your Stack Overflow answer and wrote a method for Hellinger, but it's very simple and I'm not sure if I'm going in the right direction.
```python
import numpy
from gensim.matutils import sparse2full

def hellinger(vec1, vec2, num_topics):
    # densify the sparse gensim vectors (lists of (id, weight) pairs)
    dense1 = sparse2full(vec1, num_topics)
    dense2 = sparse2full(vec2, num_topics)
    sim = numpy.sqrt(0.5 * ((numpy.sqrt(dense1) - numpy.sqrt(dense2)) ** 2).sum())
    return sim
```
If I am, I can write up some test cases and submit a pull request. There are also some more generic implementations of Hellinger here: https://gist.github.com/larsmans/3116927
[edit: I imagine what I wrote in a hurry only works for an LDA distribution and is not very generic, either]
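A quick usage sketch of the function above, on a tiny toy corpus (the corpus and model are made up for illustration, not from the thread):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# tiny toy corpus, purely for illustration
texts = [["human", "interface", "computer"], ["graph", "trees", "minors"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda = LdaModel(corpus, id2word=dictionary, num_topics=2)

vec1, vec2 = lda[corpus[0]], lda[corpus[1]]  # (topic_id, probability) pairs
print(hellinger(vec1, vec2, lda.num_topics))  # 0.0 = identical, 1.0 = maximally different
```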
@bhargavvader: Your version looks more generic than the one in the gist -- not sure what you mean there. It could be made even more generic if you accept more formats on input: dense (numpy), sparse (scipy.sparse), gensim vector (a sequence of `(id, weight)` tuples).
Ping @tmylk -- a good intro task: adding tiny sim metric functions like this into `matutils`.
Hello, could you please have a look at this and tell me if I'm heading in the right direction?
```python
import numpy
from scipy.sparse import issparse

import gensim
from gensim import matutils

def hellinger(vec1, vec2, lda=None):
    if lda is None:
        if issparse(vec1) and issparse(vec2):
            # scipy.sparse inputs: densify first
            dense1 = numpy.asarray(vec1.todense())
            dense2 = numpy.asarray(vec2.todense())
            return numpy.sqrt(0.5 * ((numpy.sqrt(dense1) - numpy.sqrt(dense2)) ** 2).sum())
        elif isinstance(vec1, numpy.ndarray) and isinstance(vec2, numpy.ndarray):
            return numpy.sqrt(0.5 * ((numpy.sqrt(vec1) - numpy.sqrt(vec2)) ** 2).sum())
        elif isinstance(vec1, list):
            # gensim bag-of-words vectors: sequences of (id, weight) tuples
            vec1, vec2 = dict(vec1), dict(vec2)
            # iterate over the union of ids, so weights present in only one
            # vector still contribute to the distance
            indices = set(vec1) | set(vec2)
            return numpy.sqrt(0.5 * sum(
                (numpy.sqrt(vec1.get(index, 0.0)) - numpy.sqrt(vec2.get(index, 0.0))) ** 2
                for index in indices))
        else:
            raise ValueError("unsupported input type: %s" % type(vec1))
    elif isinstance(lda, gensim.models.ldamodel.LdaModel):
        # gensim LDA vectors: densify against the number of topics
        dense1 = matutils.sparse2full(vec1, lda.num_topics)
        dense2 = matutils.sparse2full(vec2, lda.num_topics)
        return numpy.sqrt(0.5 * ((numpy.sqrt(dense1) - numpy.sqrt(dense2)) ** 2).sum())
    else:
        raise ValueError("lda must be a gensim LdaModel, got %s" % type(lda))
```
This works fine for LDA distribution vectors, for bag-of-words representations (I used an implementation similar to the one in your `cossim` method; does it look ok?), and for matrices which have the same dimensions. If it's an ok way of going ahead, I'll try to fix the problem with dense and sparse matrices of different dimensions.
If this approach is fine, I could also add something simple like the Jaccard coefficient for gensim vectors (bag of words).
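A minimal sketch of what that might look like for bag-of-words vectors, using the set-based variant that ignores weights (the function name and details are just my assumption, not settled API):

```python
def jaccard(vec1, vec2):
    # set-based Jaccard coefficient over gensim bag-of-words vectors:
    # |intersection| / |union| of the term ids, ignoring weights
    ids1 = {term_id for term_id, _ in vec1}
    ids2 = {term_id for term_id, _ in vec2}
    union = ids1 | ids2
    # two empty vectors are treated as identical
    return len(ids1 & ids2) / float(len(union)) if union else 1.0
```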
@tmylk, should I go ahead and submit a PR for this? And any suggestions on dealing with matrices of different dimensions?
@piskvorky what are the steps ahead for this?
Deferring to @tmylk .
Implemented in https://github.com/RaRe-Technologies/gensim/issues/656
I wanted to know if there is a way for the `Similarity` class to accept other similarity functions besides cosine similarity. For example, I want to write my own similarity function and substitute it for the cosine similarity function in the gensim package.
Any help would be greatly appreciated. @piskvorky
https://github.com/RaRe-Technologies/gensim/issues/64#issuecomment-150719000
@jvazquez2, unfortunately there is no built-in way. You can implement your own class using, for example, https://github.com/RaRe-Technologies/gensim/blob/2ce4699e048a4bb02be06b0412a42da9bd7fbdfe/gensim/similarities/docsim.py#L722 as a base class, or simply compute everything manually.
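For instance, a rough sketch of swapping in a different measure by subclassing; this assumes a dense `MatrixSimilarity` index and overrides `get_similarities`, the internal hook that `__getitem__` calls. The query is assumed to already be a dense numpy vector; a real implementation would also handle gensim-format queries:

```python
import numpy
from gensim.similarities import MatrixSimilarity

class NegEuclideanSimilarity(MatrixSimilarity):
    """Hypothetical subclass: rank documents by negated Euclidean distance
    instead of cosine similarity (larger = more similar, as gensim expects)."""
    def get_similarities(self, query):
        query = numpy.asarray(query, dtype=self.index.dtype)
        # broadcast the query against every row (document) of the dense index
        diffs = self.index - query
        return -numpy.sqrt((diffs ** 2).sum(axis=1))
```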
Currently, `Similarity` works purely over cosine similarity (~the angle between query and indexed document).
Make this more general, using e.g. Hellinger distance for models that represent the documents as probability distributions.
At the same time, try to still keep things computationally efficient (using BLAS & mmap etc.).
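A rough sketch of what the dense case might look like: computing Hellinger distances from a single query distribution to every indexed document in one vectorized pass (names are illustrative, not an actual gensim API):

```python
import numpy

def hellinger_bulk(index, query):
    # index: shape (num_docs, num_topics), each row a probability distribution
    # query: shape (num_topics,), one probability distribution
    sqrt_index = numpy.sqrt(index)          # could be precomputed once and mmap'ed
    diffs = sqrt_index - numpy.sqrt(query)  # broadcasts the query across all rows
    return numpy.sqrt(0.5 * (diffs ** 2).sum(axis=1))
```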