"Similarity Measures for Text Document Clustering" by Anna Huang, 2008. http://nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf
I am trying to add another similarity function to `gensim.docsim`, but when I search for `cossim` in the source files, I get only two results: `cossim` in `matutils.py` and `matutils.cossim` in `test_lee.py`. So I am wondering how exactly gensim calculates the similarity of documents, and how I can add my own similarity function to gensim. Thanks, and looking forward to your advice!
The classes in `docsim` only support dot product + cossim.
You can add another similarity function by simply writing it and using it; or what sort of API do you need? Or did you expect the `Similarity` class to accept arbitrary sim functions as input?
Ahh, thanks, I got it. It is computed as a dot product: in `MatrixSimilarity`, `result = numpy.dot(self.index, query.T).T`, and in `SparseMatrixSimilarity`, `result = self.index * query.tocsc()`.
As you mentioned in #69, I think it would be better to let 'humans' choose when we need another method predefined in `gensim.matutils`.
Thanks again for your nice work!
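(For context: the plain dot product yields cosine similarity here because, as far as I understand, the indexed vectors and the query are normalized to unit length when the index is built. A minimal sketch of the equivalence, with made-up toy vectors:)

```python
import numpy

# toy vectors, purely illustrative
a = numpy.array([1.0, 2.0, 0.0])
b = numpy.array([2.0, 1.0, 1.0])

# explicit cosine similarity
cos = numpy.dot(a, b) / (numpy.linalg.norm(a) * numpy.linalg.norm(b))

# the same value via a plain dot product of unit-normalized vectors
cos_dot = numpy.dot(a / numpy.linalg.norm(a), b / numpy.linalg.norm(b))
assert numpy.isclose(cos, cos_dot)
```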
Well, we can support other ways too; it was an honest question.
I'm just not sure what kind of functionality people expect, or how to structure the API. I'm not a fan of engineering new functionality without clear use cases :)
Is it a reasonable approach to use cosine similarity for gensim LDA models? Or should Hellinger distance be strongly preferred? If so, I would love to see support for it. :)
@methodds looking forward to the PR!
The code for Hellinger distance is really simple; see for example here: http://stackoverflow.com/questions/22433884/python-gensim-how-to-calculate-document-similarity-using-the-lda-model#answer-22756647
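(For reference, the Hellinger distance between two discrete probability distributions $P$ and $Q$ is

$$H(P, Q) = \sqrt{\tfrac{1}{2} \sum_i \left(\sqrt{p_i} - \sqrt{q_i}\right)^2},$$

which ranges from 0 for identical distributions to 1 for distributions with disjoint support; that boundedness is part of what makes it a natural fit for comparing LDA topic distributions.)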
Hello, is there any particular reason the `cossim` method is not being used in `docsim`, and a simple dot product is used instead? Is it because the `cossim` method in `matutils` only works on sparse vectors?
I looked at your Stack Overflow answer and wrote a method for Hellinger, but it's very simple and I'm not sure if I'm going in the right direction.
```python
import numpy
from gensim.matutils import sparse2full

def hellinger(vec1, vec2, num_topics):
    # densify the sparse gensim vectors (lists of (id, weight) pairs)
    dense1 = sparse2full(vec1, num_topics)
    dense2 = sparse2full(vec2, num_topics)
    sim = numpy.sqrt(0.5 * ((numpy.sqrt(dense1) - numpy.sqrt(dense2)) ** 2).sum())
    return sim
```
If I am, I can write up some test cases and submit a pull request. There are also some more generic implementations of Hellinger here: https://gist.github.com/larsmans/3116927
[edit: I imagine what I wrote in a hurry only works for an LDA distribution and is not very generic, either]
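A quick usage sketch of the function above, on a tiny toy corpus (the corpus and model are made up for illustration, not from the thread):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# tiny toy corpus, purely for illustration
texts = [["human", "interface", "computer"], ["graph", "trees", "minors"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda = LdaModel(corpus, id2word=dictionary, num_topics=2)

vec1, vec2 = lda[corpus[0]], lda[corpus[1]]  # (topic_id, probability) pairs
print(hellinger(vec1, vec2, lda.num_topics))  # 0.0 = identical, 1.0 = maximally different
```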
@bhargavvader: Your version looks more generic than the one in the gist -- not sure what you mean there. It could be made even more generic if you accept more formats on input: dense (numpy), sparse (scipy.sparse), gensim vector (a sequence of `(id, weight)` tuples).
Ping @tmylk -- a good intro task: adding tiny sim metric functions like this into `matutils`.
Hello, could you please have a look at this and tell me if I'm heading in the right direction?
```python
import numpy
from scipy.sparse import issparse

import gensim
from gensim import matutils

def hellinger(vec1, vec2, lda=None):
    if lda is None:
        if issparse(vec1) and issparse(vec2):
            # scipy.sparse inputs: densify first
            dense1 = numpy.asarray(vec1.todense())
            dense2 = numpy.asarray(vec2.todense())
            return numpy.sqrt(0.5 * ((numpy.sqrt(dense1) - numpy.sqrt(dense2)) ** 2).sum())
        elif isinstance(vec1, numpy.ndarray) and isinstance(vec2, numpy.ndarray):
            return numpy.sqrt(0.5 * ((numpy.sqrt(vec1) - numpy.sqrt(vec2)) ** 2).sum())
        elif isinstance(vec1, list):
            # gensim bag-of-words vectors: sequences of (id, weight) tuples
            vec1, vec2 = dict(vec1), dict(vec2)
            # iterate over the union of ids, so weights present in only one
            # vector still contribute to the distance
            indices = set(vec1) | set(vec2)
            return numpy.sqrt(0.5 * sum(
                (numpy.sqrt(vec1.get(index, 0.0)) - numpy.sqrt(vec2.get(index, 0.0))) ** 2
                for index in indices))
        else:
            raise ValueError("unsupported input type: %s" % type(vec1))
    elif isinstance(lda, gensim.models.ldamodel.LdaModel):
        # gensim LDA vectors: densify against the number of topics
        dense1 = matutils.sparse2full(vec1, lda.num_topics)
        dense2 = matutils.sparse2full(vec2, lda.num_topics)
        return numpy.sqrt(0.5 * ((numpy.sqrt(dense1) - numpy.sqrt(dense2)) ** 2).sum())
    else:
        raise ValueError("lda must be a gensim LdaModel, got %s" % type(lda))
```
This works fine for LDA distribution vectors, for bag-of-words representations (I used an implementation similar to the one in your `cossim` method; does it look ok?), and for matrices which have the same dimensions. If it's an ok way of going ahead, I'll try to fix the problem with dense and sparse matrices of different dimensions.
If this approach is fine, I could also add something simple like the Jaccard coefficient for gensim vectors (bag of words).
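A minimal sketch of what that might look like for bag-of-words vectors, using the set-based variant that ignores weights (the function name and details are just my assumption, not settled API):

```python
def jaccard(vec1, vec2):
    # set-based Jaccard coefficient over gensim bag-of-words vectors:
    # |intersection| / |union| of the term ids, ignoring weights
    ids1 = {term_id for term_id, _ in vec1}
    ids2 = {term_id for term_id, _ in vec2}
    union = ids1 | ids2
    # two empty vectors are treated as identical
    return len(ids1 & ids2) / float(len(union)) if union else 1.0
```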
@tmylk, should I go ahead and submit a PR for this? And any suggestions on dealing with matrices of different dimensions?
@piskvorky what are the steps ahead for this?
Deferring to @tmylk .
Implemented in https://github.com/RaRe-Technologies/gensim/issues/656
I wanted to know if there is a way for the `Similarity` class to accept other similarity functions besides cosine similarity. For example, I want to write my own similarity function and substitute it for the cosine similarity function in the gensim package.
Any help would be greatly appreciated. @piskvorky
https://github.com/RaRe-Technologies/gensim/issues/64#issuecomment-150719000
@jvazquez2, unfortunately there is no built-in way. You can implement your own class using, for example, https://github.com/RaRe-Technologies/gensim/blob/2ce4699e048a4bb02be06b0412a42da9bd7fbdfe/gensim/similarities/docsim.py#L722 as a base class, or simply compute everything manually.
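For instance, a rough sketch of swapping in a different measure by subclassing; this assumes a dense `MatrixSimilarity` index and overrides `get_similarities`, the internal hook that `__getitem__` calls. The query is assumed to already be a dense numpy vector; a real implementation would also handle gensim-format queries:

```python
import numpy
from gensim.similarities import MatrixSimilarity

class NegEuclideanSimilarity(MatrixSimilarity):
    """Hypothetical subclass: rank documents by negated Euclidean distance
    instead of cosine similarity (larger = more similar, as gensim expects)."""
    def get_similarities(self, query):
        query = numpy.asarray(query, dtype=self.index.dtype)
        # broadcast the query against every row (document) of the dense index
        diffs = self.index - query
        return -numpy.sqrt((diffs ** 2).sum(axis=1))
```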
Currently, `Similarity` works purely over cosine similarity (~the angle between query and indexed document).
Make this more general, using e.g. Hellinger distance for models that represent the documents as probability distributions.
At the same time, try to still keep things computationally efficient (using BLAS & mmap etc.).
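A rough sketch of what the dense case might look like: computing Hellinger distances from a single query distribution to every indexed document in one vectorized pass (names are illustrative, not an actual gensim API):

```python
import numpy

def hellinger_bulk(index, query):
    # index: shape (num_docs, num_topics), each row a probability distribution
    # query: shape (num_topics,), one probability distribution
    sqrt_index = numpy.sqrt(index)          # could be precomputed once and mmap'ed
    diffs = sqrt_index - numpy.sqrt(query)  # broadcasts the query across all rows
    return numpy.sqrt(0.5 * (diffs ** 2).sum(axis=1))
```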