SoftCosineSimilarity: Unexpected Behavior for Query Input

nataliedurgin commented 6 years ago

Description

Expected SoftCosineSimilarity self[query] syntax to accept any iterable of list of (int, number) as indicated by: https://github.com/RaRe-Technologies/gensim/blob/f9669bb8a0b5b4b45fa8ff58d951a11d3178116d/gensim/similarities/docsim.py#L925

Steps/Code/Corpus to Reproduce

import scipy as sp
import gensim

# import the common_corpus as a TextCorpus object
# We assume common_corpus is a text file containing the lines:
#
# Human machine interface for lab abc computer applications
# A survey of user opinion of computer system response time
# The EPS user interface management system
# System and human system engineering testing of EPS
# Relation of user perceived response time to error measurement
# The generation of random binary unordered trees
# The intersection graph of paths in trees
# Graph minors IV Widths of trees and well quasi ordering
# Graph minors A survey

corpus = gensim.corpora.TextCorpus(input="path/common_corpus")
tfidf = gensim.models.TfidfModel(
    corpus, id2word=corpus.dictionary, normalize=True)

# Behavior of MatrixSimilarity
cos_index = gensim.similarities.MatrixSimilarity(tfidf[corpus])
cos_index[tfidf[corpus]]  # :)

# Behavior of SoftCosineSimilarity
dim = len(corpus.dictionary)
M = sp.sparse.identity(dim).tocsc() # generate the identity matrix
softcos_index = gensim.similarities.SoftCosineSimilarity(tfidf[corpus], M)
softcos_index[tfidf[corpus]] # :(
softcos_index[tfidf[[bow for bow in corpus]]] # :|
softcos_index[[vec for vec in tfidf[corpus]]] # :|

Expected Results

I believe that the TransformedCorpus tfidf[corpus] is an iterable of list of (int, number):

In [248]: next(iter(tfidf[corpus]))
Out[248]:
[(0, 0.4500498962785324),
 (1, 0.4500498962785324),
 (2, 0.3080749612015952),
 (3, 0.3080749612015952),
 (4, 0.4500498962785324),
 (5, 0.4500498962785324)]

I would expect that to be taken as query input without incident. It seems that SoftCosineSimilarity was adapted from a copy of MatrixSimilarity and was meant to function in a similar spirit, however, they do not accept the same types of input.

Actual Results

In [249]: softcos_index[tfidf[corpus]]
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-249-2d8c4d759cc9> in <module>()
----> 1 softcos_index[tfidf[corpus]]

/Users/natalied/Projects/canonical_answers/.venv/lib/python2.7/site-packages/gensim/interfaces.pyc in __getitem__(self, query)
    365                 else:
    366                     query = matutils.unitvec(query)
--> 367         result = self.get_similarities(query)
    368
    369         if self.num_best is None:

/Users/natalied/Projects/canonical_answers/.venv/lib/python2.7/site-packages/gensim/similarities/docsim.pyc in get_similarities(self, query)
    936             query = [self.corpus[i] for i in query]
    937
--> 938         if not query or not isinstance(query[0], list):
    939             query = [query]
    940

/Users/natalied/Projects/canonical_answers/.venv/lib/python2.7/site-packages/gensim/interfaces.pyc in __getitem__(self, docno)
    215             return self.obj[self.corpus[docno]]
    216         else:
--> 217             raise RuntimeError('Type {} does not support slicing.'.format(type(self.corpus)))
    218
    219

RuntimeError: Type <class 'gensim.corpora.textcorpus.TextCorpus'> does not support slicing.

Possible Code Involved

Compare the MatrixSimilarity get_similarities method: https://github.com/RaRe-Technologies/gensim/blob/f9669bb8a0b5b4b45fa8ff58d951a11d3178116d/gensim/similarities/docsim.py#L825 with the SoftCosineSimilarity get_similarities method: https://github.com/RaRe-Technologies/gensim/blob/f9669bb8a0b5b4b45fa8ff58d951a11d3178116d/gensim/similarities/docsim.py#L938

Also we may need to expand this test: https://github.com/RaRe-Technologies/gensim/blob/f9669bb8a0b5b4b45fa8ff58d951a11d3178116d/gensim/test/test_similarities.py#L377

Versions

Darwin-15.6.0-x86_64-i386-64bit
('Python', '2.7.13 (default, Apr  4 2017, 08:46:44) \n[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)]')
('NumPy', '1.14.1')
('SciPy', '1.0.0')
('gensim', '3.4.0')
('FAST_VERSION', 0)

piskvorky commented 6 years ago

Thanks for reporting. CC @Witiko

Witiko commented 6 years ago

Just to expand the original issue report, the get_similarities method of SoftCosineSimilarity was taken verbatim from WmdSimilarity. Any digression from the expected behavior is likely to impact that class as well.

nataliedurgin commented 6 years ago

Not sure if this is the correct place to put this info, but I also noticed that these related unit tests do not seem to be run:

https://github.com/RaRe-Technologies/gensim/blob/f9669bb8a0b5b4b45fa8ff58d951a11d3178116d/gensim/test/test_keyedvectors.py#L30

And if they did, I am not sure this is correctly checking for ones on the diagonal of the term similarity matrix:

https://github.com/RaRe-Technologies/gensim/blob/f9669bb8a0b5b4b45fa8ff58d951a11d3178116d/gensim/test/test_keyedvectors.py#L40

Witiko commented 6 years ago

You are correct, @nataliedurgin. Additionally, if the test were run, they would have failed at https://github.com/RaRe-Technologies/gensim/blob/f9669bb8a0b5b4b45fa8ff58d951a11d3178116d/gensim/test/test_keyedvectors.py#L38 since the up-to-date signature of similarity_matrix is similarity_matrix(dictionary, tfidf, …). Do you have any idea why the tests are not executed by the CI services, @menshikh-iv?

nataliedurgin commented 6 years ago

I think unittest might require methods to be prefixed by 'test_' in order to find them. The tests might be formatted like this: https://github.com/nataliedurgin/gensim/commit/5bb6aea3a92b8e7ff7cecfca054a6057bb9b8fa8

All seem to pass in that case except for checking that there are ones on the diagonals. (Not to say they shouldn't be expanded.)

(Sorry I wasn't working on this issue. I was working on making a Levenshtein distance term similarity matrix function to feed into softcosine and was cannibalizing some of the existing term similarity matrix functionality.)

Witiko commented 6 years ago

Nice catch! This seems to be a test case that I sketched and then forgot about, since the CI service did not complain. It is somewhat unrelated to the original issue, but should be fixed nevertheless.

piskvorky commented 6 years ago

@Witiko @nataliedurgin can you fix it?

Witiko commented 6 years ago

@piskvorky Not earlier than this weekend. @nataliedurgin has already fixed the unrelated tests, but the original issue still needs fixing and making into a new test case.

Witiko commented 6 years ago

(Sorry I wasn't working on this issue. I was working on making a Levenshtein distance term similarity matrix function to feed into softcosine and was cannibalizing some of the existing term similarity matrix functionality.)

Most of the method should be directly reusable if you are going for the same greedy algorithm that achieves a constant number of non-zero elements per a row / column. If you are going just for a constant threshold, like Sidorov et al. (2014) did, you will end up with something much simpler.

Building the similarity matrix off the Levenshtein distance achieved competitive results to word embeddings for SimBow at SemEval 2017 and in my experiments as well; the inclusion of a corresponding method / function in Gensim could be beneficial.

nataliedurgin commented 6 years ago

Those actually were the relevant tests, but because the diagonals test case still needed to be corrected, I didn't necessarily want to take responsibility for it. I guess I've written that test for my purposes already too, so it won't be a bother. I'll open another ticket for just the tests and close it in my eventual pull request.

@Witiko Exactly my reasoning. I am implementing aspects of the Simbow paper, including Levenshtein, and saw that there was precedent to add it to gensim because softcosine had already been done. I am almost finished. (Though I am sure performance improvements could be made.)

https://github.com/nataliedurgin/gensim/commits/levenshtein-term-similarity-matrix

nataliedurgin commented 6 years ago

This issues surrounding unit tests for similarity_matrix can now be found #1961 and #1960.

piskvorky commented 6 years ago

@nataliedurgin sorry, I'm not familiar with the application here, but what edit distance functionality is actually needed? What operations specifically?

Is it a string-string query, or string-set_of_strings query? Is there an upper threshold on the "maximum relevant distance"? Is it really Levenshtein, or would a better edit-distance model help (such as discounting errors due to keyboard typos, omitted syllables, pronunciation similarity etc).

I'm asking because we've done a tremendous amount of work in this area in the past. We have some pretty well optimized algorithms and implementations for the various use-cases that fall under or expand "Levenshtein distance".

nataliedurgin commented 6 years ago

@piskvorky That would be fantastic if it exists. I couldn't find it in the repo. The application in SimBow at SemEval 2017 http://www.aclweb.org/anthology/S17-2051 (the paper mentioned by @Witiko) is constructing a word-word edit distance similarity matrix. I am not actually implementing levenshtein, but planned to use either pylev.levenshtein or distance.levenshtein (which would perhaps create unwanted external dependencies). I will happily substitute any method that takes in two word strings and spits back some notion of edit-distance. I am only implementing the functionality to take a Dictionary and output the term similarity matrix that can be fed into SoftCosineSimilarity. I need it for work regardless, but since there seemed to be existing interest in other portions of that paper, thought that I would offer it to the community. No worries either way!

nataliedurgin commented 6 years ago

Aside: every time I read "softcos" my brain turns it into "soft tacos" and I get hungry.

piskvorky commented 6 years ago

Well that's my question: what are the operations actually needed in this app? Are we comparing two static sets of words (which can be done much faster than a menial pairwise computation)?

Or do we have one static set of words and comparing that against a dynamic unknown string? If that's the case, some structures can be precomputed that make the dynamic distance computations much faster. Do we need only top-N most similar words? Everything under distance <= 3 (ignoring very distant words)?

It is rather rare that a full slavish N^2 Levenshtein is actually needed. And the runtime implications can be huge.

nataliedurgin commented 6 years ago

Gotcha. You are correct that what I really want is to input a pair of documents and output a score related to their edit distance. Since I have no idea how sensitive the results in the paper are to this particular method of creating a levenshtein term similarity matrix as input into soft cosine, my first instinct was to create a literal implementation. I'll hold off on the pull request and experiment internally with both a literal implementation and an existing gensim edit distance based similarity.

At the very least this endevour has uncovered some bugs in existing functionality.

Would you be willing to point me in the direction of existing edit-distance functionality?

Thanks so much for your help!

Witiko commented 6 years ago

Are we comparing two static sets of words (which can be done much faster than a menial pairwise computation)?

@piskvorky If we are actually building a term similarity matrix, then we will be computing the edit distance between all two-sets of words in the dictionary (unless we pre-filter some) regardless of the end application. This can likely be optimized already.

piskvorky commented 6 years ago

Aha, so we're comparing two static sets of words?

And do we care about all the distances = filling this word x word distance matrix fully, or only a subset, such as only top-N smallest values in each row, or only cell values <= 3 etc?

piskvorky commented 6 years ago

@nataliedurgin none of that work is open sourced (yet). I'm just trying to figure out if it's worth it, what the actual use case is here :)

Witiko commented 6 years ago

Definitely only top-N closest in each row.

piskvorky commented 6 years ago

Cool, that should optimize very nicely. Nothing urgent -- IMO let's get the algo correct first and check the performance. See if it's actually worth optimizing = how much difference that would make in reality.

Witiko commented 6 years ago

@piskvorky The fires have been extinguished. I am currently writing code for building the Levenshtein similarity matrix and I will issue a pull request once the code is ready for review. A parallelized naive implementation running at 32 cores builds a 462807² term similarity matrix in about eight days, which puts us at about 310K word pairs per second. This is quite slow, since it would take about three years to build a term similarity matrix from the English Wikipedia dictionary of 5M words at this speed. 10- to 100-fold speed increase is necessary to make this practical.

Witiko commented 6 years ago

Ping @nataliedurgin in case this is still relevant for you.

nataliedurgin commented 6 years ago

@Witiko Thanks so much! Definitely still interested. Out of curiosity, what do you feel is the bottle-neck for any given chunk with your naive implementation? Is it the distance function itself? Or something else about the bookkeeping/fileio?

TL;DR: See bold text. I also had a few questions related to performance of SoftCosineSimilarity. I know that the function requires the the index built can be held in memory. I made a preliminary attempt at all-pairs on a corpus of ~200K documents of dimension ~150K. A word2vec term similarity matrix, M, for that size vocabulary finished no problem. I put the documents into SoftCosineSimilarity and the index built almost instantaneously (which surprised me...some kind of functional deferral of the work? It seems like MatrixSimilarity takes its time crunching the index up front?). Then I set running overnight an index[corpus]...came back the next morning and it was not done. So I queried a single document from that corpus against the index. This took 5min on my laptop. Now, in theory, elsewhere in the repo providing the whole corpus is optimized over running a query at a time. If that is not the case here, then to obtain similarity scores for 200K documents would take ~700 days? Even with corpus optimization would that be brought down to a realistic time-frame?

I realize I would need to be much more specific if we were going to get into this. However, my high-level question is, would you have expected this size of a corpus/vocabulary to have been completed in a reasonable amount of time? Was SoftCosineSimilarity meant to be more of a prototype and really one would need to adapt the sharded Similarity class to a soft cosine to do something more hefty?

As a quick(ish) work-around, I was able to stay in scipy land and implement the softcosine formula (equation 2 in the simbow paper) in an alternative way. I kept the current implementation to obtain M which spits out a sparse matrix. I used the corpus2csc functionality to get a scipy sparse matrix representation of the corpus of documents, X_i. Then computed the scipy sparse product X_i^tM once and only once for each i, recycling them to get the normalizing constants N_i= \sqrt{(X_i^tM)X_i} for each document. Finished with two sparse matrices with rows, (X_i^tM)/N_i, X_i/N_i, the former is pretty dense now, the latter is still super sparse. Then I fed into sklearn.metrics.pairwise.linear_kernel (pitting 1000 of the X_i^tM/N_i at a time against all the 800K sparse X_i/N_i) to get the final similarity scores. This returned all pairs for the above corpus/vocabulary in about an hour on my laptop. Increasing the size of the corpus to ~800K with a vocab of ~250K fell over because now the matrix of all the X_i^tM/N doesn't fit in memory. The next step was to chunk the computation of the X_i^tM/N (computing those for about 1000 i at a time). Geting all the X_i^tM/N on disk took about 3 hours on my laptop. Subsequently getting everything through the linear_kernel in the same manner, all-pairs softcosine similarity scores for the larger corpus finish in about an additional 3 hours.

This is, of course, essentially the same thing, except you actually physically shrink the sparse objects to do the multiplications. You do seemingly repeat the vec1.T.dot(M) twice,

https://github.com/RaRe-Technologies/gensim/blob/10a3dab8d00c0523ff871af75fb0badcff14848b/gensim/matutils.py#L836

https://github.com/RaRe-Technologies/gensim/blob/10a3dab8d00c0523ff871af75fb0badcff14848b/gensim/matutils.py#L844

which initially I thought was the problem, but after profiling SoftCosineSimilarity, it doesn't seem to be the bottleneck at all (does the python interpreter understand how to cache that computation?).

I realize I am playing fast and loose in the above communication. If it magically parses, great. If not, the bold I believe is a reasonable question and what I am interested in so I can understand where I went wrong when using your existing implementation, or if that performance is to be expected.

Witiko commented 6 years ago

@nataliedurgin The Levenshtein distance itself, I would say, but I haven't done any profiling yet.

As for the other part (at least the main question part of it), you are right on all accounts. SoftCosineSimilarity currently computes soft cosine measure on document basis by repeatedly calling softcossim. Yes, computing CMC^T, where C is the corpus, is going to be considerably faster and was discussed in the soft cosine measure pull request #1827 as a possible direction for future development. Yes, SoftCosineSimilarity is currently just a naive wrapper / toy that is ill-fitted to the task of computing soft cosine measure between all pairs of documents in a corpus.

nataliedurgin commented 6 years ago

Oh Sweet! I hadn't seen this pull request. Great discussion.

piskvorky commented 6 years ago

@menshikh-iv @Witiko where does this naive/toy wrapper live currently? Is it in experimental?

We don't want toy things in core gensim.

Witiko commented 6 years ago

@piskvorky That would be the SoftCosineSimilarity class in gensim/similarities/docsim.py. Currently, the get_similarities method computes the similarities one by one instead of performing a pair of matrix products on a corpus matrix (similar to what MatrixSimilarity.get_similarities currently does). This was made known in the original Soft Cosine Measure pull request, see the Future work section of the first post in #1827. The method can be sped up without changing the current interface.

nataliedurgin commented 6 years ago

@Witiko I attended a talk last night that got me thinking. Maybe we don't need Levenshtein. The spirit of it's use in the Simbow paper is to account for noisy data, misspellings etc.

In my application of this paper, I had always planned on computing multiple sets of soft cosine similarity scores with M's coming from several different types of word-embeddings, including Fasttext. I was reminded that since fasttext works at the character level, it works well to associate misspellings etc.

In general, I think Levenshtein is a more transparent/organic/low-level approach to addressing this issue and I tend to prioritize those types of solutions where possible. However, in this case, since the fasttext functionality is implemented and very mature (and the term similarity matrix computation for the word vectors seems to perform well), it is actually the more accessible solution.

If one were to repeat the Simbow paper process replacing the M_lev with some flavor of an M_fasttext, I wonder if the results wouldn't be the same or better. This seems like something we could verify.

What are your thoughts?

Witiko commented 6 years ago

@nataliedurgin I experimented with several types of word embedding including fasttext embeddings in my thesis, but I did not manage to achieve any significant improvement on the QA datasets from SemEval 2017, task 3 (see Section 4.7.3, results are in tables 4.1, and 4.2).

Witiko commented 6 years ago

@piskvorky All the parts of the implementation can be improved as discussed in #1827. The softcossim function would be faster if it weren't for a suboptimal SciPy dot product implementation, the SoftCosineSimilarity.get_similarities method does not use the opportunity to compute similarities in two dot product operations, and the similarity_matrix method is subject to change, since the matrix building algorithm is greedy and inherently experimental. Additionally, as further similarity matrix builders are added, I imagine a separation of the matrix building algorithm, and the builder classes (such as EuclideanKeyedVectors) would simplify things.

If the above sounds too much like a toy / experimental, then perhaps it would be best to move the code outside what you call core Gensim. Mind that I did not withhold any of the above information when I filed the pull request and this is the first time a similar suggestion was raised, i.e. I was not even aware that there existed several release types for Gensim (core, experimental). I am sorry if this was an oversight on my part.

piskvorky / gensim