piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Summarize calculates all eigenvectors when only the largest is needed #438

Closed: erbas closed this issue 8 years ago

erbas commented 9 years ago

The implementation of PageRank calls eig, which calculates all the eigenvalues and eigenvectors. Unless I'm missing something, this line https://github.com/piskvorky/gensim/blob/develop/gensim/summarization/pagerank_weighted.py#L24

vals, vecs = eig(pagerank_matrix, left=True, right=False)

could be replaced with the sparse solver that only calculates the largest eigenvalue and corresponding eigenvector:

vals, vecs = eigs(pagerank_matrix, k=1)

Was there an issue using this sparse function? I notice the pagerank_matrix is already in sparse form.

piskvorky commented 9 years ago

Yes, that's exactly what the # TODO optimize this. comment on that line means. We only need the principal eigenvector, AFAICS.

I'm not sure why @fbarrios didn't do that -- it certainly looks more efficient.

@fbarrios @fedelopez77 any comments? CC @olavurmortensen

fedelopez77 commented 9 years ago

We need the principal left eigenvector. I don't know how to get it using eigs, but with eig it can be done easily. According to Wikipedia, if we generate the transposed matrix instead, we can use eigs with k = 1. I think that would be the optimal way of doing it.

erbas commented 9 years ago

Yes, that should work, and it would be MUCH faster. Just call: eigs(pagerank_matrix.T, k=1)
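
A minimal sketch of that idea (the helper name and the csr wrapping are assumptions, not gensim's actual code): run the sparse solver on the transpose, so the single right eigenvector it returns is the principal left eigenvector of pagerank_matrix.

from scipy.sparse import csr_matrix
from scipy.sparse.linalg import eigs

def principal_left_eigenvector(pagerank_matrix):
    # eigs computes right eigenvectors, so pass the transpose to get left ones;
    # k=1 asks for only the largest-magnitude eigenpair (needs matrix size >= 3).
    vals, vecs = eigs(csr_matrix(pagerank_matrix).T, k=1, which='LM')
    vec = vecs[:, 0].real          # the dominant eigenvector may come back complex-typed
    return vec / vec.sum()         # normalize so the PageRank scores sum to 1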

piskvorky commented 9 years ago

Cool! @erbas can you implement & test this thoroughly?

Some notes:

  1. The pagerank_matrix should be symmetric, right? In that case, no transposing of pagerank_matrix is needed. I can't tell quickly if it's symmetric or not, looking at the build_adjacency_matrix code.
  2. Looking at the code, the pagerank_matrix is NOT sparse, so we'll have to be careful around shifting the non-sparse probability_matrix.
  3. Lastly, note that gensim itself has a highly scalable SVD solver for getting singular vectors (~eigenvectors). It uses an online algorithm, meaning the matrix columns can come in a stream, so the full matrix doesn't need to fit in memory. If we expect large inputs (~large pagerank_matrix dimensions) in text summarization, tweaking the summarization algo to use a streamed solver like that may be preferable.

erbas commented 9 years ago

Hi @piskvorky!

The pagerank_matrix does not appear to be symmetric, but it is sparse, just packed efficiently as a csr_matrix. However, the bad news is the eigenvector calculation is not the slow part of the code. These are timings on 10, 100, 200 lines of text after making the code change :(

Keirans-Retina-MacBook-Pro:test keiran$ time python test_summarization.py tmp.10
2015-08-26 17:10:57,555 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2015-08-26 17:10:57,558 : INFO : built Dictionary(574 unique tokens: [u'jaish', u'celebr', u'dollar', u'penrith', u'signific']...) from 88 documents (total 1004 corpus positions)
---> '_build_corpus' 0.0046 sec
---> '_set_graph_edge_weights' 0.0749 sec
---> 'build_adjacency_matrix' 0.0203 sec
---> 'build_probability_matrix' 0.0000 sec
---> 'process_results' 0.0002 sec
---> 'pagerank_weighted' 0.0237 sec
---> 'summarize_corpus' 0.1129 sec
---> '_get_important_sentences' 0.0001 sec
---> '_extract_important_sentences' 0.0001 sec
---> '_format_results' 0.0000 sec
---> 'summarize' 0.1391 sec

real    0m1.213s
user    0m0.962s
sys 0m0.261s
Keirans-Retina-MacBook-Pro:test keiran$ time python test_summarization.py tmp.100
2015-08-26 17:11:00,922 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2015-08-26 17:11:00,946 : INFO : built Dictionary(2732 unique tokens: [u'haggl', u'four', u'protest', u'jihad', u'sleep']...) from 843 documents (total 9844 corpus positions)
---> '_build_corpus' 0.0422 sec
---> '_set_graph_edge_weights' 22.5353 sec
---> 'build_adjacency_matrix' 2.3152 sec
---> 'build_probability_matrix' 0.0007 sec
---> 'process_results' 0.0019 sec
---> 'pagerank_weighted' 2.3550 sec
---> 'summarize_corpus' 26.4495 sec
---> '_get_important_sentences' 0.0009 sec
---> '_extract_important_sentences' 0.0009 sec
---> '_format_results' 0.0000 sec
---> 'summarize' 26.6937 sec

real    0m27.792s
user    0m27.200s
sys 0m0.594s
Keirans-Retina-MacBook-Pro:test keiran$ time python test_summarization.py tmp.200
2015-08-26 17:13:46,021 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2015-08-26 17:13:46,079 : INFO : built Dictionary(3934 unique tokens: [u'haggl', u'unfurl', u'four', u'protest', u'jihad']...) from 1759 documents (total 20772 corpus positions)
---> '_build_corpus' 0.0963 sec
---> '_set_graph_edge_weights' 178.0298 sec
---> 'build_adjacency_matrix' 11.2156 sec
---> 'build_probability_matrix' 0.0053 sec
---> 'process_results' 0.0037 sec
---> 'pagerank_weighted' 11.3815 sec
---> 'summarize_corpus' 196.3542 sec
---> '_get_important_sentences' 0.0027 sec
---> '_extract_important_sentences' 0.0027 sec
---> '_format_results' 0.0001 sec
---> 'summarize' 196.8886 sec

Building the adjacency matrix and setting the edge weights grow quite nonlinearly. Looking at the code, there are quadratic nested loops in there: for i in range(len(nodes)): for j in range(len(nodes)).

Are there streaming graph implementations in other parts of gensim? I can think of a couple of pruning strategies for reducing the number of nodes, and streaming versions of PageRank have been developed. So we can probably afford to do a multi-pass, row-at-a-time algo (roughly sketched below).
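
For concreteness, here is a rough sketch of what I mean by a multi-pass, row-at-a-time PageRank (plain power iteration; the row_iter_factory callable is a hypothetical interface, not anything in gensim):

import numpy as np

def streamed_power_iteration(row_iter_factory, n, damping=0.85, iterations=50):
    # row_iter_factory() must return a fresh iterator of (row_index, row_vector) pairs,
    # where each row_vector is one row of the row-stochastic transition matrix.
    rank = np.full(n, 1.0 / n)
    for _ in range(iterations):
        new_rank = np.full(n, (1.0 - damping) / n)
        for i, row in row_iter_factory():          # one streamed pass over the rows
            new_rank += damping * rank[i] * row
        rank = new_rank / new_rank.sum()
    return rank

# usage sketch: ranks = streamed_power_iteration(lambda: enumerate(transition_matrix), n)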

best, Keiran

piskvorky commented 9 years ago

What makes you think it's sparse? I don't think that's true (eig wouldn't even know how to handle sparse input).

Thanks for the timings, that will be super useful during optimizations! Regardless, I'd change the code that computes all eigenvalues to then discard all but one... there's no reason to do that. But optimization efforts should aim at summarize_corpus, clearly.

erbas commented 9 years ago

Well, eigs is the sparse version of eig, but just insert a print pagerank_matrix.toarray() and look at the result. I saw lots of zeroes! Intuitively, not all words are related, so the graph should be less than fully connected.


piskvorky commented 9 years ago

Yes, maybe the content is sparse (though not even that, after subtracting the dense probability_matrix -- that's what I meant by "careful about the shift" -- it ruins the sparsity). The matrix format definitely isn't sparse though. Specifically, pagerank_matrix is not a CSR matrix, but rather a full (dense) numpy array.

erbas commented 9 years ago

Sorry, misread your comment on my phone. I bet scipy just unpacks it if it's sparse. Will check the source when I get off the bus. And I'll have a go at streaming pagerank.

piskvorky commented 9 years ago

Hmm, how about we optimize summarize_corpus instead? I mean, computing a single eigenvector is enough. According to your timings, summarize_corpus is where the bottleneck is.

I haven't inspected that function but I assume it's better ROI than rewriting the algo to streaming pagerank :)

erbas commented 9 years ago

Sorry, you're right! I had looked earlier in the process when just the adjacency matrix was being constructed.

And apologies also for the misleading timings: summarize_corpus is the second-from-top-level function; the bottleneck is _set_graph_edge_weights, which it calls. By "streaming pagerank" I mean avoiding constructing the full graph in memory.

piskvorky commented 9 years ago

Ah, OK, then we're on again.

I'd have to study the source to make more qualified comments, but I suspect that if the algorithm is fundamentally quadratic (as @olavurmortensen said), then all we can hope for is improving the constant factors, by speeding up a function here or there.

fedelopez77 commented 9 years ago

Both _set_graph_edge_weights and build_adjacency_matrix are quadratic, but the first is much slower, probably due to _bm25_weights. The BM25 calculations are applied there, iterating through every word of every document. I think that should be the first place to start optimizing.
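
For reference, the per-term BM25 weight being computed there looks roughly like the textbook formula below (one common variant with the usual k1 and b defaults; this is not a copy of gensim's _bm25_weights):

import math

def bm25_term_weight(term_freq, doc_len, avg_doc_len, n_docs, docs_with_term, k1=1.5, b=0.75):
    # idf of the term across the corpus, clamped to stay non-negative
    idf = math.log(1.0 + (n_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # term-frequency saturation, normalised by document length
    return idf * term_freq * (k1 + 1) / (term_freq + k1 * (1 - b + b * doc_len / avg_doc_len))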

erbas commented 9 years ago

So by skipping small weights in _set_graph_edge_weights you can get a factor-of-4 speedup (a rough sketch of the idea follows the timings below). I didn't see anything obvious in _bm25_weights; I'd need to think more about the calculation:

Keirans-Retina-MacBook-Pro:test keiran$ time python test_summarization.py tmp.200
2015-08-27 16:04:11,032 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2015-08-27 16:04:11,082 : INFO : built Dictionary(3934 unique tokens: [u'haggl', u'unfurl', u'four', u'protest', u'jihad']...) from 1759 documents (total 20772 corpus positions)
---> '_build_corpus' 0.0880 sec
---> 'build_graph' 0.0030 sec
---> 'initialize' 0.0213 sec
---> '__init__' 0.0219 sec
---> 'get_bm25_weights' 6.0773 sec
---> '_set_graph_edge_weights' 13.7971 sec
---> 'remove_unreachable_nodes' 0.7418 sec
---> 'build_adjacency_matrix' 7.2672 sec
---> 'build_probability_matrix' 0.0052 sec
---> 'process_results' 0.0044 sec
---> 'pagerank_weighted' 7.4623 sec
---> 'summarize_corpus' 23.5241 sec
---> '_get_important_sentences' 0.0030 sec
---> '_get_sentences_with_word_count' 0.0000 sec
---> '_extract_important_sentences' 0.0030 sec
---> '_format_results' 0.0000 sec
---> 'summarize' 24.0451 sec
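
The "skip small weights" change is roughly this (the graph API shown is hypothetical, just to illustrate the idea; it is not a drop-in patch for gensim's Graph class):

WEIGHT_THRESHOLD = 1e-3   # assumed cutoff; edges with smaller BM25 weight are dropped

def set_graph_edge_weights_sketch(graph, weights):
    # Only add edges whose weight clears the threshold, which sparsifies the graph
    # and shrinks the work done later in build_adjacency_matrix and pagerank.
    for u in graph.nodes():
        for v in graph.nodes():
            if u == v:
                continue
            w = weights[u][v]
            if w > WEIGHT_THRESHOLD:
                graph.add_edge((u, v), w)
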
erbas commented 9 years ago

Hey @piskvorky, I don't seem to have permission to submit a pull request.

piskvorky commented 9 years ago

I don't think you need any permissions to submit pull requests to public repositories :)

What exactly was the error, and where did it appear?

erbas commented 9 years ago

Sorry, n00b git error. All submitted now. I didn't bump the version number though.

erbas commented 9 years ago

@piskvorky I've written keyword and summarisation modules which use cosine similarity of word vectors rather than BM25 weights. Results look pretty good. What's the gensim commit/test/deploy model for functions that require a set of trained vectors? I've just been using the Levy and Goldberg pre-trained vectors (https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/), which are hosted here: http://u.cs.biu.ac.il/~yogo/data/syntemb/bow5.words.bz2

piskvorky commented 9 years ago

There is currently no functionality in gensim that depends on pre-trained models.

Maybe the closest are 3rd party lib wrappers (Vowpal Wabbit, Mallet etc). There we write unit tests which only execute if a certain env var is set (MALLET_HOME, pointing to some external executable).

If the data files are large, I'm -1 on committing them to git. We'll have to ask users to supply the path to such files as runtime parameters (or optional env vars in unit tests).
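
As a rough illustration of that env-var pattern (WORD_VECTORS_PATH is a made-up variable name, not an existing gensim convention):

import os
import unittest

class TestVectorSummarization(unittest.TestCase):

    @unittest.skipIf('WORD_VECTORS_PATH' not in os.environ,
                     "set WORD_VECTORS_PATH to a pre-trained vector file to run this test")
    def test_summarize_with_vectors(self):
        vectors_path = os.environ['WORD_VECTORS_PATH']   # user supplies the path at runtime
        self.assertTrue(os.path.exists(vectors_path))    # ...load the vectors and test here

if __name__ == '__main__':
    unittest.main()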

I hope I understood your question correctly :)

fedelopez77 commented 9 years ago

@erbas as you can check in the article we wrote, we already ran several tests with cosine similarity. It does not work as well as BM25 in automatic summarization tasks, but it may be good enough here. Also, you get a symmetric matrix, and therefore an undirected graph can be used.

erbas commented 9 years ago

@fedelopez77 Thanks for the link, I can't believe I hadn't seen that already. But if I'm reading it correctly, you didn't test cosine similarity with word2vec vectors? My limited experiments show that for short samples, like reviews or chat messages, the deeply learned word vector embeddings give much more useful keywords and extractive summaries than the BM25/tf-idf approach. From my brief perusal of the literature, I wonder if the ROUGE tests are not well suited to this kind of text? Intuitively I would expect this to be the case, because pre-trained embeddings bring extra information about the language to bear on a problem where you have only small amounts of data. Note that for summarisation I'm using the average of the word vectors in a sentence as the vector for that sentence, which is the graph node, and taking the cosine similarity between these average vectors as the edge weight. I don't see a problem with the graph being undirected, but I am a relative newbie to text analytics.
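
In case it helps, here is the gist of the sentence-similarity approach I'm describing (word_vectors is assumed to be a dict-like mapping from token to numpy array, e.g. loaded from the Levy and Goldberg embeddings; none of this is gensim's current API):

import numpy as np

def sentence_vector(tokens, word_vectors, dim=300):
    # Average the vectors of the known words in a tokenized sentence; zero vector if none are known.
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine_similarity(u, v):
    # Cosine similarity between two sentence vectors, 0.0 when either is all-zero.
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / norm) if norm > 0 else 0.0

# edge weight between two sentences (graph nodes):
# weight = cosine_similarity(sentence_vector(s1, word_vectors), sentence_vector(s2, word_vectors))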

erbas commented 9 years ago

@piskvorky yes you did and thank you, that makes sense, just wanted to make sure we were on the same page. Will commit accordingly.

fedelopez77 commented 9 years ago

@erbas we used cosine similarity without word2vec. We modeled each sentence as an n-dimensional vector, with n the number of different words in the text. The components of the vectors are tf-idf values computed over the text. ROUGE metrics compare the resulting summary with "golden" summaries made by humans; that score is how "good" your summary is. It should not matter whether you are applying a pre-trained method or not. As I said, we did not run tests with word2vec, but it could be very interesting to check those results. Anyway, remember that one of the important attributes of TextRank is that it does not need pre-trained data, so if you get better results, I think a trade-off should be weighed to decide whether to add this implementation.
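
A tiny sketch of that tf-idf sentence representation (scikit-learn is used here purely for illustration; it is not what the gensim summarizer does internally):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = ["the cat sat on the mat", "a dog sat on the log", "cats and dogs are pets"]
tfidf = TfidfVectorizer().fit_transform(sentences)   # each sentence -> n-dimensional tf-idf vector
similarity_matrix = cosine_similarity(tfidf)         # symmetric, so the resulting graph is undirected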

tmylk commented 8 years ago

Implemented in https://github.com/piskvorky/gensim/pull/441