Yes, that's exactly what the # TODO optimize this comment on that line means. We only need the principal eigenvector, AFAICS.
I'm not sure why @fbarrios didn't do that -- it certainly looks more efficient.
@fbarrios @fedelopez77 any comments? CC @olavurmortensen
We need the principal left eigenvector. I don't know how to get it using eigs, but with eig it can be done easily.
According to Wikipedia, if we generate the transposed matrix instead, we can use eigs setting k = 1. I think that would be the optimal way of doing it.
Yes that should work. And be MUCH faster. Just call:
eigs(pagerank_matrix.T, k=1)
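A minimal end-to-end sketch of that call, assuming pagerank_matrix is a square scipy sparse matrix (the toy matrix below is made up purely for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import eigs

# Toy stand-in for pagerank_matrix; the real one comes out of pagerank_weighted.py.
pagerank_matrix = csr_matrix(np.array([
    [0.10, 0.60, 0.20, 0.10],
    [0.30, 0.10, 0.30, 0.30],
    [0.20, 0.20, 0.50, 0.10],
    [0.25, 0.25, 0.25, 0.25],
]))

# Principal *left* eigenvector of A = principal right eigenvector of A.T.
vals, vecs = eigs(pagerank_matrix.T, k=1, which='LM')

# ARPACK returns complex output; the dominant eigenvector of a stochastic
# matrix is real, so drop the imaginary part and normalise it to sum to 1.
principal = np.real(vecs[:, 0])
principal /= principal.sum()
print(principal)
```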
Cool! @erbas can you implement & test this thoroughly?
Some notes:
- pagerank_matrix should be symmetric, right? In that case, no transposing of pagerank_matrix is needed. I can't tell quickly if it's symmetric or not, looking at the build_adjacency_matrix code.
- pagerank_matrix is NOT sparse, so we'll have to be careful around shifting the non-sparse probability_matrix.
- If pagerank_matrix gets really large (many dimensions) in text summarization, tweaking the summarization algo to use a streamed solver may be preferable.
Hi @piskvorky!
The pagerank_matrix does not appear to be symmetric, but it is sparse, just packed efficiently as a csr_matrix. However, the bad news is that the eigenvector calculation is not the slow part of the code. These are timings on 10, 100, and 200 lines of text after making the code change :(
Keirans-Retina-MacBook-Pro:test keiran$ time python test_summarization.py tmp.10
2015-08-26 17:10:57,555 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2015-08-26 17:10:57,558 : INFO : built Dictionary(574 unique tokens: [u'jaish', u'celebr', u'dollar', u'penrith', u'signific']...) from 88 documents (total 1004 corpus positions)
---> '_build_corpus' 0.0046 sec
---> '_set_graph_edge_weights' 0.0749 sec
---> 'build_adjacency_matrix' 0.0203 sec
---> 'build_probability_matrix' 0.0000 sec
---> 'process_results' 0.0002 sec
---> 'pagerank_weighted' 0.0237 sec
---> 'summarize_corpus' 0.1129 sec
---> '_get_important_sentences' 0.0001 sec
---> '_extract_important_sentences' 0.0001 sec
---> '_format_results' 0.0000 sec
---> 'summarize' 0.1391 sec
real 0m1.213s
user 0m0.962s
sys 0m0.261s
Keirans-Retina-MacBook-Pro:test keiran$ time python test_summarization.py tmp.100
2015-08-26 17:11:00,922 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2015-08-26 17:11:00,946 : INFO : built Dictionary(2732 unique tokens: [u'haggl', u'four', u'protest', u'jihad', u'sleep']...) from 843 documents (total 9844 corpus positions)
---> '_build_corpus' 0.0422 sec
---> '_set_graph_edge_weights' 22.5353 sec
---> 'build_adjacency_matrix' 2.3152 sec
---> 'build_probability_matrix' 0.0007 sec
---> 'process_results' 0.0019 sec
---> 'pagerank_weighted' 2.3550 sec
---> 'summarize_corpus' 26.4495 sec
---> '_get_important_sentences' 0.0009 sec
---> '_extract_important_sentences' 0.0009 sec
---> '_format_results' 0.0000 sec
---> 'summarize' 26.6937 sec
real 0m27.792s
user 0m27.200s
sys 0m0.594s
Keirans-Retina-MacBook-Pro:test keiran$ time python test_summarization.py tmp.200
2015-08-26 17:13:46,021 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2015-08-26 17:13:46,079 : INFO : built Dictionary(3934 unique tokens: [u'haggl', u'unfurl', u'four', u'protest', u'jihad']...) from 1759 documents (total 20772 corpus positions)
---> '_build_corpus' 0.0963 sec
---> '_set_graph_edge_weights' 178.0298 sec
---> 'build_adjacency_matrix' 11.2156 sec
---> 'build_probability_matrix' 0.0053 sec
---> 'process_results' 0.0037 sec
---> 'pagerank_weighted' 11.3815 sec
---> 'summarize_corpus' 196.3542 sec
---> '_get_important_sentences' 0.0027 sec
---> '_extract_important_sentences' 0.0027 sec
---> '_format_results' 0.0001 sec
---> 'summarize' 196.8886 sec
Building the adjacency matrix and setting the edge weights grow quite nonlinearly. Looking at the code, there are quadratic loops in there: for i in length(nodes); for j in length(nodes).
Are there streaming graph implementations in other parts of gensim? I can think of a couple of pruning strategies for reducing the number of nodes, and streaming versions of pagerank have been developed. So we can probably afford to do a multi-pass, row-at-a-time algo.
best, Keiran
What makes you think it's sparse? I don't think that's true (eig wouldn't even know how to handle sparse input).
Thanks for the timings, that will be super useful during optimizations! Regardless, I'd change the code that computes all eigenvalues to then discard all but one... there's no reason to do that. But optimization efforts should aim at summarize_corpus, clearly.
Well, eigs is the sparse version of eig, but just insert a print pagerank_matrix.toarray() and look at the result. I saw lots of zeroes! Intuitively, not all words are related, so the graph should be less than fully connected.
Yes, maybe the content is sparse (though not even that, after subtracting the dense probability_matrix -- that's what I meant by "careful about the shift" -- it ruins the sparsity). The matrix format definitely isn't sparse though. Specifically, pagerank_matrix is not a CSR matrix, but rather a full (dense) numpy array.
Sorry, misread your comment on my phone. I bet scipy just unpacks it if it's sparse. Will check the source when I get off the bus. And I'll have a go at streaming pagerank.
Hmm, how about we optimize summarize_corpus instead? I mean, computing a single eigenvector is enough. According to your timings, summarize_corpus is where the bottleneck is.
I haven't inspected that function, but I assume it's better ROI than rewriting the algo to streaming pagerank :)
Sorry, you're right! I had looked earlier in the process when just the adjacency matrix was being constructed.
And apologies also for the misleading timings: summarize_corpus is the second-from-top-level function; the bottleneck is _set_graph_edge_weights, which it calls. By "streaming pagerank" I mean avoiding constructing the full graph in memory.
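For illustration, a minimal sketch of a power-iteration PageRank that only touches the graph through sparse matrix-vector products, so the dense probability shift never has to be materialised (the function name and parameters are made up, and it assumes a scipy sparse adjacency matrix as input):

```python
import numpy as np
from scipy.sparse import csr_matrix

def pagerank_power_iteration(adjacency, damping=0.85, n_iter=100, tol=1e-8):
    """Power-iteration PageRank over a sparse (weighted) adjacency matrix."""
    n = adjacency.shape[0]
    # Row-normalise the adjacency matrix into transition probabilities.
    out_degree = np.asarray(adjacency.sum(axis=1)).ravel()
    out_degree[out_degree == 0] = 1.0  # avoid division by zero on dangling nodes
    transition = csr_matrix(adjacency.multiply(1.0 / out_degree[:, None]))

    rank = np.full(n, 1.0 / n)
    teleport = np.full(n, (1.0 - damping) / n)
    for _ in range(n_iter):
        new_rank = teleport + damping * (transition.T @ rank)
        converged = np.abs(new_rank - rank).sum() < tol
        rank = new_rank
        if converged:
            break
    return rank
```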
Ah, OK, then we're on again.
I'd have to study the source to make more qualified comments, but I suspect that if the algorithm is fundamentally quadratic (as @olavurmortensen said), then all we can hope for is improving the constant factors, by speeding up a function here or there.
Both _set_graph_edge_weights and build_adjacency_matrix are quadratic, but the first is much slower, probably due to _bm25_weights. The BM25 calculations are applied there, iterating through every word of every document. I think that should be the first place to start optimizing.
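For context, a naive sketch of this kind of pairwise BM25 weighting (a generic BM25 formula with made-up constants, not necessarily gensim's exact _bm25_weights implementation) -- the nested loops over sentence pairs and query words are exactly what makes it quadratic:

```python
import math

K1, B = 1.5, 0.75  # common BM25 defaults; the real code may use different values

def bm25_pairwise_weights(corpus):
    """Pairwise BM25 weights over tokenised sentences (corpus = list of token lists)."""
    n = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / float(n)

    # Document frequencies and a (non-negative) idf variant.
    df = {}
    for doc in corpus:
        for word in set(doc):
            df[word] = df.get(word, 0) + 1
    idf = {w: math.log(1.0 + (n - f + 0.5) / (f + 0.5)) for w, f in df.items()}

    freqs = [{w: doc.count(w) for w in set(doc)} for doc in corpus]
    weights = [[0.0] * n for _ in range(n)]
    for i, query in enumerate(corpus):        # quadratic loop over sentence pairs...
        for j, doc in enumerate(corpus):
            score = 0.0
            for w in query:                   # ...times every word of the query sentence
                f = freqs[j].get(w, 0)
                if f:
                    score += idf[w] * f * (K1 + 1) / (f + K1 * (1 - B + B * len(doc) / avgdl))
            weights[i][j] = score
    return weights
```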
So by skipping small weights in _set_edge_weights you can get a factor-of-4 speedup (a sketch of what I mean follows the timings below). I didn't see anything obvious in _bm25_weights; would need to think more about the calculation:
Keirans-Retina-MacBook-Pro:test keiran$ time python test_summarization.py tmp.200
2015-08-27 16:04:11,032 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2015-08-27 16:04:11,082 : INFO : built Dictionary(3934 unique tokens: [u'haggl', u'unfurl', u'four', u'protest', u'jihad']...) from 1759 documents (total 20772 corpus positions)
---> '_build_corpus' 0.0880 sec
---> 'build_graph' 0.0030 sec
---> 'initialize' 0.0213 sec
---> '__init__' 0.0219 sec
---> 'get_bm25_weights' 6.0773 sec
---> '_set_graph_edge_weights' 13.7971 sec
---> 'remove_unreachable_nodes' 0.7418 sec
---> 'build_adjacency_matrix' 7.2672 sec
---> 'build_probability_matrix' 0.0052 sec
---> 'process_results' 0.0044 sec
---> 'pagerank_weighted' 7.4623 sec
---> 'summarize_corpus' 23.5241 sec
---> '_get_important_sentences' 0.0030 sec
---> '_get_sentences_with_word_count' 0.0000 sec
---> '_extract_important_sentences' 0.0030 sec
---> '_format_results' 0.0000 sec
---> 'summarize' 24.0451 sec
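For concreteness, a hypothetical sketch of the small-weight skipping mentioned above; the threshold value and the graph.add_edge call are assumptions for illustration, not the actual gensim API:

```python
WEIGHT_THRESHOLD = 1e-3  # arbitrary cut-off, purely for illustration

def set_sparse_edge_weights(graph, sentences, weights, threshold=WEIGHT_THRESHOLD):
    # Variant of _set_graph_edge_weights that only adds an edge when the
    # BM25 weight is non-negligible, keeping the sentence graph sparse.
    for i, sentence_1 in enumerate(sentences):
        for j, sentence_2 in enumerate(sentences):
            if i != j and weights[i][j] > threshold:
                graph.add_edge((sentence_1, sentence_2), weights[i][j])
```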
Hey @piskvorky, I don't seem to have permission to submit a pull request.
I don't think you need any permissions to submit pull requests to public repositories :)
What exactly was the error, and where did it appear?
Sorry, n00b git error. All submitted now. I didn't bump the version number though.
@piskvorky I've written keyword and summarisation modules which use cosine similarity of word vectors rather than bm25 weights. Results look pretty good. What's the gensim commit/test/deploy model for functions that require a set of trained vectors? I've just been using the Levy and Goldberg training vectors https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/ and hosted here http://u.cs.biu.ac.il/~yogo/data/syntemb/bow5.words.bz2
There is no functionality in gensim that depends on some pre-trained models now.
Maybe the closest are 3rd party lib wrappers (Vowpal Wabbit, Mallet etc). There we write unit tests which only execute if a certain env var is set (MALLET_HOME, pointing to some external executable).
If the data files are large, I'm -1 on committing them to git. We'll have to ask users to supply the path to such files as runtime parameters (or optional env vars in unit tests).
I hope I understood your question correctly :)
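A minimal sketch of that env-var-gated test pattern; the GENSIM_TEST_WORD_VECTORS variable name is made up for illustration:

```python
import os
import unittest

# Hypothetical env var, following the MALLET_HOME pattern described above.
VECTORS_PATH = os.environ.get('GENSIM_TEST_WORD_VECTORS')

@unittest.skipUnless(VECTORS_PATH, 'GENSIM_TEST_WORD_VECTORS not set; skipping')
class TestVectorSummarization(unittest.TestCase):
    def test_vectors_file_available(self):
        # A real test would load the vectors and run the new summariser;
        # this placeholder only checks that the user-supplied file exists.
        self.assertTrue(os.path.isfile(VECTORS_PATH))

if __name__ == '__main__':
    unittest.main()
```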
@erbas as you can see in the article we wrote, we already ran several tests with the cosine similarity distance. It does not work as well as BM25 in automatic summarization tasks, but it may be good enough here. Also, you get a symmetric matrix, and therefore an undirected graph can be used.
@fedelopez77 Thanks for the link, I can't believe I hadn't seen that already. But if I'm reading it correctly, you didn't test cosine similarity with word2vec vectors? My limited experiments show that for short samples, like reviews or chat messages, the deeply learned word vector embeddings give much more useful keywords and extractive summaries than the BM25/tf-idf approach. From my brief perusal of the literature, I wonder if the ROUGE tests are not well suited to this kind of text? Intuitively I would expect this to be the case, because pre-trained embeddings bring extra information about the language to bear on a problem where you have only small amounts of data.
Note that for summarisation I'm using the average of the word vectors in a sentence as the vector for that sentence, which is the graph node, and taking the cosine similarity between these average vectors as the edge weight. I don't see a problem with the graph being undirected, but I am a relative newbie to text analytics.
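For concreteness, a minimal sketch of the averaged-word-vector scheme described above, assuming word_vectors is a plain dict mapping words to numpy arrays (e.g. the Levy & Goldberg embeddings loaded into a dict):

```python
import numpy as np

def sentence_similarity_matrix(sentences, word_vectors):
    """Cosine similarities between sentence embeddings built by averaging word vectors.

    sentences    -- list of token lists (one per sentence / graph node)
    word_vectors -- dict mapping word -> 1-d numpy array
    """
    dim = len(next(iter(word_vectors.values())))
    embeddings = []
    for tokens in sentences:
        vecs = [word_vectors[w] for w in tokens if w in word_vectors]
        embeddings.append(np.mean(vecs, axis=0) if vecs else np.zeros(dim))
    embeddings = np.vstack(embeddings)

    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    unit = embeddings / norms
    return unit @ unit.T  # symmetric matrix -> undirected graph edge weights
```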
@piskvorky yes you did and thank you, that makes sense, just wanted to make sure we were on the same page. Will commit accordingly.
@erbas we used the cosine similarity without word2vec. We modeled each sentence as an n-dimensional vector, with n the number of distinct words in the text; the components of the vectors are the results of the tf-idf function applied to the text. ROUGE metrics compare the resulting summary with "golden" summaries made by humans; that score is how "good" your summary is. It should not matter whether you are applying a pre-trained method or not. As I said, we did not run tests with word2vec, but it could be very interesting to check those results. Anyway, remember that one of the important attributes of TextRank is that it does not need pre-trained data, so if you get better results, I think that trade-off should be weighed before adding this implementation.
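For reference, a minimal sketch of that tf-idf cosine setup using gensim's own building blocks (the helper function itself is made up for illustration):

```python
from gensim import corpora, models, similarities

def tfidf_sentence_similarities(tokenised_sentences):
    """Pairwise cosine similarities between tf-idf sentence vectors."""
    dictionary = corpora.Dictionary(tokenised_sentences)
    bow = [dictionary.doc2bow(tokens) for tokens in tokenised_sentences]
    tfidf = models.TfidfModel(bow)
    index = similarities.MatrixSimilarity(tfidf[bow], num_features=len(dictionary))
    return index[tfidf[bow]]  # symmetric (n x n) similarity matrix
```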
Implemented in https://github.com/piskvorky/gensim/pull/441
The implementation of PageRank calls eig, which calculates all the eigenvalues and eigenvectors. Unless I'm missing something, this line https://github.com/piskvorky/gensim/blob/develop/gensim/summarization/pagerank_weighted.py#L24
vals, vecs = eig(pagerank_matrix, left=True, right=False)
could be replaced with the sparse solver that only calculates the largest eigenvalue and corresponding eigenvector:
vals, vecs = eigs(pagerank_matrix, k=1)
Was there an issue using this sparse function? I notice the pagerank_matrix is already in sparse form.