piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1
15.68k stars 4.38k forks source link

Error while summarizing text #805

Closed pedro-walter closed 7 years ago

pedro-walter commented 8 years ago

Hi,

I've received the following error when trying to summarize the body of this news article:

https://www.theguardian.com/media/2016/jun/19/sun-times-brexit-in-out-shake-it-all-about

The error follows:

File "/home/apps/comment_parser/venv/local/lib/python2.7/site-packages/gensim/summarization/summarizer.py", line 202, in summarize most_important_docs = summarize_corpus(corpus, ratio=ratio if word_count is None else 1) File "/home/apps/comment_parser/venv/local/lib/python2.7/site-packages/gensim/summarization/summarizer.py", line 161, in summarize_corpus pagerank_scores = _pagerank(graph) File "/home/apps/comment_parser/venv/local/lib/python2.7/site-packages/gensim/summarization/pagerank_weighted.py", line 24, in pagerank_weighted vals, vecs = eigs(pagerank_matrix.T, k=1) # TODO raise an error if matrix has complex eigenvectors? File "/usr/lib/python2.7/dist-packages/scipy/sparse/linalg/eigen/arpack/arpack.py", line 1271, in eigs ncv, v0, maxiter, which, tol) File "/usr/lib/python2.7/dist-packages/scipy/sparse/linalg/eigen/arpack/arpack.py", line 685, in init raise ValueError("k must be less than ndim(A)-1, k=%d" % k) ValueError: k must be less than ndim(A)-1, k=1

Regards,

madewild commented 8 years ago

I am also affected by this bug...

tmylk commented 8 years ago

@madewild could you post a code snippet to reproduce?

madewild commented 8 years ago

Unfortunately the text fed to the summarizer is confidential, but my guess is that the error was triggered by an unusually high repetition of some sentences... I also notice now that the error raised was not exactly the same: ValueError: k must be less than rank(A)-1, k=1

MridulS commented 8 years ago

@tmylk The reason for this failure looks like the number of nodes in the graph that is used to calculate the pagerank of the corpus graph, after removing unreachable nodes the graph is left with only 2 nodes and hence it builds a matrix of shape 2 * 2 (for which scipy.sparse.linalg.eigs() will fail for k=1). We should probably raise an error if number of nodes (after removing unreachable nodes) goes below 3.

tmylk commented 8 years ago

@MridulS could you submit a pr for this?

MridulS commented 8 years ago

@tmylk What kind of error should I raise?

tmylk commented 8 years ago

The error should say "Please add more sentences to the text. The number of reachable nodes is below 3"

anmolgulati commented 8 years ago

Hi I worked on this issue. I have sent out a pull request for the same. Please review.

vitordouzi commented 7 years ago

Hello, I think that the problem is still open. I replicated this error with the document 1403 from the Hulth2003 dataset): from gensim.summarization import keywords print(keywords('IT: Utilities A look at five utilities to make your PCs more, efficient, effective, and efficacious'))

Traceback (most recent call last): File "\<stdin>", line 1, in File "/gensim/summarization/keywords.py", line 214, in keywords pagerank_scores = _pagerank(graph) File "/gensim/summarization/pagerank_weighted.py", line 24, in pagerank_weighted vals, vecs = eigs(pagerank_matrix.T, k=1) # TODO raise an error if matrix has complex eigenvectors? File "/anaconda3/lib/python3.6/site-packages/scipy/sparse/linalg/eigen/arpack/arpack.py", line 1299, in eigs ncv, v0, maxiter, which, tol) File "/anaconda3/lib/python3.6/site-packages/scipy/sparse/linalg/eigen/arpack/arpack.py", line 692, in init raise ValueError("k must be less than ndim(A)-1, k=%d" % k) ValueError: k must be less than ndim(A)-1, k=1

Looking to the document, make sense say that the possible problems are the terms frequencies! All the terms have frequency equal 1.

piskvorky commented 7 years ago

Hmm, that's not good, looks like a bug.

Can you suggest a fix @vitordouzi ?

vitordouzi commented 7 years ago

@piskvorky, no, I don't! sorry! Maybe this TODO in the pagerank_weighted.py file can help.

File "/gensim/summarization/pagerank_weighted.py", line 24, in pagerank_weighted vals, vecs = eigs(pagerank_matrix.T, k=1) # TODO raise an error if matrix has complex eigenvectors?

What exactly are the complex eigenvectors?

xelez commented 7 years ago

Hello everyone. I started investigating this issue and basically, this is the same one as @MridulS described, but in different function:

The reason for this failure looks like the number of nodes in the graph that is used to calculate the pagerank of the corpus graph, after removing unreachable nodes the graph is left with only 2 nodes and hence it builds a matrix of shape 2 * 2 (for which scipy.sparse.linalg.eigs() will fail for k=1). We should probably raise an error if number of nodes (after removing unreachable nodes) goes below 3.

On text, given by @vitordouzi we end up with graph:

('effect', 'effici'), ('effici', 'effect'), ('effici', 'effici'), ('effect', 'effect')

which ends in 2x2 matrix and pagerank fails.

But I'm not sure how to fix this. @vitordouzi, @menshikh-iv any ideas on the desired outcome? An exception this time doesn't feel right. Maybe set some predefined scores instead of running pagerank? Or maybe add special case to pagerank?

xelez commented 7 years ago

Anyway, some notes about pagerank_weighted:

https://github.com/RaRe-Technologies/gensim/blob/9481915915bf61aa6e4e719a2f26d509677e6779/gensim/summarization/pagerank_weighted.py#L18-L26

  1. Matrix can have complex eigenvalues and eigenvector, but we a finding the largest eigenvalue, which is proven to be real. (because pagerank_matrix is positive and Perron–Frobenius theorem)
  2. Either assert about less than 3 vertices should be added, or, maybe, special handling of this case.
  3. It took me awhile to figure out why eigs() is used instead of eig() on a dense matrix. It was discussed in #441, #438. I'd like to add a comment about this.
menshikh-iv commented 7 years ago

About (1), (2) @xelez - need to handle special case, the comment from (3) should be useful too.