miso-belica / sumy

Module for automatic summarization of text documents and HTML pages.
https://miso-belica.github.io/sumy/
Apache License 2.0

Some questions about the textrank #87

Closed Zhujunnan closed 6 years ago

Zhujunnan commented 7 years ago

Hi, I have a question about the TextRank module. As far as I know, TextRank is based on the PageRank algorithm. However, in the text_rank.py file I only see code that builds edges between sentences; it doesn't seem to use an iterative solution to compute the ranks. I don't know if I understand this correctly, and I am looking forward to your answer. Thanks!
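For context, here is a minimal sketch of the edge-building step as the original TextRank paper (Mihalcea and Tarau, 2004) defines it: the weight between two sentences is the number of words they share, divided by the sum of the logs of their lengths. This is not the code in text_rank.py. The function names `sentence_similarity` and `build_edge_weights` are hypothetical, and sentences are assumed to be already tokenized into lists of words.

```python
import math


def sentence_similarity(words_a, words_b):
    """Overlap similarity from the TextRank paper (hypothetical helper).

    similarity = |shared words| / (log|A| + log|B|)
    """
    norm = math.log(len(words_a)) + math.log(len(words_b)) if words_a and words_b else 0.0
    if norm == 0.0:
        # Empty sentences, or two one-word sentences (log 1 + log 1 == 0).
        return 0.0
    shared = len(set(words_a) & set(words_b))
    return shared / norm


def build_edge_weights(sentences):
    """Return a symmetric matrix of edge weights with a zero diagonal."""
    n = len(sentences)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w = sentence_similarity(sentences[i], sentences[j])
            weights[i][j] = weights[j][i] = w
    return weights
```

Building these weights only gives the graph. On its own it does not rank anything, which is the point of the question above.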

kariminf commented 7 years ago

Hi, In this paper it is stated that TextRank iterates until it converges.
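To make "iterates until it converges" concrete, here is a rough sketch of a weighted PageRank power iteration over a sentence graph. It is an illustration under assumptions, not sumy's implementation: `rank_sentences`, `damping`, `tol` and `max_iter` are hypothetical names, the input is assumed to be a square matrix of non-negative edge weights with a zero diagonal, and the teleport term is normalized as (1 - d) / n, whereas the paper writes it as (1 - d).

```python
def rank_sentences(weights, damping=0.85, tol=1e-6, max_iter=100):
    """Weighted PageRank by power iteration (hypothetical sketch).

    weights[j][i] is the edge weight from sentence j to sentence i.
    """
    n = len(weights)
    if n == 0:
        return []
    # Total outgoing weight of every node, ignoring self-loops.
    out_sum = [
        sum(w for k, w in enumerate(row) if k != j)
        for j, row in enumerate(weights)
    ]
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on its score in proportion to the
            # share of its outgoing weight that points at i.
            incoming = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if j != i and out_sum[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        delta = sum(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if delta < tol:
            # Scores stopped changing: treat this as convergence.
            break
    return scores
```

A summarizer would then pick the sentences with the highest scores. The loop with a convergence check is the part the question in this thread is looking for.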

When I looked at the code, I had the same question.

Regards

miso-belica commented 7 years ago

Hi guys, this is a tough question for me. I thought I'd find time to look at the code more deeply before answering, but I guess I was just too naive. Honestly, I don't remember the source paper for the TextRank implementation, and what's worse, there is no URL to any paper in the class docstring, just a URL to some other repo. I found the commit https://github.com/miso-belica/sumy/commit/80f7dfa7ce3c7e9ce9319fa7c45c06af7bb3c4fa and it seems suspicious to me :/ When I was implementing sumy I read many papers, and it's possible I mixed them up somehow. Which means I have been lying about this method the whole time :( But because Python is quite high level, it may be that the iterative process is hidden somewhere in a high-level function. I really need to check the code before I can give you a real answer. These are only my guesses, written so that you get at least some reply. But you know, my time for sumy has been very limited for a long time, so...

kariminf commented 7 years ago

Hi, Even if it is not the method from that paper, it did a good job compared to the other methods when tested on the MultiLing2015 training corpus. Here are the results:

Peer                Rouge1-R  Rouge1-P  Rouge1-F  Rouge2-R  Rouge2-P  Rouge2-F
KLSummarizer        0.32745   0.34508   0.33585   0.06317   0.06666   0.06483
LexRankSummarizer   0.37926   0.39425   0.38621   0.09350   0.09712   0.09518
LsaSummarizer       0.34674   0.37187   0.35832   0.07678   0.08220   0.07929
LuhnSummarizer      0.35671   0.39231   0.37300   0.08575   0.09404   0.08954
RandomSummarizer    0.35968   0.37472   0.36676   0.07961   0.08262   0.08102
SumBasicSummarizer  0.36909   0.37383   0.37110   0.07683   0.07798   0.07732
TextRankSummarizer  0.37688   0.40179   0.38864   0.09852   0.10471   0.10145

So, thank you for this great work.

kmkurn commented 6 years ago

Hi, I'm looking at the TextRank code and yes, it seems that it really doesn't run PageRank as described in the original paper (Mihalcea and Tarau, 2004). I don't think the iterative process is hidden somewhere either. It may perform well, but claiming to be TextRank doesn't feel right to me.

I might be able to provide a fix. Would you be interested in a PR for this?

miso-belica commented 6 years ago

I am happy to inform you that @kmkurn provided a new implementation of TextRank, based on the original paper, in https://github.com/miso-belica/sumy/pull/100, so I am closing this issue. Feel free to create a new one if needed, or send a PR with any proposal :) Thank you all.