miso-belica / sumy

Module for automatic summarization of text documents and HTML pages.
https://miso-belica.github.io/sumy/
Apache License 2.0

Edmundson summarizer #21

Open nick-magnini opened 9 years ago

nick-magnini commented 9 years ago

Hi,

I checked the code for the Edmundson summarizer. As far as I can tell, it doesn't do anything for English. It is supposed to extract cue words, significant words, and title words, and then rank the sentences based on these scores and on sentence location. When the input is a raw text file, the summarizer works based only on the location of the sentence. Is that right? There is no method to extract the cue words, significant words, or title words from the text, so I suppose the implementation is wrong. Let me know if I misunderstood your code or am making a mistake. Thanks.
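For reference, here is a minimal sketch of the scoring described above: a weighted sum of cue, key (significant-word), title, and location scores. It is not sumy's implementation. The function names, the equal default weights, the "appears at least twice" rule for key words, and the first/last-sentence location rule are all assumptions made for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def edmundson_scores(sentences, title, bonus_words, stigma_words, null_words,
                     weights=(1.0, 1.0, 1.0, 1.0)):
    """Score each sentence as a weighted sum of four heuristics.

    Assumed (illustrative) definitions:
      cue      = bonus-word hits minus stigma-word hits
      key      = hits on words occurring at least twice in the document
      title    = hits on non-null words from the title
      location = 1 for the first and last sentence, otherwise 0
    """
    w_cue, w_key, w_title, w_loc = weights
    null_words = set(null_words)

    # Count non-null words over the whole document to find key words.
    freq = {}
    for sentence in sentences:
        for word in tokenize(sentence):
            if word not in null_words:
                freq[word] = freq.get(word, 0) + 1
    key_words = {word for word, count in freq.items() if count >= 2}
    title_words = set(tokenize(title)) - null_words

    last = len(sentences) - 1
    scores = []
    for index, sentence in enumerate(sentences):
        words = tokenize(sentence)
        cue = (sum(w in bonus_words for w in words)
               - sum(w in stigma_words for w in words))
        key = sum(w in key_words for w in words)
        ttl = sum(w in title_words for w in words)
        loc = 1 if index in (0, last) else 0
        scores.append(w_cue * cue + w_key * key + w_title * ttl + w_loc * loc)
    return scores
```

With empty `bonus_words` and `stigma_words` and no title, only the key-word and location terms can contribute, which matches the behaviour reported here for English input.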

nick-magnini commented 9 years ago

I realized that even the location method in Edmundson doesn't work when the input is a raw text document in one-sentence-per-line format.

miso-belica commented 9 years ago

Hi, I assume you mean some kind of "plain text" format, but I'm not sure I understand you. Can you give an example of the text? And what does "it doesn't do anything for English" mean? Does it mean the summarizer works correctly for other languages? What do you suggest? How do you think the summarizer should behave?

nick-magnini commented 9 years ago

Hi,

Well, it does produce output, but the output is not based on the Edmundson algorithm. The lists of cue words and significant words are non-English; they are defined in parser/parse.py:

    SIGNIFICANT_WORDS = (
        "významný",
        "vynikající",
        "podstatný",
        "význačný",
        "důležitý",
        "slavný",
        "zajímavý",
        "eminentní",
        "vlivný",
        "supr",
        "super",
        "nejlepší",
        "dobrý",
        "kvalitní",
        "optimální",
        "relevantní",
    )
    STIGMA_WORDS = (
        "nejhorší",
        "zlý",
        "šeredný",
    )

These are used from the main module:

    if summarizer_class is EdmundsonSummarizer:
        summarizer.null_words = stop_words
        summarizer.bonus_words = parser.significant_words
        summarizer.stigma_words = parser.stigma_words

So when the Edmundson summarizer is called for English, it will not find any significant/stigma words in English text. If the document is one sentence per line, the location method in edmundson_location.py will not give the correct output either. So the Edmundson method gets totally wrong inputs. Correct me if I'm wrong.

miso-belica commented 9 years ago

Yes, you are absolutely right. I totally forgot about it. I tested the summarizers with Czech texts and left it there. This should be fixed. Thanks a lot for this :)

But as far as I remember, there is no method for gathering stigma/bonus words from the text. They should be provided per language, the same way stop-words are.
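As a rough sketch of providing cue words per language, in the same spirit as stop-words, a lookup could look like the following. The dictionaries and the `get_cue_words` helper are hypothetical and not part of sumy. The Czech entries are taken from the lists quoted earlier in this thread, and the English entries are only illustrative approximate translations.

```python
# Hypothetical per-language cue-word tables (not part of sumy).
BONUS_WORDS = {
    "czech": frozenset({"významný", "důležitý", "nejlepší"}),
    # Illustrative English counterparts; a real list would need curation.
    "english": frozenset({"significant", "important", "best"}),
}
STIGMA_WORDS = {
    "czech": frozenset({"nejhorší", "zlý", "šeredný"}),
    "english": frozenset({"worst", "bad", "ugly"}),
}


def get_cue_words(language):
    """Return (bonus_words, stigma_words) for a language name.

    Raises LookupError for an unknown language instead of silently
    falling back to another language's words.
    """
    key = language.lower()
    if key not in BONUS_WORDS or key not in STIGMA_WORDS:
        raise LookupError(f"No cue words available for language {language!r}")
    return BONUS_WORDS[key], STIGMA_WORDS[key]
```

Failing loudly for an unsupported language avoids the situation in this issue, where Czech words were applied to English text without any warning.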

nick-magnini commented 9 years ago

OK, then we should think about it. Stigma/bonus words should be extracted from the text being summarized; a general list will not help. This can be done using various methods such as topic extraction, phrase extraction, ... We can work on it. I'll come back with some modules and points on that soon.
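To make the idea concrete, here is a very simple stand-in for extracting candidate bonus words from the document itself: plain frequency counting after stop-word removal. This is not topic or phrase extraction, only a baseline sketch. The function name `extract_significant_words` and the `top_n` parameter are assumptions for illustration.

```python
import re
from collections import Counter


def extract_significant_words(text, stop_words, top_n=5):
    """Return the top_n most frequent non-stop-words in the text.

    A frequency baseline only; topic or phrase extraction would
    replace the counting step with something more informative.
    """
    words = re.findall(r"\w+", text.lower())
    counts = Counter(word for word in words if word not in stop_words)
    return [word for word, _ in counts.most_common(top_n)]
```

The returned words could then be passed to the summarizer as bonus words in place of a fixed per-language list.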

nick-magnini commented 9 years ago

Also, regarding the location method, it should be fixed in edmundson_location.py.