miso-belica / sumy

Module for automatic summarization of text documents and HTML pages.
https://miso-belica.github.io/sumy/
Apache License 2.0

how to remove sentences from ODM #185

Closed fredzannarbor closed 1 year ago

fredzannarbor commented 1 year ago

Hi,

I want to preprocess certain tokenized sentences before submitting them to the summarizer. For example, I would like to be able to remove any sentence that contains five consecutive periods (these are often 'noisy' ToC lines).

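from sumy.nlp.tokenizers import Tokenizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.utils import get_stop_words

parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizer = LexRankSummarizer()
summarizer.stop_words = get_stop_words("english")
summary = summarizer(parser.document, sentences_count)
summary_text = '\n'.join([str(sentence) for sentence in summary])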

So I want to insert something like this pseudocode before the "summarizer" line:

for s in parser.document.sentences:
   if s.str.contains("...."):
        s.remove()

Of course, this doesn't work because the ODM is not iterable. So how do I iterate through the components of the document and remove or edit them as I see fit?

miso-belica commented 1 year ago

Hello, the DOM is just an object consisting of paragraphs and sentences. You can filter sentences out and create a new document if you want.

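from sumy.models.dom import ObjectDocumentModel, Paragraph

paragraphs = []
for p in parser.document.paragraphs:
    # str has no .contains(); use the "in" operator, then wrap the kept sentences in a Paragraph
    kept = [s for s in p.sentences if "...." not in str(s)]
    paragraphs.append(Paragraph(kept))

dom = ObjectDocumentModel(paragraphs)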

You may have to handle the edge case where all sentences are removed from a paragraph. But maybe even empty paragraphs will work.
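
One way to sidestep the empty-paragraph question is to skip any paragraph that ends up with no sentences. The following is only a sketch: it assumes nothing beyond each paragraph exposing a `sentences` attribute, as in the snippet above, and the names `drop_noisy_sentences` and `min_dots` are hypothetical, not part of sumy. It uses a regular expression so that "five or more consecutive periods" is matched explicitly.

```python
import re


def drop_noisy_sentences(paragraphs, min_dots=5):
    """Return one list of kept sentences per paragraph.

    A sentence is dropped when its text contains at least `min_dots`
    consecutive periods. Paragraphs left with no sentences are skipped.
    """
    noisy = re.compile(r"\.{%d,}" % min_dots)
    kept_paragraphs = []
    for paragraph in paragraphs:
        kept = [s for s in paragraph.sentences if not noisy.search(str(s))]
        if kept:  # skip paragraphs that became empty
            kept_paragraphs.append(kept)
    return kept_paragraphs
```

Each returned list would presumably still need to be wrapped in a sumy Paragraph before being handed to ObjectDocumentModel, since the document model reads `sentences` from every element it is given.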

fredzannarbor commented 1 year ago

Thank you. I did not understand how to reconstitute the dom from the constituents.

fredzannarbor commented 1 year ago

OK, one more obstacle.

    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = LexRankSummarizer()
    summarizer.stop_words = get_stop_words("english")
    print(len(parser.document.sentences))
    paragraphs = []
    drops = []
    #print(len(parser.document.paragraphs))
    for paragraph in parser.document.paragraphs:
        #print(len(paragraph.sentences))
        for sentence in paragraph.sentences:

            if "......" in str(sentence):
                drops.append(sentence)
            else:
                paragraphs.append(sentence)

    print(len(drops), len(paragraphs))
    dom = ObjectDocumentModel(paragraphs)
    print(len(dom.paragraphs))

    summary = summarizer(dom, sentences_count)

The extra code is just to make sure that the filter is dropping the problem sentences, and the keeps & drops add up correctly. But when I try to summarize the filtered dom, it throws an error.

1746
9 1737
1737
Traceback (most recent call last):
  File "app/utilities/text2sumy_summarize.py", line 53, in <module>
    result = sumy_summarize(text, sentences_count=args.sentences_count)
  File "app/utilities/text2sumy_summarize.py", line 32, in sumy_summarize
    summary = summarizer(dom, sentences_count)
  File "/Users/fred/.virtualenvs/pycharmed-unity/lib/python3.8/site-packages/sumy/summarizers/lex_rank.py", line 36, in __call__
    sentences_words = [self._to_words_set(s) for s in document.sentences]
  File "/Users/fred/.virtualenvs/pycharmed-unity/lib/python3.8/site-packages/sumy/utils.py", line 53, in decorator
    setattr(self, key, getter(self))
  File "/Users/fred/.virtualenvs/pycharmed-unity/lib/python3.8/site-packages/sumy/models/dom/_document.py", line 23, in sentences
    return tuple(chain(*sentences))
  File "/Users/fred/.virtualenvs/pycharmed-unity/lib/python3.8/site-packages/sumy/models/dom/_document.py", line 22, in <genexpr>
    sentences = (p.sentences for p in self._paragraphs)
AttributeError: 'Sentence' object has no attribute 'sentences'

miso-belica commented 1 year ago

The bug is on this line: paragraphs.append(sentence) 😉
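
The traceback suggests why: ObjectDocumentModel reads `.sentences` from each element it receives, so it needs paragraph objects rather than a flat list of Sentence objects. A possible fix, sketched under assumptions, is to collect the kept sentences per paragraph and wrap each group before building the document. The names `split_keep_drop` and `build_filtered_dom` are hypothetical helpers, and the sketch assumes that `Paragraph` from `sumy.models.dom` accepts an iterable of Sentence objects.

```python
def split_keep_drop(paragraphs, marker="......"):
    """Split sentences into kept (grouped per paragraph) and dropped."""
    kept_by_paragraph, drops = [], []
    for paragraph in paragraphs:
        kept = []
        for sentence in paragraph.sentences:
            if marker in str(sentence):
                drops.append(sentence)
            else:
                kept.append(sentence)
        kept_by_paragraph.append(kept)
    return kept_by_paragraph, drops


def build_filtered_dom(document, marker="......"):
    """Rebuild a sumy document without the sentences containing `marker`."""
    # Imported here so split_keep_drop stays usable without sumy installed.
    from sumy.models.dom import ObjectDocumentModel, Paragraph

    kept_by_paragraph, drops = split_keep_drop(document.paragraphs, marker)
    # Wrap each non-empty group so every element exposes `.sentences`.
    paragraphs = [Paragraph(kept) for kept in kept_by_paragraph if kept]
    return ObjectDocumentModel(paragraphs), drops
```

With this in place, the call in the snippet above would become something like `dom, drops = build_filtered_dom(parser.document)` followed by `summary = summarizer(dom, sentences_count)`.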