miso-belica / sumy

Module for automatic summarization of text documents and HTML pages.
https://miso-belica.github.io/sumy/
Apache License 2.0

`KeyError` in `SumBasicSummarizer` #165

Closed (dennlinger closed this 2 years ago)

dennlinger commented 2 years ago

Hey, first of all, thanks for the great library!

I just encountered a bug in `SumBasicSummarizer`: the method looks up the document frequency of a stemmed word, but the `word_freq_in_doc` dictionary only stores frequencies for unstemmed words, so the lookup raises a `KeyError`.

I believe the culprit is that content words are normalized differently in `_get_content_words_in_sentence()` and `_get_all_content_words_in_doc()`: the former performs stemming, whereas the latter does not.

I would have proposed a PR myself, but I don't know which is the "more correct" fix (IMO, consistent stemming should be the way to go?).
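To make the mismatch concrete, here is a minimal, self-contained sketch. It is not sumy's actual code: `toy_stem`, `normalize`, `word_frequencies`, and `sentence_score` are hypothetical helpers written only for illustration, and the "stemmer" just strips a trailing `s`. It reproduces the `KeyError` when the frequency table is built without stemming, and shows that applying the same normalization in both places avoids it.

```python
from collections import Counter


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a trailing 's'."""
    return word[:-1] if word.endswith("s") else word


def normalize(word, stem):
    """Lowercase the word and optionally stem it."""
    word = word.lower()
    return toy_stem(word) if stem else word


def word_frequencies(sentences, stem):
    """Relative frequency of every word in the document."""
    words = [normalize(w, stem) for s in sentences for w in s.split()]
    total = len(words)
    return {w: count / total for w, count in Counter(words).items()}


def sentence_score(sentence, word_freq, stem):
    """Average document frequency of the words in one sentence."""
    words = [normalize(w, stem) for w in sentence.split()]
    return sum(word_freq[w] for w in words) / len(words)


sentences = ["cats chase birds", "birds fly"]

# Mismatch: table built from unstemmed words, lookup done with stemmed words.
unstemmed_freq = word_frequencies(sentences, stem=False)
try:
    sentence_score(sentences[0], unstemmed_freq, stem=True)
    mismatch_error = None
except KeyError as exc:
    mismatch_error = exc  # the stemmed key "cat" is not in the table

# Consistent stemming on both sides: every lookup succeeds.
stemmed_freq = word_frequencies(sentences, stem=True)
score = sentence_score(sentences[0], stemmed_freq, stem=True)
```

Dropping stemming on both sides would avoid the error just as well; the sketch only shows that the two code paths have to agree.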

FWIW, I used this with German texts, although capitalization etc. does not seem to be an issue here.

miso-belica commented 2 years ago

Thank you for the report and the detailed analysis 🙂