Hey, first of all, thanks for the great library!
I just encountered a bug with the `SumBasicSummarizer`, where it seems that the method looks up the document frequency of a stemmed word. However, the actual `word_freq_in_doc` dictionary only stores the frequencies for unstemmed words.
I believe the culprit is that content words are normalized differently in `_get_content_words_in_sentence()` and in `_get_all_content_words_in_doc()`: the former performs stemming, whereas the latter does not.
I would have proposed a PR myself, but I don't know which is the "more correct" fix (IMO, consistent stemming should be the way to go?).
FWIW, I used this with German texts, although capitalization etc. seem to be no issue here.
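To illustrate the mismatch described above, here is a minimal, self-contained sketch. It is not the library's actual code: `toy_stem`, `normalize`, and the sample sentences are hypothetical stand-ins, and the toy stemmer only strips a trailing "en" or "e". It only shows why a frequency table keyed by unstemmed words cannot be queried with stemmed words, and that stemming on both sides makes the lookups succeed.

```python
from collections import Counter


def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: strips a trailing "en" or "e".
    for suffix in ("en", "e"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def normalize(word, stem):
    # Lowercase always; stem only when requested.
    word = word.lower()
    return toy_stem(word) if stem else word


# Two tokenized toy sentences (assumed input, not taken from the report).
doc = [["Katzen", "jagen", "Mäuse"], ["Katzen", "schlafen"]]

# Document-level frequencies built WITHOUT stemming (the reported behavior).
word_freq_in_doc = Counter(normalize(w, stem=False) for s in doc for w in s)

# Sentence-level content words built WITH stemming.
sentence_words = [normalize(w, stem=True) for w in doc[0]]

# Stemmed words that have no entry in the unstemmed frequency table.
missing = [w for w in sentence_words if w not in word_freq_in_doc]

# One possible fix: stem on both sides, so the keys match.
consistent_freq = Counter(normalize(w, stem=True) for s in doc for w in s)
missing_after_fix = [w for w in sentence_words if w not in consistent_freq]

print(missing)
print(missing_after_fix)
```

With the inconsistent normalization, every stemmed word from the first sentence misses the table; with stemming applied in both places, none do. Skipping stemming in both places would also make the keys match, which is why I am unsure which fix is preferred.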