miso-belica / sumy

Module for automatic summarization of text documents and HTML pages.
https://miso-belica.github.io/sumy/
Apache License 2.0
3.46k stars, 525 forks

Tip: how to make it summarize mid-tail languages, e.g. Polish #204

Open. Manamama opened this issue 5 months ago

Manamama commented 5 months ago

Problem:

The sumy module uses the nltk package for stemming and ships its own stop words lists, but neither covers some languages, e.g. Polish, out of the box.

Solution:

Stop words:

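Download a Polish stop words file (e.g. from here), rename it to polish.txt, and place it in the sumy stop words directory. For a user-level install under Python 3.10 that is ~/.local/lib/python3.10/site-packages/sumy/data/stopwords/polish.txt; the path differs for other Python versions and install locations.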
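
To avoid hard-coding the site-packages path, the copy step could be scripted roughly as follows. This is only a sketch: the helper names stopwords_dir and install_stop_words are made up for illustration, and the data/stopwords layout is assumed from the path quoted above.

```python
import importlib.util
import shutil
from pathlib import Path


def stopwords_dir(package="sumy"):
    """Return <package>/data/stopwords for an installed package, or None."""
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None  # package is not installed (or is not a package)
    package_root = Path(list(spec.submodule_search_locations)[0])
    return package_root / "data" / "stopwords"


def install_stop_words(source_file, target_dir, language="polish"):
    """Copy a stop words file to <target_dir>/<language>.txt; return the new path."""
    target_dir = Path(target_dir)
    if not target_dir.is_dir():
        raise FileNotFoundError(f"No such stop words directory: {target_dir}")
    destination = target_dir / f"{language}.txt"
    shutil.copyfile(source_file, destination)
    return destination
```

Usage would be something like install_stop_words("downloaded_stopwords.txt", stopwords_dir()), after checking that stopwords_dir() did not return None. Note that a file copied into site-packages this way is likely to be lost when sumy is reinstalled or upgraded.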

Stemming:

Use the pystempel package, which provides a stemmer for the Polish language. Here’s the code:

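from stempel import StempelStemmer
from sumy.nlp.stemmers import Stemmer  # sumy's original stemmer, used as the fallback


class CallableStemmer:
    """Adapter that lets sumy call a StempelStemmer like a function."""

    def __init__(self, stemmer):
        self.stemmer = stemmer

    def __call__(self, word):
        return self.stemmer.stem(word)


def get_stemmer(language):
    # Accept 'polish' as well as 'pol': the stop words file above is named polish.txt
    if language.lower() in ('pol', 'polish'):
        # Create a StempelStemmer object for Polish
        stemmer_obj = StempelStemmer.default()
        # Wrap it in a CallableStemmer
        return CallableStemmer(stemmer_obj)
    else:
        # For non-Polish languages, use the original Stemmer
        return Stemmer(language)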
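
One possible refinement of the wrapper above: if the wrapped stemmer can return None or an empty string for words it does not know (an assumption here, not something checked against pystempel), every unknown word would collapse into the same "stem". A defensive variant might fall back to the lower-cased word. SafeCallableStemmer and FakeStemmer are hypothetical names, and FakeStemmer is only a stand-in so the sketch runs without pystempel installed.

```python
class SafeCallableStemmer:
    """Callable wrapper that never returns None or an empty stem."""

    def __init__(self, stemmer):
        self.stemmer = stemmer  # any object with a .stem(word) method

    def __call__(self, word):
        stem = self.stemmer.stem(word)
        # Fall back to the lower-cased word when no stem is produced.
        return stem if stem else word.lower()


class FakeStemmer:
    """Stand-in for a real stemmer; it knows a single word."""

    def stem(self, word):
        return {"książki": "książka"}.get(word)


stem = SafeCallableStemmer(FakeStemmer())
print(stem("książki"))  # known word: the stand-in's stem is returned
print(stem("Sumy"))     # unknown word: falls back to the lower-cased input
```

With a real stemmer, the same wrapper would be built as SafeCallableStemmer(StempelStemmer.default()).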

Then, in the handle_arguments function (get_stemmer must be defined or imported in the same module), replace the line that creates the stemmer with a call to get_stemmer:

def handle_arguments(args, default_input_stream=sys.stdin):
    # ... (other code) ...

    language = args["--language"]
    if args["--stopwords"]:
        stop_words = read_stop_words(args["--stopwords"])
    else:
        stop_words = get_stop_words(language)

    parser = parser(document_content, Tokenizer(language))
    stemmer = get_stemmer(language)  # was: stemmer = Stemmer(language)

    # ... (other code) ...

This way, if the language is Polish, get_stemmer will return a CallableStemmer that wraps a StempelStemmer. For any other language, it will return the original Stemmer.
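
If more languages need the same treatment, the if/else inside get_stemmer could be replaced by a small lookup table with a fallback. The sketch below is one way to do that, not sumy's own mechanism. LANGUAGE_ALIASES, normalize_language and make_stemmer_factory are hypothetical names, and the lambdas are stand-in stemmers so the example runs without sumy or pystempel. The alias table is there because the language may arrive as a short code such as 'pol' or as a full name such as 'polish' (the name used for the stop words file).

```python
from typing import Callable, Dict

StemFunction = Callable[[str], str]

# Hypothetical alias table: short codes mapped to the full language name.
LANGUAGE_ALIASES: Dict[str, str] = {"pl": "polish", "pol": "polish"}


def normalize_language(language: str) -> str:
    """Lower-case a language name and resolve known aliases."""
    key = language.strip().lower()
    return LANGUAGE_ALIASES.get(key, key)


def make_stemmer_factory(
    special: Dict[str, Callable[[], StemFunction]],
    fallback: Callable[[str], StemFunction],
) -> Callable[[str], StemFunction]:
    """Build a get_stemmer-style function from a table of special cases."""

    def get_stemmer(language: str) -> StemFunction:
        name = normalize_language(language)
        if name in special:
            return special[name]()  # build the special stemmer only when needed
        return fallback(name)       # in sumy this would be Stemmer(name)

    return get_stemmer


# Demo with stand-in stemmers (not real stemming rules).
get_stemmer = make_stemmer_factory(
    special={"polish": lambda: (lambda word: word.lower().rstrip("y"))},
    fallback=lambda name: (lambda word: word.lower()),
)
print(get_stemmer("pol")("Koty"))      # prints: kot
print(get_stemmer("english")("Cats"))  # prints: cats
```

In a real setup the "polish" entry would build the StempelStemmer wrapper and the fallback would call the original Stemmer.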

Credit for most of the code: MS Copilot aka Bing

miso-belica commented 5 months ago

Hi, thank you for the issue. If NLTK support is not good enough, maybe it would be better to add the support you are suggesting to NLTK itself. WDYT?

Manamama commented 5 months ago

I have seen that you have stemmers in your code for Slovak, Greek etc. We had better add Polish there, instead.

(BTW, I know next to nothing about such architecture, I have just been hacking here... )