Open: Manamama opened this issue 5 months ago
Hi, thank you for the issue. If NLTK's support is not good enough, maybe it would be better to add the support you are suggesting to NLTK itself. WDYT?
I have seen that you already have stemmers in your code for Slovak, Greek, etc. It would be better to add Polish there instead.
(BTW, I know next to nothing about this architecture; I have just been hacking here...)
Problem:
The sumy module uses the nltk package for stemming and stop words, but nltk does not support e.g. the Polish language out of the box.
Solution:
Stop words:
Download the Polish stop words file from e.g. here, rename it to polish.txt, and place it in the sumy stop words directory (~/.local/lib/python3.10/site-packages/sumy/data/stopwords/polish.txt).
Stemming:
Use the pystempel package, which provides a stemmer for the Polish language. Here’s the code:
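A minimal sketch of such a helper, assuming pystempel exposes StempelStemmer.default() with a .stem(word) method and that sumy's own Stemmer is importable from sumy.nlp.stemmers. The fallback to the unchanged word when no stem comes back is an extra safety net added here, not something sumy is known to require:

```python
class CallableStemmer:
    """Adapt an object with a .stem(word) method to the plain callable sumy expects."""

    def __init__(self, stemmer):
        self._stemmer = stemmer

    def __call__(self, word):
        stem = self._stemmer.stem(word)
        # Assumption: the wrapped stemmer may return None or "" for some words,
        # so fall back to the unchanged word in that case.
        return stem if stem else word


def get_stemmer(language):
    """Return a Stempel-backed stemmer for Polish, sumy's own Stemmer otherwise."""
    if language.lower() == "polish":
        # Imported lazily so pystempel is only needed when Polish is requested.
        from stempel import StempelStemmer  # pip install pystempel
        return CallableStemmer(StempelStemmer.default())
    # Imported lazily to keep this sketch importable without sumy installed.
    from sumy.nlp.stemmers import Stemmer
    return Stemmer(language)
```

With both packages installed, get_stemmer("polish") gives back an object that can be called on a single word, the same way sumy calls its built-in stemmers.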
Then in this section, in the handle_arguments function, replace the line where the stemmer is created with a call to get_stemmer:
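The change itself is a single line. The sketch below uses a heavily trimmed, hypothetical version of handle_arguments and a stand-in get_stemmer so that it runs on its own; the real function in sumy does much more, and the assumption that the original line reads stemmer = Stemmer(language) may not hold for every sumy version:

```python
def get_stemmer(language):
    """Stand-in for the get_stemmer helper; here it just returns a no-op stemmer."""
    return lambda word: word


def handle_arguments(args):
    """Hypothetical, trimmed-down handle_arguments showing only the stemmer line."""
    language = args["--language"]
    # Assumed original line: stemmer = Stemmer(language)
    stemmer = get_stemmer(language)
    return stemmer


stemmer = handle_arguments({"--language": "polish"})
```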
This way, if the language is Polish, get_stemmer will return a CallableStemmer that wraps a StempelStemmer. For any other language, it will return the original Stemmer.
Credit for most of the code:
MS Copilot aka Bing