miso-belica / sumy

Module for automatic summarization of text documents and HTML pages.
https://miso-belica.github.io/sumy/
Apache License 2.0
3.53k stars 530 forks

Unable to run sumy in Jupyter Notebook #217

Open azamsharpschool opened 3 months ago

azamsharpschool commented 3 months ago

I have been trying without success to get sumy to work in Jupyter Notebook, but it always throws an error from the Tokenizer.

Here is my Jupyter Notebook code:

!python -c "import nltk; nltk.download('stopwords')"

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer

text = "Your long text here..."
parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizer = LsaSummarizer()
summary = summarizer(parser.document, 3)  # Summarize to 3 sentences

for sentence in summary:
    print(sentence)

When I run this code, I get the following error:


UnpicklingError                           Traceback (most recent call last)
Cell In[22], line 6
      3 from sumy.summarizers.lsa import LsaSummarizer
      5 text = "Your long text here..."
----> 6 parser = PlaintextParser.from_string(text, Tokenizer("english"))
      7 summarizer = LsaSummarizer()
      8 summary = summarizer(parser.document, 3)  # Summarize to 3 sentences

File ~/Desktop/sample_project/env/lib/python3.10/site-packages/sumy/nlp/tokenizers.py:160, in Tokenizer.__init__(self, language)
    157 self._language = language
    159 tokenizer_language = self.LANGUAGE_ALIASES.get(language, language)
--> 160 self._sentence_tokenizer = self._get_sentence_tokenizer(tokenizer_language)
    161 self._word_tokenizer = self._get_word_tokenizer(tokenizer_language)

File ~/Desktop/sample_project/env/lib/python3.10/site-packages/sumy/nlp/tokenizers.py:172, in Tokenizer._get_sentence_tokenizer(self, language)
    170 try:
    171     path = to_string("tokenizers/punkt/%s.pickle") % to_string(language)
--> 172     return nltk.data.load(path)
    173 except (LookupError, zipfile.BadZipfile) as e:
    174     raise LookupError(
    175         "NLTK tokenizers are missing or the language is not supported.\n"
    176         """Download them by following command: python -c "import nltk; nltk.download('punkt')"\n"""
    177         "Original error was:\n" + str(e)
    178     )

What can I do to fix this issue?

devsdenepal commented 3 days ago

Yep, I'm also getting a module import error when using it in Jupyter Notebook (screenshot attached).

miso-belica commented 2 days ago

Hi, it may be related to this issue: https://github.com/miso-belica/sumy/issues/216. Try downloading the new punkt_tab resource.
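
For anyone hitting this: a minimal sketch of the suggested fix, assuming the error comes from missing NLTK tokenizer data. The traceback shows sumy's Tokenizer loading tokenizers/punkt/english.pickle via nltk.data.load; newer NLTK releases (3.9+) moved this data to punkt_tab, so downloading both should cover old and new NLTK versions:

```python
import nltk

# sumy's Tokenizer loads "tokenizers/punkt/<language>.pickle" via nltk.data.load.
# NLTK 3.9+ ships the sentence-tokenizer data as "punkt_tab" instead of "punkt",
# so downloading both covers either NLTK version. "stopwords" is what the
# original snippet downloaded; it is still needed by sumy's summarizers.
resources = ("punkt", "punkt_tab", "stopwords")
for resource in resources:
    nltk.download(resource, quiet=True)
```

After the download succeeds, restart the Jupyter kernel and re-run the original snippet so the Tokenizer picks up the freshly installed data.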