summanlp / textrank

TextRank implementation for Python 3.
https://pypi.org/project/summa/
MIT License
1.25k stars, 260 forks

Norwegian is not supported #65

Open emilmuller opened 5 years ago

emilmuller commented 5 years ago

I'm doing keyword extraction in Norwegian. Without Pattern installed, the extracted keywords include stop words. For example, if I extract the keywords from the first paragraph on Albert Einstein in the Norwegian Wikipedia:

Albert Einstein var en tyskfødt teoretisk fysiker og nobelprisvinner som er mest kjent for å ha formulert relativitetsteorien og vist at masse og energi er ekvivalente ved masseenergiloven, E = mc2. Gjennom den spesielle relativitetsteorien revolusjonerte han mekanikken og presiserte tidsbegrepet. Han var sentral i utviklingen av kvantemekanikken og er grunnleggeren av moderne kosmologi. Han regnes for å være en av de mest betydningsfulle vitenskapsmenn i det 20. århundre.

I get the following keywords:

Words such as i, og, av, for, å, ble, om, etc. are stop words, so the result is unusable.
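As a stopgap, the extracted keywords can be post-filtered against a stop word list. The following is a minimal sketch, not part of summa: the function name `drop_stopword_keywords`, the constant `NORWEGIAN_STOPWORDS`, and the sample keyword list are all hypothetical, and the stop word set is a small illustrative subset rather than a complete Norwegian list.

```python
# Illustrative subset of Norwegian stop words (assumption: NOT a complete list).
NORWEGIAN_STOPWORDS = {
    "i", "og", "av", "for", "å", "ble", "om", "en", "er", "som",
    "at", "den", "det", "de", "han", "var", "ved",
}


def drop_stopword_keywords(keywords, stopwords=NORWEGIAN_STOPWORDS):
    """Return the keywords that are not made up entirely of stop words.

    A multi-word keyword is kept as long as at least one of its
    tokens is not a stop word. Matching is case-insensitive.
    """
    kept = []
    for keyword in keywords:
        tokens = keyword.lower().split()
        if tokens and not all(token in stopwords for token in tokens):
            kept.append(keyword)
    return kept


# Hypothetical extractor output, mixing content words and stop words.
sample = ["relativitetsteorien", "og", "av", "I", "kvantemekanikken", "for å"]
filtered = drop_stopword_keywords(sample)
print(filtered)
```

Filtering after extraction only hides the stop words; they still take part in the ranking, so the remaining scores may differ from what a proper stop word list applied during preprocessing would give.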

After installing Pattern, importing summa fails with:

>>> from summa.summarizer import summarize
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\summa\__init__.py", line 1, in <module>
    from summa import commons, graph, keywords, pagerank_weighted, \
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\summa\keywords.py", line 5, in <module>
    from .preprocessing.textcleaner import clean_text_by_word as _clean_text_by_word
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\summa\preprocessing\textcleaner.py", line 8, in <module>
    from pattern.en import tag
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\pattern\text\en\__init__.py", line 61, in <module>
    from pattern.text.en.inflect import (
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\pattern\text\en\__init__.py", line 80, in <module>
    from pattern.text.en import wordnet
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\pattern\text\en\wordnet\__init__.py", line 57, in <module>
    nltk.data.find("corpora/" + token)
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\data.py", line 673, in find
    return find(modified_name, paths)
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\data.py", line 660, in find
    return ZipFilePathPointer(p, zipentry)
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\compat.py", line 228, in _decorator
    return init_func(*args, **kwargs)
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\data.py", line 506, in __init__
    zipfile = OpenOnDemandZipFile(os.path.abspath(zipfile))
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\compat.py", line 228, in _decorator
    return init_func(*args, **kwargs)
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\data.py", line 1055, in __init__
    zipfile.ZipFile.__init__(self, filename)
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\pattern\text\en\..\..\..\..\zipfile.py", line 1222, in __init__
    self._RealGetContents()
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\pattern\text\en\..\..\..\..\zipfile.py", line 1289, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
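
The traceback ends in `zipfile` raising `BadZipFile` while NLTK opens a corpus archive, which suggests (but does not prove) that a `.zip` file in an NLTK data directory is damaged or incomplete. The sketch below uses only the standard library to list such files; the helper name `find_invalid_zips` is hypothetical, and which directory to scan on a given machine is an assumption that has to be checked locally.

```python
import tempfile
import zipfile
from pathlib import Path


def find_invalid_zips(root):
    """Return sorted paths of *.zip files under root that are not valid zip archives."""
    root = Path(root)
    return sorted(
        path for path in root.rglob("*.zip")
        if not zipfile.is_zipfile(str(path))
    )


# Self-contained demonstration on a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.zip"
    with zipfile.ZipFile(good, "w") as archive:
        archive.writestr("hello.txt", "hello")

    # A file with a .zip extension whose content is not a zip archive.
    bad = Path(tmp) / "bad.zip"
    bad.write_bytes(b"this is not a zip archive")

    demo_result = [path.name for path in find_invalid_zips(tmp)]

print(demo_result)
```

Pointing `find_invalid_zips` at the local NLTK data directory would show whether any corpus archive there fails the same check; deleting and re-downloading a flagged file is one thing to try, though the source does not confirm that this resolves the import error.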

So I cannot use Pattern (see issue #30), which makes Norwegian unusable and effectively unsupported. I assume the same applies to other languages.