I'm doing keyword extraction in Norwegian. Without Pattern installed, the extracted keywords include stop words. For example, if I extract keywords from the first paragraph on Albert Einstein in the Norwegian Wikipedia:
Albert Einstein var en tyskfødt teoretisk fysiker og nobelprisvinner som er mest kjent for å ha formulert relativitetsteorien og vist at masse og energi er ekvivalente ved masseenergiloven, E = mc2. Gjennom den spesielle relativitetsteorien revolusjonerte han mekanikken og presiserte tidsbegrepet. Han var sentral i utviklingen av kvantemekanikken og er grunnleggeren av moderne kosmologi. Han regnes for å være en av de mest betydningsfulle vitenskapsmenn i det 20. århundre.
I get the following keywords:
i
og
han
hans
av
for å
ble
om
einstein var en
ved
som er mest
relativitetsteorien
det
fysikk
med
den
verden
verdens
enn
vitenskapelige
århundre
århundrets
person
første årene
professor
i, og, av, for, å, ble, om, etc. are stop words, so the result is unusable.
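As a stopgap, the stop words can be filtered out of the returned keywords after extraction. The following is only a minimal sketch of that idea, not summa's own mechanism: `NORWEGIAN_STOP_WORDS` is a small hand-picked, incomplete set and `filter_keywords` is a hypothetical helper name.

```python
# Hypothetical, incomplete stop word set chosen by hand for illustration;
# a real list would need to be much longer.
NORWEGIAN_STOP_WORDS = frozenset({
    "i", "og", "han", "hans", "av", "for", "å", "ble", "om", "var",
    "en", "ved", "som", "er", "mest", "det", "med", "den", "enn",
})


def filter_keywords(keywords, stop_words=NORWEGIAN_STOP_WORDS):
    """Remove stop word tokens from each keyword phrase.

    Phrases that consist only of stop words are dropped, and duplicates
    produced by the cleanup are removed while preserving order.
    """
    cleaned = []
    seen = set()
    for phrase in keywords:
        # Keep only the tokens that are not stop words (case-insensitive).
        tokens = [t for t in phrase.split() if t.lower() not in stop_words]
        candidate = " ".join(tokens)
        if candidate and candidate not in seen:
            seen.add(candidate)
            cleaned.append(candidate)
    return cleaned


if __name__ == "__main__":
    raw = ["i", "og", "einstein var en", "for å",
           "relativitetsteorien", "første årene"]
    print(filter_keywords(raw))
```

Note that dropping a stop word from the middle of a phrase joins tokens that were not adjacent in the text, so this only masks the symptom; the stop words should really be removed before ranking.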
After installing Pattern, importing summa just fails with:
>>> from summa.summarizer import summarize
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\summa\__init__.py", line 1, in <module>
    from summa import commons, graph, keywords, pagerank_weighted, \
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\summa\keywords.py", line 5, in <module>
    from .preprocessing.textcleaner import clean_text_by_word as _clean_text_by_word
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\summa\preprocessing\textcleaner.py", line 8, in <module>
    from pattern.en import tag
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\pattern\text\en\__init__.py", line 61, in <module>
    from pattern.text.en.inflect import (
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\pattern\text\en\__init__.py", line 80, in <module>
    from pattern.text.en import wordnet
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\pattern\text\en\wordnet\__init__.py", line 57, in <module>
    nltk.data.find("corpora/" + token)
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\data.py", line 673, in find
    return find(modified_name, paths)
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\data.py", line 660, in find
    return ZipFilePathPointer(p, zipentry)
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\compat.py", line 228, in _decorator
    return init_func(*args, **kwargs)
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\data.py", line 506, in __init__
    zipfile = OpenOnDemandZipFile(os.path.abspath(zipfile))
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\compat.py", line 228, in _decorator
    return init_func(*args, **kwargs)
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\data.py", line 1055, in __init__
    zipfile.ZipFile.__init__(self, filename)
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\pattern\text\en\..\..\..\..\zipfile.py", line 1222, in __init__
    self._RealGetContents()
  File "C:\Users\E\AppData\Local\Programs\Python\Python37\lib\site-packages\pattern\text\en\..\..\..\..\zipfile.py", line 1289, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
So I cannot use Pattern (issue #30), which makes Norwegian effectively unusable and unsupported. I assume the same goes for other languages.
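The traceback ends in `zipfile.BadZipFile`, raised while NLTK opens a corpus archive, which may point to a corrupt or incomplete download in an NLTK data directory. That is a guess on my part, not a confirmed cause. A minimal sketch for checking it, where `find_bad_zips` is a hypothetical helper and the directories to scan are whatever your installation uses:

```python
import os
import zipfile


def find_bad_zips(directories):
    """Return paths of *.zip files that are not valid zip archives.

    Directories that do not exist are silently skipped.
    """
    bad = []
    for root_dir in directories:
        if not os.path.isdir(root_dir):
            continue
        for current_dir, _subdirs, files in os.walk(root_dir):
            for name in files:
                if not name.lower().endswith(".zip"):
                    continue
                path = os.path.join(current_dir, name)
                # is_zipfile checks the archive signature without extracting.
                if not zipfile.is_zipfile(path):
                    bad.append(path)
    return sorted(bad)
```

With NLTK installed, `find_bad_zips(nltk.data.path)` should list any broken archives in the search path; deleting them and downloading the corpus again (for example with `nltk.download("wordnet")`) might get past this error, though I have not verified that it resolves the Pattern import.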