C:\Users\fabiansvenkarst\Documents\BA\Wiki_py27\venv\Scripts\python.exe C:/Users/fabiansvenkarst/Documents/BA/Wiki_py27/Wiki_corpus_Verarbeitung.py
C:\Users\fabiansvenkarst\Documents\BA\Wiki_py27\venv\lib\site-packages\gensim\utils.py:1197: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
2018-05-12 04:38:51,516 : INFO : running C:/Users/fabiansvenkarst/Documents/BA/Wiki_py27/Wiki_corpus_Verarbeitung.py
Traceback (most recent call last):
File "C:/Users/fabiansvenkarst/Documents/BA/Wiki_py27/Wiki_corpus_Verarbeitung.py", line 20, in <module>
wiki_corpus = WikiCorpus('enwiki-20100312-pages-articles.xml.bz2', lemmatize=True)
File "C:\Users\fabiansvenkarst\Documents\BA\Wiki_py27\venv\lib\site-packages\gensim\corpora\wikicorpus.py", line 552, in __init__
self.dictionary = dictionary or Dictionary(self.get_texts())
File "C:\Users\fabiansvenkarst\Documents\BA\Wiki_py27\venv\lib\site-packages\gensim\corpora\dictionary.py", line 79, in __init__
self.add_documents(documents, prune_at=prune_at)
File "C:\Users\fabiansvenkarst\Documents\BA\Wiki_py27\venv\lib\site-packages\gensim\corpora\dictionary.py", line 187, in add_documents
for docno, document in enumerate(documents):
File "C:\Users\fabiansvenkarst\Documents\BA\Wiki_py27\venv\lib\site-packages\gensim\corpora\wikicorpus.py", line 587, in get_texts
for group in utils.chunkize(texts, chunksize=10 * self.processes, maxsize=1):
File "C:\Users\fabiansvenkarst\Documents\BA\Wiki_py27\venv\lib\site-packages\gensim\utils.py", line 1219, in chunkize
for chunk in chunkize_serial(corpus, chunksize, as_numpy=as_numpy):
File "C:\Users\fabiansvenkarst\Documents\BA\Wiki_py27\venv\lib\site-packages\gensim\utils.py", line 1153, in chunkize_serial
wrapped_chunk = [list(itertools.islice(it, int(chunksize)))]
File "C:\Users\fabiansvenkarst\Documents\BA\Wiki_py27\venv\lib\site-packages\gensim\corpora\wikicorpus.py", line 579, in <genexpr>
((text, self.lemmatize, title, pageid, tokenization_params)
File "C:\Users\fabiansvenkarst\Documents\BA\Wiki_py27\venv\lib\site-packages\gensim\corpora\wikicorpus.py", line 370, in extract_pages
ns = elem.find(ns_path).text
AttributeError: 'NoneType' object has no attribute 'text'
Versions
('Python', '2.7.11 (v2.7.11:6d1b6a68f775, Dec 5 2015, 20:40:30) [MSC v.1500 64 bit (AMD64)]')
('NumPy', '1.14.3')
('SciPy', '1.1.0')
('gensim', '3.4.0')
('FAST_VERSION', 0)
Description
When I try to load the following Wikipedia dump from 2010 (Link) with WikiCorpus, I get an error (see the traceback).
Code
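The script itself is not reproduced here; according to the traceback, the failing call is `WikiCorpus('enwiki-20100312-pages-articles.xml.bz2', lemmatize=True)`. The error arises at `elem.find(ns_path).text`, which means `find` returned `None` for a page, i.e. that page has no namespace element. A minimal sketch of this failure mode, using only the standard library, might look like the following. The page snippets and the helper names `page_namespace` and `page_namespace_unguarded` are hypothetical and not part of gensim; that the 2010 dump omits `<ns>` is an assumption drawn from the traceback.

```python
import xml.etree.ElementTree as ET

# Hypothetical page snippets for illustration only (not taken from the dump).
PAGE_WITHOUT_NS = "<page><title>Example</title><id>1</id></page>"
PAGE_WITH_NS = "<page><title>Example</title><ns>0</ns><id>1</id></page>"


def page_namespace_unguarded(page_xml):
    """Mirror the failing pattern: call .text directly on the result of find()."""
    elem = ET.fromstring(page_xml)
    # find() returns None when there is no <ns> child, so .text raises AttributeError.
    return elem.find("ns").text


def page_namespace(page_xml, default=None):
    """Return the text of <ns>, or `default` when the element is absent."""
    elem = ET.fromstring(page_xml)
    ns_elem = elem.find("ns")
    if ns_elem is None:
        return default
    return ns_elem.text


if __name__ == "__main__":
    print(page_namespace(PAGE_WITH_NS))
    print(page_namespace(PAGE_WITHOUT_NS))
    try:
        page_namespace_unguarded(PAGE_WITHOUT_NS)
    except AttributeError as err:
        print("AttributeError:", err)
```

The guarded variant only shows one way to tolerate a missing element; whether a page without `<ns>` should be kept or skipped is a separate design decision.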
Expected Results
The Wikipedia corpus is processed without errors.
Actual Results