piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Error while importing old wiki-dump (2010) with `WikiCorpus` #2046

Open Faruman opened 6 years ago

Faruman commented 6 years ago

Description

When I try to load the following Wikipedia dump from 2010 (Link) with `WikiCorpus`, I get an error (see below).

Code

from gensim.corpora import WikiCorpus, MmCorpus
import gensim
import pattern
import pickle
import logging
import os.path
import sys

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s", ' '.join(sys.argv))

    wiki_corpus = WikiCorpus('enwiki-20100312-pages-articles.xml.bz2', lemmatize=True)
    print('corpus loaded')

Expected Results

The Wikipedia corpus is processed without errors.

Actual Results

C:\Users\fabiansvenkarst\Documents\BA\Wiki_py27\venv\Scripts\python.exe C:/Users/fabiansvenkarst/Documents/BA/Wiki_py27/Wiki_corpus_Verarbeitung.py
C:\Users\fabiansvenkarst\Documents\BA\Wiki_py27\venv\lib\site-packages\gensim\utils.py:1197: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
2018-05-12 04:38:51,516 : INFO : running C:/Users/fabiansvenkarst/Documents/BA/Wiki_py27/Wiki_corpus_Verarbeitung.py
Traceback (most recent call last):
  File "C:/Users/fabiansvenkarst/Documents/BA/Wiki_py27/Wiki_corpus_Verarbeitung.py", line 20, in <module>
    wiki_corpus = WikiCorpus('enwiki-20100312-pages-articles.xml.bz2', lemmatize=True)
  File "C:\Users\fabiansvenkarst\Documents\BA\Wiki_py27\venv\lib\site-packages\gensim\corpora\wikicorpus.py", line 552, in __init__
    self.dictionary = dictionary or Dictionary(self.get_texts())
  File "C:\Users\fabiansvenkarst\Documents\BA\Wiki_py27\venv\lib\site-packages\gensim\corpora\dictionary.py", line 79, in __init__
    self.add_documents(documents, prune_at=prune_at)
  File "C:\Users\fabiansvenkarst\Documents\BA\Wiki_py27\venv\lib\site-packages\gensim\corpora\dictionary.py", line 187, in add_documents
    for docno, document in enumerate(documents):
  File "C:\Users\fabiansvenkarst\Documents\BA\Wiki_py27\venv\lib\site-packages\gensim\corpora\wikicorpus.py", line 587, in get_texts
    for group in utils.chunkize(texts, chunksize=10 * self.processes, maxsize=1):
  File "C:\Users\fabiansvenkarst\Documents\BA\Wiki_py27\venv\lib\site-packages\gensim\utils.py", line 1219, in chunkize
    for chunk in chunkize_serial(corpus, chunksize, as_numpy=as_numpy):
  File "C:\Users\fabiansvenkarst\Documents\BA\Wiki_py27\venv\lib\site-packages\gensim\utils.py", line 1153, in chunkize_serial
    wrapped_chunk = [list(itertools.islice(it, int(chunksize)))]
  File "C:\Users\fabiansvenkarst\Documents\BA\Wiki_py27\venv\lib\site-packages\gensim\corpora\wikicorpus.py", line 579, in <genexpr>
    ((text, self.lemmatize, title, pageid, tokenization_params)
  File "C:\Users\fabiansvenkarst\Documents\BA\Wiki_py27\venv\lib\site-packages\gensim\corpora\wikicorpus.py", line 370, in extract_pages
    ns = elem.find(ns_path).text
AttributeError: 'NoneType' object has no attribute 'text'
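
The last frame shows that `elem.find(ns_path)` returned `None`, meaning a `<page>` element had no `<ns>` child when the namespace was looked up. One plausible explanation, not confirmed in this thread, is that the 2010 export format predates the `<ns>` tag. Below is a minimal sketch of a lookup that tolerates the missing element. It is not gensim's actual code: `page_namespace`, `iter_pages` and the sample XML are hypothetical, and the XML namespace prefix that real dumps carry on every tag is left out for brevity.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature dump: the first page has no <ns> child (as the
# traceback suggests for the 2010 dump), the second one has it.
SAMPLE_DUMP = b"""<mediawiki>
  <page><title>Anarchism</title><id>12</id>
    <revision><text>Article body</text></revision></page>
  <page><title>Talk:Anarchism</title><id>13</id><ns>1</ns>
    <revision><text>Discussion</text></revision></page>
</mediawiki>"""


def page_namespace(page, default="0"):
    """Return the text of the page's <ns> child, or `default` if it is missing."""
    ns_elem = page.find("ns")
    if ns_elem is None or ns_elem.text is None:
        return default
    return ns_elem.text


def iter_pages(stream, filter_namespaces=("0",)):
    """Yield (title, text, pageid) for pages whose namespace is wanted."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != "page":
            continue
        if page_namespace(elem) in filter_namespaces:
            yield (
                elem.find("title").text,
                elem.find("revision/text").text,
                elem.find("id").text,
            )
        elem.clear()  # release the processed page to keep memory use flat


pages = list(iter_pages(io.BytesIO(SAMPLE_DUMP)))
print(pages)
```

Treating a missing `<ns>` as namespace "0" is an assumption. In a dump without the tag, non-article pages such as `Talk:...` would pass this filter and would have to be excluded by their title prefix instead.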

Versions

('Python', '2.7.11 (v2.7.11:6d1b6a68f775, Dec 5 2015, 20:40:30) [MSC v.1500 64 bit (AMD64)]')
('NumPy', '1.14.3')
('SciPy', '1.1.0')
('gensim', '3.4.0')
('FAST_VERSION', 0)

menshikh-iv commented 6 years ago

@Faruman thanks for the report, reproduced with python2.7 and gensim==3.5.0

piskvorky commented 5 years ago

@Faruman can you reproduce this with newer dumps? Or why do you need the 2010 Wiki dump? It's pretty old.