Hi, thank you for submitting this very useful code. Unfortunately, I ran into the following error while trying to create the corpus for the Hebrew Wikipedia (dump taken from https://dumps.wikimedia.org/hewiki/latest/hewiki-latest-pages-articles.xml.bz2):
$ sudo python create_corpus.py
/usr/local/lib/python2.7/dist-packages/gensim/utils.py:1015: UserWarning: Pattern library is not installed, lemmatization won't be available.
  warnings.warn("Pattern library is not installed, lemmatization won't be available.")
Starting to create wiki corpus
Traceback (most recent call last):
  File "create_corpus.py", line 15, in <module>
    output.write(article + "\n")
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
I got the same error in the following two environments:
Ubuntu 16.04 LTS, Python 2.7.12 (native)
Ubuntu 14.04 LTS, Python 2.7.6 on VirtualBox 5.1.8 (Windows 10)
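
For reference, here is a minimal sketch of what I suspect is going on and one possible workaround. It assumes that `article` is a unicode string and that `output` was opened with the built-in `open()`, so Python 2 falls back to the ASCII codec when writing. I have not confirmed this against the actual script. The helper `write_articles`, the sample text, and the file name `wiki_he.txt` are hypothetical and only serve the illustration.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def write_articles(articles, path):
    """Write unicode articles to `path`, one per line, encoded as UTF-8.

    io.open() takes an explicit encoding on both Python 2 and Python 3,
    so the write no longer depends on the default ASCII codec.
    """
    with io.open(path, "w", encoding="utf-8") as output:
        for article in articles:
            output.write(article + u"\n")


# Hypothetical sample: a short Hebrew string standing in for an article.
sample = [u"\u05e9\u05dc\u05d5\u05dd \u05e2\u05d5\u05dc\u05dd"]

# Encoding the same text as ASCII fails, which is the kind of error
# reported in the traceback.
try:
    sample[0].encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Writing through io.open() with UTF-8 succeeds and round-trips.
out_path = os.path.join(tempfile.mkdtemp(), "wiki_he.txt")
write_articles(sample, out_path)

with io.open(out_path, encoding="utf-8") as f:
    restored = f.read()
```

If the script has to keep using the built-in `open()` on Python 2, calling `article.encode("utf-8")` before the write would presumably have the same effect, under the same assumption that `article` is unicode.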