liorshk / wordembedding-hebrew

The code behind the blog post: https://www.oreilly.com/learning/capturing-semantic-meanings-using-deep-learning

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128) #2

Closed johnyboyoh closed 7 years ago

johnyboyoh commented 7 years ago

Hi, thank you for sharing this very useful code. Unfortunately, I ran into the following error while trying to create the corpus for the Hebrew Wikipedia (dump taken from https://dumps.wikimedia.org/hewiki/latest/hewiki-latest-pages-articles.xml.bz2):

$ sudo python create_corpus.py
/usr/local/lib/python2.7/dist-packages/gensim/utils.py:1015: UserWarning: Pattern library is not installed, lemmatization won't be available.
  warnings.warn("Pattern library is not installed, lemmatization won't be available.")
Starting to create wiki corpus
Traceback (most recent call last):
  File "create_corpus.py", line 15, in <module>
    output.write(article + "\n")
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)

I got the same error in both of the following environments: Ubuntu 16.04 LTS with Python 2.7.12 (native), and Ubuntu 14.04 LTS with Python 2.7.6 on VirtualBox 5.1.8 (Windows 10).

liorshk commented 7 years ago

Thanks for the feedback. Please confirm that it's fixed.

johnyboyoh commented 7 years ago

Thanks. I believe this fixed the issue.
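
For anyone hitting the same traceback: the thread does not show the actual change made to create_corpus.py, so the following is only a minimal sketch of one common way to avoid this error, not the repository's fix. In Python 2, writing a unicode string to a file opened with the built-in open() triggers an implicit encode with the ascii codec, which fails on Hebrew text. Opening the file with io.open and an explicit UTF-8 encoding avoids that. The function name write_articles, the sample strings, and the output file name are hypothetical, and the sketch assumes each article is already a unicode string.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import io
import os
import tempfile


def write_articles(articles, path):
    """Write one article per line, encoding explicitly as UTF-8.

    Hypothetical helper: assumes every item in `articles` is a unicode string.
    """
    # io.open behaves the same on Python 2 and 3 and encodes text on write,
    # so no implicit ascii conversion takes place.
    with io.open(path, "w", encoding="utf-8") as output:
        for article in articles:
            output.write(article + "\n")


# Hypothetical sample data standing in for Hebrew Wikipedia articles.
sample_articles = ["שלום עולם", "ויקיפדיה העברית"]
out_path = os.path.join(tempfile.mkdtemp(), "wiki.he.text")
write_articles(sample_articles, out_path)

# Read the file back with the same encoding to check the round trip.
with io.open(out_path, "r", encoding="utf-8") as f:
    lines = f.read().splitlines()
```

If the articles are byte strings rather than unicode, they would need to be decoded first (for example with article.decode("utf-8")) before being written this way.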