piskvorky / sim-shootout

Code for "Performance shootout between nearest-neighbour libraries": http://radimrehurek.com/2013/11/performance-shootout-of-nearest-neighbours-intro
MIT License

UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 5: invalid continuation byte #4

Open angelo337 opened 8 years ago

angelo337 commented 8 years ago

Hi there, I am trying to run your code, but I get this error every time and I am not sure how to solve it. Could you please help me out? This is the output:

2016-06-20 21:49:27,802 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
/webdav/storage/wikipedia/title_tokens.txt.gz
Traceback (most recent call last):
  File "./prepare_shootout.py", line 158, in <module>
    corpus = ShootoutCorpus(gensim.utils.smart_open(preprocessed_file))
  File "/usr/local/lib/python2.7/dist-packages/gensim/corpora/textcorpus.py", line 61, in __init__
    self.dictionary.add_documents(self.get_texts())
  File "/usr/local/lib/python2.7/dist-packages/gensim/corpora/dictionary.py", line 127, in add_documents
    self.doc2bow(document, allow_update=True) # ignore the result, here we only care about updating token ids
  File "/usr/local/lib/python2.7/dist-packages/gensim/corpora/dictionary.py", line 154, in doc2bow
    counter[w if isinstance(w, unicode) else unicode(w, 'utf-8')] += 1
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 5: invalid continuation byte
gzip: /webdav/storage/wikipedia/lsi_vectors.mm: No such file or directory

THANKS

piskvorky commented 8 years ago

Looks like your title_tokens.txt.gz file contains invalid UTF-8. Can you check?

That is, ignore gensim and any machine learning. Just iterate through the gzip file and check that every line is valid UTF-8.
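
A minimal sketch of that check, assuming plain Python with only the standard gzip module. The function name find_invalid_utf8 and the max_report parameter are made up for illustration and are not part of sim-shootout or gensim:

```python
import gzip


def find_invalid_utf8(path, max_report=10):
    """Return a list of (line_number, raw_bytes) for lines that are not valid UTF-8.

    Hypothetical helper: reads the gzip file as raw bytes and tries to
    decode each line on its own, so one bad line does not hide the rest.
    Stops after max_report offending lines.
    """
    bad = []
    with gzip.open(path, "rb") as fin:
        for lineno, raw in enumerate(fin, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError:
                bad.append((lineno, raw))
                if len(bad) >= max_report:
                    break
    return bad
```

Calling find_invalid_utf8("/webdav/storage/wikipedia/title_tokens.txt.gz") and printing repr() of each returned line should show which tokens carry the stray 0xc3 byte. An empty list would mean the file itself decodes cleanly and the problem lies elsewhere.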

angelo337 commented 8 years ago

Thanks a lot. How can I fix that?

I am trying to test it on English and also on Spanish with accents (like á, é, í, ú, ó). I have to keep all the accents in Spanish, because a word with an accent is not the same as the word without it.

Examples:

si - if; sí - yes
mi - my; mí - me
el - the; él - he
tu - your; tú - you

thanks

