Open angelo337 opened 8 years ago
Looks like your title_tokens.txt.gz file contains invalid UTF-8 -- can you check?
That is, ignore gensim and any machine learning. Just iterate through the gzip file and check that every line is valid UTF-8.
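A minimal sketch of that check, assuming Python 3 and a gzip file with one document per line. The helper name `find_invalid_utf8_lines` and the `limit` parameter are made up for illustration; they are not part of gensim or sim-shootout.

```python
import gzip


def find_invalid_utf8_lines(path, limit=10):
    """Return up to `limit` tuples (line_number, raw_bytes, error)
    for lines in the gzip file at `path` that are not valid UTF-8."""
    bad = []
    # Open in binary mode so that no decoding happens behind our back.
    with gzip.open(path, "rb") as fin:
        for lineno, raw in enumerate(fin, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as err:
                bad.append((lineno, raw, err))
                if len(bad) >= limit:
                    break
    return bad
```

Calling `find_invalid_utf8_lines("/webdav/storage/wikipedia/title_tokens.txt.gz")` and printing `repr(raw)` for each hit should show which lines carry the offending bytes, without involving gensim at all.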
Thanks a lot, how can I fix that?
I am trying to test it on English and also on Spanish with accents (like á, é, í, ú, ó). I have to keep all the accents in Spanish, because a word with an accent is not the same as the word without it.
Example:
si - if / sí - yes
mi - my / mí - me
el - the / él - he
tu - your / tú - you
thanks
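For what it is worth, valid UTF-8 keeps these accents intact: each accented vowel is stored as two bytes, the first of which is 0xc3. The sketch below (Python 3 assumed; `utf8_error` is a hypothetical helper name) shows the round trip, and also one way the "invalid continuation byte" error can arise, namely a two-byte character that was cut in half. That the file was damaged this way is only an assumption; the thread does not say how the bad bytes got there.

```python
def utf8_error(raw):
    """Return None if `raw` is valid UTF-8, else (byte_offset, reason)."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, err.reason
    return None


pairs = [("si", "sí"), ("mi", "mí"), ("el", "él"), ("tu", "tú")]
for plain, accented in pairs:
    raw = accented.encode("utf-8")
    # The round trip preserves the accent, so the two words stay distinct.
    assert raw.decode("utf-8") == accented
    assert accented != plain
    print(plain, accented, raw)

# Assumed failure mode: the second byte of "í" is lost, so 0xc3 is
# followed by a space instead of a continuation byte.
truncated = "sí".encode("utf-8")[:-1] + b" tu"
print(utf8_error(truncated))
```

So the accents themselves are not the problem; the file only needs to be written as well-formed UTF-8 from start to end.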
(Sent by email in reply to: Radim Rehurek, Tuesday, June 21, 2016 7:41 AM, Re: [piskvorky/sim-shootout] UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 5: invalid continuation byte (#4))
Hi there, I am trying to run your code, but I get this error every time and I am not sure how to solve it. Could you please help me out? This is the output:
2016-06-20 21:49:27,802 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
/webdav/storage/wikipedia/title_tokens.txt.gz
Traceback (most recent call last):
  File "./prepare_shootout.py", line 158, in <module>
    corpus = ShootoutCorpus(gensim.utils.smart_open(preprocessed_file))
  File "/usr/local/lib/python2.7/dist-packages/gensim/corpora/textcorpus.py", line 61, in __init__
    self.dictionary.add_documents(self.get_texts())
  File "/usr/local/lib/python2.7/dist-packages/gensim/corpora/dictionary.py", line 127, in add_documents
    self.doc2bow(document, allow_update=True) # ignore the result, here we only care about updating token ids
  File "/usr/local/lib/python2.7/dist-packages/gensim/corpora/dictionary.py", line 154, in doc2bow
    counter[w if isinstance(w, unicode) else unicode(w, 'utf-8')] += 1
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 5: invalid continuation byte
gzip: /webdav/storage/wikipedia/lsi_vectors.mm: No such file or directory
THANKS