klb3713 / sentence2vec

Tools for mapping a sentence with arbitrary length to vector space

How to resolve "UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 566: invalid start byte" #14

Open Yahsaswi opened 7 years ago

Yahsaswi commented 7 years ago

I have some large text files that contain invalid characters. I want to ignore those characters and proceed with the sentence-to-vector conversion, but I see the error below. Please help me fix this.

  File "kfold1.py", line 34, in <module>
    model = Sent2Vec(LineSentence(sent_file), model_file=input_file + '.model')
  File "/Users/ypochampally/Documents/RESEARCH/workspace_latest/Wiki_Actors/src/om_TextClassification/word2vec.py", line 800, in __init__
    self.reset_sent_vec(sentences)
  File "/Users/ypochampally/Documents/RESEARCH/workspace_latest/Wiki_Actors/src/om_TextClassification/word2vec.py", line 809, in reset_sent_vec
    for sent in sentences:
  File "/Users/ypochampally/Documents/RESEARCH/workspace_latest/Wiki_Actors/src/om_TextClassification/word2vec.py", line 1113, in __iter__
    yield utils.to_unicode(line).split()
  File "/Users/ypochampally/Documents/RESEARCH/workspace_latest/Wiki_Actors/src/om_TextClassification/utils.py", line 190, in any2unicode
    return unicode(text, encoding, errors=errors)
  File "/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 566: invalid start byte
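
The traceback shows the line being decoded strictly as UTF-8, and byte 0xff can never start a valid UTF-8 sequence. One way to ignore such bytes is to decode each line leniently before it is split into tokens. The class below is a minimal sketch of that idea using only the standard library. `TolerantLineSentence` is a hypothetical name, not part of sentence2vec or gensim, and it assumes the input is mostly UTF-8 with a few stray bytes.

```python
import io


class TolerantLineSentence(object):
    """Iterate over a text file, yielding each line as a list of tokens.

    Hypothetical helper (not part of sentence2vec or gensim). Bytes that are
    not valid in the chosen encoding are handled according to `errors`:
    "ignore" drops them, "replace" substitutes U+FFFD.
    """

    def __init__(self, path, encoding="utf-8", errors="ignore"):
        self.path = path
        self.encoding = encoding
        self.errors = errors

    def __iter__(self):
        # Read raw bytes so that decoding happens here, under our control.
        with io.open(self.path, "rb") as handle:
            for raw_line in handle:
                text = raw_line.decode(self.encoding, errors=self.errors)
                yield text.split()
```

Under those assumptions it could be passed wherever the script passes `LineSentence(sent_file)`, for example `Sent2Vec(TolerantLineSentence(sent_file), model_file=input_file + '.model')`. Dropping bytes silently can hide a real encoding mismatch, so if the file is actually in another encoding (Latin-1 or UTF-16, say), passing that encoding is likely the better fix.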

sumehta commented 7 years ago

I get the same error when I try to load the model with: model = Word2Vec.load_word2vec_format('test.txt.model', binary=True)
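
One possible cause in this case, which is an assumption and not something confirmed in this thread, is that the file was not written in the word2vec format that `load_word2vec_format` expects. Files in that format, text or binary, begin with an ASCII header line holding two integers: the vocabulary size and the vector size. The sketch below checks only that header. `looks_like_word2vec_format` is a hypothetical helper name.

```python
def looks_like_word2vec_format(path):
    """Return True if the file starts with a '<vocab_size> <vector_size>' header.

    Hypothetical helper: it inspects only the first line and does not
    validate the rest of the file.
    """
    with open(path, "rb") as handle:
        # A genuine header is short, so cap the read at 100 bytes.
        header = handle.readline(100)
    parts = header.split()
    if len(parts) != 2:
        return False
    return all(part.isdigit() for part in parts)
```

If this returns False for 'test.txt.model', the file was probably saved some other way, for example with the model's own save method, and would need the matching load call instead.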