I have some large text files that contain bytes which are not valid UTF-8. I want to ignore those characters and proceed with the Sent2Vec conversion, but I get the error below. Please help me fix this.
  File "kfold1.py", line 34, in <module>
    model = Sent2Vec(LineSentence(sent_file), model_file=input_file + '.model')
  File "/Users/ypochampally/Documents/RESEARCH/workspace_latest/Wiki_Actors/src/om_TextClassification/word2vec.py", line 800, in __init__
    self.reset_sent_vec(sentences)
  File "/Users/ypochampally/Documents/RESEARCH/workspace_latest/Wiki_Actors/src/om_TextClassification/word2vec.py", line 809, in reset_sent_vec
    for sent in sentences:
  File "/Users/ypochampally/Documents/RESEARCH/workspace_latest/Wiki_Actors/src/om_TextClassification/word2vec.py", line 1113, in __iter__
    yield utils.to_unicode(line).split()
  File "/Users/ypochampally/Documents/RESEARCH/workspace_latest/Wiki_Actors/src/om_TextClassification/utils.py", line 190, in any2unicode
    return unicode(text, encoding, errors=errors)
  File "/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 566: invalid start byte
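
For reference, here is a minimal sketch of the behaviour I am after: read the file line by line and silently drop any byte that is not valid UTF-8 instead of raising. The class name `TolerantLineSentence` is hypothetical (it is not part of the library), and the sketch assumes that `Sent2Vec` only needs an iterable of token lists, which is what the `for sent in sentences:` line in the traceback suggests. It uses `io.open`, which accepts `encoding` and `errors` on both Python 2.7 and Python 3.

```python
import io


class TolerantLineSentence(object):
    """Hypothetical stand-in for LineSentence that skips undecodable bytes."""

    def __init__(self, path, encoding="utf-8"):
        self.path = path
        self.encoding = encoding

    def __iter__(self):
        # errors="ignore" drops any byte sequence that is invalid in the
        # chosen encoding, so a stray 0xff no longer raises UnicodeDecodeError.
        with io.open(self.path, "r", encoding=self.encoding, errors="ignore") as handle:
            for line in handle:
                # Same whitespace tokenisation as the failing line in the traceback.
                yield line.split()
```

If that assumption holds, it would be passed in place of the original reader, e.g. `Sent2Vec(TolerantLineSentence(sent_file), model_file=input_file + '.model')`. Judging from the `unicode(text, encoding, errors=errors)` line in the traceback, `any2unicode` also appears to accept an `errors` argument, so passing `errors='ignore'` at the `utils.to_unicode(line)` call might be an alternative, though I have not verified that. If the files are really in another encoding such as Latin-1, passing that encoding instead would keep the characters rather than discard them.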