makcedward / nlp

:memo: This repository recorded my NLP journey.
https://makcedward.github.io/

InferSent error (help needed) #5

Open happypanda5 opened 5 years ago

happypanda5 commented 5 years ago

Hi, I am getting an error while generating InferSent embeddings. The error is shown below, and the full traceback is at the end of this message.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x86 in position 11: invalid start byte

The error occurs when I run infer_sent_embs.build_vocab(x_train, tokenize=True).

Note that I ran your code in Google Colab. Also, the InferSent download links in the Python file infersent.py have expired and need to be updated.
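A byte such as 0x86 this early in the input may mean that the file being decoded is not plain UTF-8 text at all (for example, an archive that was not extracted, or a binary file passed as the word-vector path). That is an assumption, as is the idea that build_vocab reads the word-vector file as UTF-8 text; nothing in this thread confirms either. As a rough diagnostic sketch, the helper below (a hypothetical name, not part of InferSent or this repository) reports the byte offset of the first invalid UTF-8 sequence in a file:

```python
import codecs


def first_invalid_utf8_offset(path, chunk_size=1 << 20):
    """Return the byte offset of the first invalid UTF-8 sequence in the
    file at `path`, or None if the whole file decodes cleanly.

    Hypothetical diagnostic helper; not part of InferSent.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    consumed = 0  # number of bytes read from the file before this chunk
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            # Bytes of an incomplete multi-byte character carried over
            # from the previous chunk.
            pending = len(decoder.getstate()[0])
            try:
                # An empty chunk means end of file, so flush the decoder.
                decoder.decode(chunk, final=not chunk)
            except UnicodeDecodeError as exc:
                return consumed - pending + exc.start
            if not chunk:
                return None
            consumed += len(chunk)


# Example call (the path is an assumption; use whatever file was passed
# to the model as its word-vector path):
# print(first_invalid_utf8_offset("GloVe/glove.840B.300d.txt"))
```

If the function returns a small offset, it is worth checking that the path points to the extracted text file and not to a compressed or partially downloaded one.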

The new links are:

INFERSENT_GLOVE_MODEL_URL = 'https://dl.fbaipublicfiles.com/infersent/infersent1.pkl'
INFERSENT_FASTTEXT_MODEL_URL = 'https://dl.fbaipublicfiles.com/infersent/infersent2.pkl'
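Until infersent.py is updated, the model files could be fetched by hand. The following is only a sketch using the standard library: the two constants are the URLs quoted above, while the helper names and the `encoder` target directory are assumptions made for illustration and are not taken from the repository.

```python
import os
import urllib.request

INFERSENT_GLOVE_MODEL_URL = 'https://dl.fbaipublicfiles.com/infersent/infersent1.pkl'
INFERSENT_FASTTEXT_MODEL_URL = 'https://dl.fbaipublicfiles.com/infersent/infersent2.pkl'


def local_model_path(url, dest_dir="encoder"):
    """Map a model URL to a local file path (hypothetical layout)."""
    return os.path.join(dest_dir, os.path.basename(url))


def download_if_missing(url, dest_dir="encoder"):
    """Download `url` into `dest_dir` unless the file already exists."""
    path = local_model_path(url, dest_dir)
    if not os.path.exists(path):
        os.makedirs(dest_dir, exist_ok=True)
        urllib.request.urlretrieve(url, path)
    return path


# Example call (performs a network download on first use):
# model_path = download_if_missing(INFERSENT_GLOVE_MODEL_URL)
```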

`
UnicodeDecodeError Traceback (most recent call last)
in ()
----> 1 infer_sent_embs.build_vocab(x_train, tokenize=True)
      2 x_train_t = infer_sent_embs.encode(x_train, tokenize=True)
      3 x_test_t = infer_sent_embs.encode(x_test, tokenize=True)

3 frames
/usr/lib/python3.6/codecs.py in decode(self, input, final)
    319         # decode input (taking the buffer into account)
    320         data = self.buffer + input
--> 321         (result, consumed) = self._buffer_decode(data, self.errors, final)
    322         # keep undecoded input until the next call
    323         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x86 in position 11: invalid start byte
`

happypanda5 commented 5 years ago

Please help @makcedward