Open alucard001 opened 6 years ago
I have the same error. I have tried many different fixes from Stack Overflow, but the error remains.
I ran into the same error and found this in a PyTorch tutorial. Open the file with encoding 'iso-8859-1':
with open('cornell movie-dialogs corpus/movie_lines.txt', 'r', encoding='iso-8859-1') as f:
    for line in f.readlines():
        sentences[line.split(' +++$+++ ')[0]] = line.split(' +++$+++ ')[-1].replace('\n', "")
Thanks @tomasn4a. It worked!
Hello, I am still getting this error:

UnicodeDecodeError                        Traceback (most recent call last)
in ()
----> 1 cleaned_questions, cleaned_answers = clean_data()

1 frames
/content/cornell_data_utils.py in clean_data()
    107
    108     with open('movie_questions_2.txt', 'r') as f:
--> 109         lines = f.readlines()
    110         for line in lines:
    111             cleaned_questions.append(cornell_tokenizer(line))

/usr/lib/python3.6/codecs.py in decode(self, input, final)
    319         # decode input (taking the buffer into account)
    320         data = self.buffer + input
--> 321         (result, consumed) = self._buffer_decode(data, self.errors, final)
    322         # keep undecoded input until the next call
    323         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 1085: invalid start byte
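The traceback above shows open('movie_questions_2.txt', 'r') being called without an encoding argument, so Python falls back to UTF-8 here and rejects the byte 0xad. A minimal sketch of the same workaround applied to that call follows. It assumes the question and answer files are ISO-8859-1 encoded like movie_lines.txt, which this thread does not confirm. The helper read_lines and the sample bytes are hypothetical and only illustrate the behaviour; they are not part of cornell_data_utils.py.

```python
import os
import tempfile


def read_lines(path, encoding="iso-8859-1"):
    """Return the file's lines without trailing newlines, decoded with `encoding`.

    Hypothetical helper; the default encoding is an assumption about the data files.
    """
    with open(path, "r", encoding=encoding) as f:
        return [line.rstrip("\n") for line in f]


# Throwaway file containing the byte reported in the traceback (0xad).
sample = b"co\xadoperate\nsecond line\n"
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(sample)
    sample_path = tmp.name

# Decoding as UTF-8 fails, because 0xad cannot start a UTF-8 sequence.
try:
    read_lines(sample_path, encoding="utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# ISO-8859-1 maps every byte to a character, so this read cannot raise a decode error.
lines = read_lines(sample_path)
os.remove(sample_path)
```

In clean_data() the equivalent change would be open('movie_questions_2.txt', 'r', encoding='iso-8859-1'), and likewise for the answers file. Note that ISO-8859-1 never raises on any byte, so if the files are actually in a different encoding the text will load but some characters may come out wrong.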
Dear Luka
Thanks for this repository. I am currently learning from it, and I ran into the following error right at the start, when loading the dataset:
And the error is this:
Even if I download these text files directly from your repo (movie_answers_2.txt and movie_questions_2.txt), it shows the same error.
Can you please tell me what happened and how to fix this?
Thank you very much.