llSourcell / tensorflow_chatbot

Tensorflow chatbot demo by @Sirajology on Youtube

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position N: invalid start byte #17

Open phrohdoh opened 7 years ago

phrohdoh commented 7 years ago

This is the traceback:

(venv_tf_chatbot) tmba:tensorflow_chatbot thill $ python execute.py

>> Mode : train

Preparing data in working_dir/
Tokenizing data in data/test.enc
Traceback (most recent call last):
  File "execute.py", line 319, in <module>
    train()
  File "execute.py", line 127, in train
    enc_train, dec_train, enc_dev, dec_dev, _, _ = data_utils.prepare_custom_data(gConfig['working_directory'],gConfig['train_enc'],gConfig['train_dec'],gConfig['test_enc'],gConfig['test_dec'],gConfig['enc_vocab_size'],gConfig['dec_vocab_size'])
  File "/Users/thill/projects/play/python/tensorflow_chatbot/data_utils.py", line 146, in prepare_custom_data
    data_to_token_ids(test_enc, enc_dev_ids_path, enc_vocab_path, tokenizer)
  File "/Users/thill/projects/play/python/tensorflow_chatbot/data_utils.py", line 119, in data_to_token_ids
    for line in data_file:
  File "/Users/thill/projects/play/python/tensorflow_chatbot/venv_tf_chatbot/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 162, in __next__
    return self.next()
  File "/Users/thill/projects/play/python/tensorflow_chatbot/venv_tf_chatbot/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 156, in next
    retval = self.readline()
  File "/Users/thill/projects/play/python/tensorflow_chatbot/venv_tf_chatbot/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 124, in readline
    return compat.as_str_any(self._read_buf.ReadLineAsString())
  File "/Users/thill/projects/play/python/tensorflow_chatbot/venv_tf_chatbot/lib/python3.6/site-packages/tensorflow/python/util/compat.py", line 106, in as_str_any
    return as_str(value)
  File "/Users/thill/projects/play/python/tensorflow_chatbot/venv_tf_chatbot/lib/python3.6/site-packages/tensorflow/python/util/compat.py", line 84, in as_text
    return bytes_or_text.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 16: invalid start byte

Each time I tried to run python execute.py I got a different position N in the traceback. After 5 attempted runs something finally fixed itself and I can now run as expected.

Has anyone else encountered this?

jprissi commented 7 years ago

The movie_lines.txt file (along with the other .txt files from Cornell's movie corpus) was encoded with the windows-1252 codec, a legacy Latin codec. You have to convert every byte that is not valid UTF-8 to its Unicode equivalent. I did it by hand, but you might find a better way (there are a few thousand of them).
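The by-hand conversion described above can probably be automated. A minimal sketch, assuming the whole file really is windows-1252 (Python names this codec `cp1252`); the function name `convert_to_utf8` and its arguments are hypothetical and not part of this repository:

```python
from pathlib import Path


def convert_to_utf8(src, dst, source_encoding="cp1252"):
    """Re-encode a text file from source_encoding to UTF-8."""
    raw = Path(src).read_bytes()
    # Raises UnicodeDecodeError if the file contains one of the few byte
    # values that cp1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D).
    text = raw.decode(source_encoding)
    # Working on bytes keeps the original line endings untouched.
    Path(dst).write_bytes(text.encode("utf-8"))
```

For example, `convert_to_utf8("movie_lines.txt", "movie_lines.utf8.txt")` would write a UTF-8 copy and leave the original file as it is.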

guduxingzou commented 7 years ago

anyone know how to fix this???

jprissi commented 7 years ago

@guduxingzou

You have to convert every byte that is not valid UTF-8 to its Unicode equivalent.

Search for all the bytes that are not valid UTF-8, look up what they mean in windows-1252, and replace each one with the equivalent Unicode character. Some of these characters are accented letters, such as 'í' (i acute) or 'é'.

guduxingzou commented 7 years ago

@HazeHub I am using Windows with Python 3.5. I deleted the files and then built new ones from my own data, but I still get an error:

Traceback (most recent call last):
  File "C:\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2902, in run_code
    self.showtraceback()
  File "C:\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 1832, in showtraceback
    self._showtraceback(etype, value, stb)
  File "C:\Anaconda3\lib\site-packages\ipykernel\zmqshell.py", line 448, in _showtraceback
    exc_msg = dh.session.send(dh.pub_socket, u'error', json_clean(exc_content), dh.parent_header, ident=topic)
  File "C:\Anaconda3\lib\site-packages\jupyter_client\session.py", line 673, in send
    to_send = self.serialize(msg, ident)
  File "C:\Anaconda3\lib\site-packages\jupyter_client\session.py", line 576, in serialize
    content = self.pack(content)
  File "C:\Anaconda3\lib\site-packages\jupyter_client\session.py", line 95, in
    ensure_ascii=False, allow_nan=False,
  File "C:\Anaconda3\lib\site-packages\zmq\utils\jsonapi.py", line 43, in dumps
    s = s.encode('utf8')
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcd5' in position 73: surrogates not allowed

jprissi commented 7 years ago

I don't recognize these files and can't tell you where to look for this '\udcd5' character. Is that the whole error message, or is there more? Note also that you will have to fix ALL the regular expression calls to make the program work (at least three, if I remember correctly).

EDIT: You should check that all the data files are encoded as 'utf-8', since this kind of stack trace is related to that kind of issue.
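One way to run that check, as a rough sketch: try to decode each data file as UTF-8 and report the offset of the first byte that fails. The helper name `first_invalid_utf8_offset` is made up for this example.

```python
from pathlib import Path


def first_invalid_utf8_offset(path):
    """Return the offset of the first byte that is not valid UTF-8, or None."""
    data = Path(path).read_bytes()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the index of the offending byte within the file.
        return exc.start
    return None
```

Running it over train.enc, train.dec, test.enc, test.dec and the generated vocab files should show which of them still need converting.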

guduxingzou commented 7 years ago

@HazeHub Thanks! I tested on Ubuntu; on Windows it never succeeds. Can you email me the files you have cleaned? My email is 69462803@qq.com

I would be very grateful!

guduxingzou commented 7 years ago

I use Notepad++ to clean the data. The method for English text is as follows:

Step 1: replace the newline characters, i.e. replace \n with iiii
Step 2: replace [^A-Za-z ,.!?'$+-] with a space
Step 3: replace iiii back with \n
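The same cleanup can be sketched in Python, assuming the text has already been read into a string. The name `clean_text` is hypothetical. This version adds `\n` to the allowed set instead of using the temporary iiii placeholder, which should give the same result as the three steps above.

```python
import re

# Same character class as the Notepad++ step, plus "\n" so that line breaks
# survive without the temporary placeholder.
_NOT_ALLOWED = re.compile(r"[^A-Za-z ,.!?'$+\-\n]")


def clean_text(text):
    """Replace every character outside the allowed set with a space."""
    return _NOT_ALLOWED.sub(" ", text)
```

Note that this also replaces digits and accented letters with spaces, exactly as the original pattern does, so it is a lossy cleanup rather than an encoding fix.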

suyashbansal commented 7 years ago

@guduxingzou Can you share the cleaned files. I was not able to understand the method you mentioned.

jprissi commented 7 years ago

train.enc.txt train.dec.txt test.enc.txt test.dec.txt

Just remove all the .txt at the end (Github doesn't accept .dec and .enc files).

suyashbansal commented 7 years ago

@HazeHub I tried your files. Still getting the same error.

image

jprissi commented 7 years ago

yup sorry, I tried too on a fresh install and the unicode problem comes from vocab20000.enc and vocab20000.dec. Here you go : vocab20000.dec.txt vocab20000.enc.txt

papercodeIN commented 6 years ago

UnicodeDecodeError Traceback (most recent call last)
in ()
----> 1 train_top_model()

in train_top_model()
      1 def train_top_model():
----> 2     train_data = np.load(open('bottleneck_features_train.npy'))
      3     train_labels = np.array(
      4         [0] * (nb_train_samples / 2) + [1] * (nb_train_samples / 2))
      5

/usr/local/lib/python3.6/dist-packages/numpy/lib/npyio.py in load(file, mmap_mode, allow_pickle, fix_imports, encoding)
    400         _ZIP_PREFIX = b'PK\x03\x04'
    401         N = len(format.MAGIC_PREFIX)
--> 402         magic = fid.read(N)
    403         # If the file size is less than N, we need to make sure not
    404         # to seek past the beginning of the file

/usr/lib/python3.6/codecs.py in decode(self, input, final)
    319         # decode input (taking the buffer into account)
    320         data = self.buffer + input
--> 321         (result, consumed) = self._buffer_decode(data, self.errors, final)
    322         # keep undecoded input until the next call
    323         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 0: invalid start byte

What should I do?
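A likely cause here is that the .npy file is opened in text mode: .npy is a binary format whose first byte is 0x93, which matches "byte 0x93 in position 0" in the error. The sketch below reproduces the difference with the standard library only; the helper names are hypothetical.

```python
# Every .npy file starts with this magic prefix; 0x93 is not a valid
# UTF-8 start byte, so decoding it as text fails immediately.
NPY_MAGIC = b"\x93NUMPY"


def read_prefix_text(path, n=6):
    """Read n characters in text mode; raises UnicodeDecodeError on .npy data."""
    with open(path, encoding="utf-8") as f:
        return f.read(n)


def read_prefix_binary(path, n=6):
    """Read n raw bytes in binary mode; works for any file."""
    with open(path, "rb") as f:
        return f.read(n)
```

In the code from the traceback, passing the file name directly, as in np.load('bottleneck_features_train.npy'), or opening the file with mode 'rb' should avoid the text decoding step.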
e3oroush commented 6 years ago

You can always safely read in binary mode and decode it in utf8 with ignore mode. In Python something like this:

with open(filename, 'rb') as f:
     lines = [l.decode('utf8', 'ignore') for l in f.readlines()]
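One caveat: 'ignore' silently drops every byte that fails to decode, so the text can change without any warning. A small sketch of the difference between 'ignore' and 'replace', using made-up sample bytes:

```python
raw = b"before \x96 after"

# 'ignore' removes the undecodable byte entirely.
ignored = raw.decode("utf8", "ignore")

# 'replace' substitutes U+FFFD, so the damaged spot stays visible.
replaced = raw.decode("utf8", "replace")
```

If the source encoding is known (for example cp1252), decoding with that codec keeps the character instead of losing it.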
tissy91 commented 6 years ago

UnicodeDecodeError Traceback (most recent call last)
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._string_convert()

pandas/_libs/parsers.pyx in pandas._libs.parsers._string_box_utf8()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 127: invalid start byte

During handling of the above exception, another exception occurred:

UnicodeDecodeError Traceback (most recent call last)
in ()
----> 1 df_csv = pandas.read_csv("imdb.csv", sep=',')

H:\Python\lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    707         skip_blank_lines=skip_blank_lines)
    708
--> 709     return _read(filepath_or_buffer, kwds)
    710
    711 parser_f.__name__ = name

H:\Python\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
    453
    454     try:
--> 455         data = parser.read(nrows)
    456     finally:
    457         parser.close()

H:\Python\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
   1067             raise ValueError('skipfooter not supported for iteration')
   1068
-> 1069         ret = self._engine.read(nrows)
   1070
   1071         if self.options.get('as_recarray'):

H:\Python\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
   1837     def read(self, nrows=None):
   1838         try:
-> 1839             data = self._reader.read(nrows)
   1840         except StopIteration:
   1841             if self._first_chunk:

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_column_data()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._string_convert()
pandas/_libs/parsers.pyx in pandas._libs.parsers._string_box_utf8()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 127: invalid start byte

How can I solve this?

gowthamdada commented 6 years ago

import pandas as pd
from pandas import DataFrame
import numpy as np
import matplotlib.pyplot as plt
import datetime

import csv
df = pd.read_csv('cars_sample.csv',delimiter = ',',encoding = "utf-8")
print (df)

UnicodeDecodeError Traceback (most recent call last)
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._string_convert()

pandas\_libs\parsers.pyx in pandas._libs.parsers._string_box_utf8()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 32: invalid start byte

During handling of the above exception, another exception occurred:

UnicodeDecodeError Traceback (most recent call last)
in ()
      1 import csv
----> 2 df = pd.read_csv('cars_sample.csv',delimiter = ',',encoding = "utf-8")
      3 print (df)

~\Anaconda3\lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, doublequote, delim_whitespace, low_memory, memory_map, float_precision)
    676         skip_blank_lines=skip_blank_lines)
    677
--> 678     return _read(filepath_or_buffer, kwds)
    679
    680 parser_f.__name__ = name

~\Anaconda3\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
    444
    445     try:
--> 446         data = parser.read(nrows)
    447     finally:
    448         parser.close()

~\Anaconda3\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
   1034             raise ValueError('skipfooter not supported for iteration')
   1035
-> 1036         ret = self._engine.read(nrows)
   1037
   1038         # May alter columns / col_dict

~\Anaconda3\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
   1846     def read(self, nrows=None):
   1847         try:
-> 1848             data = self._reader.read(nrows)
   1849         except StopIteration:
   1850             if self._first_chunk:

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.read()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_rows()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._convert_column_data()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._string_convert()
pandas\_libs\parsers.pyx in pandas._libs.parsers._string_box_utf8()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 32: invalid start byte

HELP ME TO SOLVE THIS

gowthamdada commented 6 years ago

import pandas as pd
from pandas import DataFrame
import numpy as np
import matplotlib.pyplot as plt
import datetime

import csv
data = pd.read_csv(C:\Users\Dada\Documents\Untitled-Folder\cars_sample.csv, encoding = "utf-8")
print(df)

File "", line 2
    df = pd.read_csv('C:\Users\Dada\Documents\Untitled-Folder\cars_sample.csv',delimiter = ',',encoding = "utf-8")
    ^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
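This SyntaxError is separate from the encoding problem: in a normal Python string literal, a backslash followed by U starts an eight-digit Unicode escape, so the C:\Users part of the path cannot be parsed. A sketch of three equivalent ways to write the path from the post:

```python
from pathlib import PureWindowsPath

# Raw string: backslashes are kept literally, no escape processing.
raw_path = r"C:\Users\Dada\Documents\Untitled-Folder\cars_sample.csv"

# Doubled backslashes: each "\\" is one literal backslash.
escaped_path = "C:\\Users\\Dada\\Documents\\Untitled-Folder\\cars_sample.csv"

# Forward slashes are also accepted by Windows file APIs.
slash_path = "C:/Users/Dada/Documents/Untitled-Folder/cars_sample.csv"

# All three name the same file.
same = PureWindowsPath(slash_path) == PureWindowsPath(raw_path)
```

Any of these should get past the SyntaxError; the original UnicodeDecodeError may still appear afterwards if the CSV itself is not UTF-8.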

Ashi-s commented 5 years ago

The movie_lines.txt file (along with the other .txt files from Cornell's movie corpus) was encoded with the windows-1252 codec, a legacy Latin codec. You have to convert every byte that is not valid UTF-8 to its Unicode equivalent. I did it by hand, but you might find a better way (there are a few thousand of them).

Hey, can you share the link of the dataset?

ahmadbelal7861 commented 5 years ago

stime = time.time()
for name in trainingfilenames.keys():
    if name == 'images':
        train_imagesfile = open(trainingfilenames['images'],'rb')
    if name == 'labels':
        train_labelsfile = open(trainingfilenames['labels'],'rb')

train_imagesfile.seek(0)
magic = st.unpack('>4B',train_imagesfile.read(4))
if(magic[0] and magic[1])or(magic[2] not in data_types):
    raise ValueError("File Format not correct")

ValueError Traceback (most recent call last)
in ()
      9 magic = st.unpack('>4B',train_imagesfile.read(4))
     10 if(magic[0] and magic[1])or(magic[2] not in data_types):
---> 11     raise ValueError("File Format not correct")

ValueError: File Format not correct

How can I solve it?

Anurag1166 commented 5 years ago

I am facing the same Unicode error. I tried every method I found online, but none of them works for me. Please tell me where I am making a mistake. So far I have tried putting r at the start of the path to make it a raw string, using // and ///, and passing delimiter = ',', encoding = "utf-8".

The code is:

import pandas as pd

data=pd.read_csv(r'C:\Users\AnuAbhi\PycharmProjects\First.py\sample.csv',delimiter = ',',encoding = "utf-8")
print(data)

Error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 1: invalid start byte

leokwu commented 5 years ago

I met the same problem:

Use tf.gfile.GFile.
Traceback (most recent call last):
  File "/usr/local/bin/freeze_graph", line 11, in
    sys.exit(run_main())
  File "/home/wuli/.local/lib/python3.6/site-packages/tensorflow/python/tools/freeze_graph.py", line 488, in run_main
    app.run(main=my_main, argv=[sys.argv[0]] + unparsed)
  File "/home/wuli/.local/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/home/wuli/.local/lib/python3.6/site-packages/tensorflow/python/tools/freeze_graph.py", line 487, in
    my_main = lambda unused_args: main(unused_args, flags)
  File "/home/wuli/.local/lib/python3.6/site-packages/tensorflow/python/tools/freeze_graph.py", line 381, in main
    flags.saved_model_tags, checkpoint_version)
  File "/home/wuli/.local/lib/python3.6/site-packages/tensorflow/python/tools/freeze_graph.py", line 344, in freeze_graph
    input_meta_graph, input_binary)
  File "/home/wuli/.local/lib/python3.6/site-packages/tensorflow/python/tools/freeze_graph.py", line 268, in _parse_input_meta_graph_proto
    text_format.Merge(f.read(), input_meta_graph_def)
  File "/home/wuli/.local/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 132, in read
    pywrap_tensorflow.ReadFromStream(self._read_buf, length, status))
  File "/home/wuli/.local/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 100, in _prepare_value
    return compat.as_str_any(val)
  File "/home/wuli/.local/lib/python3.6/site-packages/tensorflow/python/util/compat.py", line 107, in as_str_any
    return as_str(value)
  File "/home/wuli/.local/lib/python3.6/site-packages/tensorflow/python/util/compat.py", line 80, in as_text
    return bytes_or_text.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 1: invalid start byte

purnasai commented 5 years ago

Please try this; it worked for me:

data = pd.read_csv('airdata.csv', delimiter = ',' , encoding = 'unicode_escape')

a4ter commented 5 years ago

I had the same issue and fixed it by opening the file in Notepad and selecting UTF-8 as the encoding when saving it.

dragolemguty commented 4 years ago

UwU

kiranbeethoju commented 4 years ago

Facing this error?

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position someNumber: invalid start byte

Whenever I come to GitHub for a solution, I see only garbage comments, and the status shows this problem as closed. WTH... Here is the solution, which will definitely solve the issue:

data = pd.read_csv("your.csv", encoding='cp1252')

hasnainshagaf commented 4 years ago

I am facing the same issue, but my files are MIDI files, so the resolution below is not working for me:

data = pd.read_csv("your.csv", encoding='cp1252')

I also tried converting the files to UTF-8 in Notepad++ and re-saving them, but that corrupts my files.

Please suggest a resolution. I have tried everything, but the problem is still not resolved.

Error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 10: invalid start byte

kiranbeethoju commented 4 years ago

You can try this one ...

data = pd.read_csv("your.csv", encoding='latin1')
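Worth knowing before relying on this: latin1 never raises, because every byte value maps to some character, so it also cannot tell you when the guess is wrong. A small sketch of how the two codecs treat the 0x96 byte from this issue (the variable names are just for illustration):

```python
sample = b"\x96"

# latin1 maps the byte to the invisible control character U+0096.
as_latin1 = sample.decode("latin1")

# cp1252 maps the same byte to an en dash (U+2013), which is usually
# what a Windows-authored text file meant.
as_cp1252 = sample.decode("cp1252")

# Bytes from 0xA0 upwards, such as 0xA3 (pound sign), decode identically.
same_pound = b"\xa3".decode("latin1") == b"\xa3".decode("cp1252")
```

So if the data came from a Windows tool, cp1252 is probably the closer guess, with latin1 as a fallback that always succeeds but may leave control characters in the text.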

Swati640 commented 3 years ago

You can always safely read in binary mode and decode it in utf8 with ignore mode. In Python something like this:

with open(filename, 'rb') as f:
     lines = [l.decode('utf8', 'ignore') for l in f.readlines()]

This worked for me. Thank you :)