Open phrohdoh opened 7 years ago
The movie_lines.txt file (along with the other .txt files from Cornell's movie corpus) was encoded with the windows-1252 codec, an outdated Latin codec. You have to convert all the non-Unicode bytes to their Unicode equivalents. I did it by hand, but you might find a better way (there are a few thousand of them).
anyone know how to fix this???
@guduxingzou
You have to convert all the non-Unicode bytes to their Unicode equivalents.
Search for all the non-Unicode bytes, look up what each one means in windows-1252, and replace it with the equivalent Unicode character. Some of these characters are accented letters, for example the byte for 'i acute' becomes 'í' and the byte for 'e acute' becomes 'é'.
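The manual search-and-replace described above can probably be avoided by letting Python do the conversion. This is only a minimal sketch, assuming the file really is windows-1252 throughout; the function name and the output file name in the comment are made up for illustration.

```python
from pathlib import Path


def convert_cp1252_to_utf8(src, dst):
    """Re-encode a windows-1252 text file as UTF-8 and return the character count."""
    # Read raw bytes so no platform-dependent decoding or newline translation happens.
    raw = Path(src).read_bytes()
    # errors="replace" maps the few byte values that cp1252 leaves undefined
    # to U+FFFD instead of raising an exception.
    text = raw.decode("cp1252", errors="replace")
    Path(dst).write_bytes(text.encode("utf-8"))
    return len(text)


# Hypothetical usage (the output name is an assumption):
# convert_cp1252_to_utf8("movie_lines.txt", "movie_lines.utf8.txt")
```

Writing to a new file rather than overwriting keeps the original around in case the encoding guess turns out to be wrong.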
@HazeHub I use Windows with Python 3.5. I deleted the files and then built new ones from my own data, but I still get an error:
Traceback (most recent call last):
File "C:\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2902, in run_code self.showtraceback()
File "C:\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 1832, in showtraceback self._showtraceback(etype, value, stb)
File "C:\Anaconda3\lib\site-packages\ipykernel\zmqshell.py", line 448, in _showtraceback exc_msg = dh.session.send(dh.pub_socket, u'error', json_clean(exc_content), dh.parent_header, ident=topic)
File "C:\Anaconda3\lib\site-packages\jupyter_client\session.py", line 673, in send to_send = self.serialize(msg, ident)
File "C:\Anaconda3\lib\site-packages\jupyter_client\session.py", line 576, in serialize content = self.pack(content)
File "C:\Anaconda3\lib\site-packages\jupyter_client\session.py", line 95, in
File "C:\Anaconda3\lib\site-packages\zmq\utils\jsonapi.py", line 43, in dumps s = s.encode('utf8')
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcd5' in position 73: surrogates not allowed
I don't recognize these files and can't tell you where to look for this '\udcd5' character. Is that the whole error message, or is there more? I'd also like to point out that you will have to fix ALL the regular expression calls to make the program work (at least three, if I remember correctly).
EDIT: You should check that the encoding of all the data files is set to 'utf-8', as this type of stack trace is usually related to that kind of issue.
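One way to check the data files, as suggested above, is to try decoding each one as UTF-8 and report where it first fails. This is a rough sketch, not part of the project; the function name is hypothetical and the file names in the comment are only examples.

```python
def first_invalid_utf8(path):
    """Return None if the file is valid UTF-8, else (byte_offset, offending_byte)."""
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first byte the decoder rejected.
        return exc.start, data[exc.start]
    return None


# Hypothetical usage over a set of data files (names are examples):
# for name in ["train.enc", "train.dec"]:
#     print(name, first_invalid_utf8(name))
```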
@HazeHub Thanks! I tested on Ubuntu; on Windows it never succeeds. Could you email me the files you cleaned? My email is 69462803@qq.com
I would be very grateful!
I use Notepad++ to clean the data. The method for English text is as follows:
Step 1: Replace the line breaks: replace \n with iiii.
Step 2: Replace [^A-Za-z ,.!?'$+-] with a space.
Step 3: Replace iiii back with \n.
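The three Notepad++ steps above could also be done in one pass in Python. This is a sketch under the assumption that the goal is simply "keep the listed characters and the line breaks, blank out everything else"; the iiii placeholder is not needed because \n is added to the allowed set. The function name is made up.

```python
import re

# Same character class as the Notepad++ pattern, plus \n so line breaks survive.
_DISALLOWED = re.compile(r"[^A-Za-z ,.!?'$+\-\n]")


def clean_text(text):
    """Replace every character outside the allowed set with a single space."""
    return _DISALLOWED.sub(" ", text)
```

Note that this also blanks out digits and accented letters, exactly as the original pattern does.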
@guduxingzou Can you share the cleaned files? I was not able to understand the method you described.
train.enc.txt train.dec.txt test.enc.txt test.dec.txt
Just remove the trailing .txt from each file name (GitHub doesn't accept .dec and .enc files).
@HazeHub I tried your files. Still getting the same error.
Yup, sorry. I also tried on a fresh install, and the Unicode problem comes from vocab20000.enc and vocab20000.dec. Here you go: vocab20000.dec.txt vocab20000.enc.txt
UnicodeDecodeError Traceback (most recent call last)
You can always safely read in binary mode and decode it in utf8 with ignore mode. In Python something like this:
with open(filename, 'rb') as f:
    lines = [l.decode('utf8', 'ignore') for l in f.readlines()]
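It may help to see what 'ignore' actually does before relying on it: it never raises, but the undecodable bytes are dropped silently. A small illustration with a made-up byte string:

```python
# 'café au lait' as windows-1252 bytes; 0xE9 is not valid UTF-8 in this position.
raw = b"caf\xe9 au lait"

ignored = raw.decode("utf-8", "ignore")    # the bad byte is silently dropped
replaced = raw.decode("utf-8", "replace")  # the bad byte becomes U+FFFD
as_cp1252 = raw.decode("cp1252")           # the intended text is recovered

print(ascii(ignored))
print(ascii(replaced))
print(ascii(as_cp1252))
```

So 'ignore' is fine when losing a few characters does not matter, but decoding with the file's real encoding keeps them.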
UnicodeDecodeError Traceback (most recent call last) pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._string_convert()
pandas/_libs/parsers.pyx in pandas._libs.parsers._string_box_utf8()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 127: invalid start byte
During handling of the above exception, another exception occurred:
UnicodeDecodeError Traceback (most recent call last)
import pandas as pd
from pandas import DataFrame
import numpy as np
import matplotlib.pyplot as plt
import datetime
UnicodeDecodeError Traceback (most recent call last) pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._string_convert()
pandas\_libs\parsers.pyx in pandas._libs.parsers._string_box_utf8()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 32: invalid start byte
During handling of the above exception, another exception occurred:
UnicodeDecodeError Traceback (most recent call last)
import pandas as pd
from pandas import DataFrame
import numpy as np
import matplotlib.pyplot as plt
import datetime
import csv
data = pd.read_csv(r"C:\Users\Dada\Documents\Untitled-Folder\cars_sample.csv", encoding="utf-8")
print(data)
File "
The movie_lines.txt file (along with the other .txt files from Cornell's movie corpus) was encoded with the windows-1252 codec, an outdated Latin codec. You have to convert all the non-Unicode bytes to their Unicode equivalents. I did it by hand, but you might find a better way (there are a few thousand of them).
Hey, can you share the link to the dataset?
stime = time.time()
for name in trainingfilenames.keys():
    if name == 'images':
        train_imagesfile = open(trainingfilenames['images'], 'rb')
    if name == 'labels':
        train_labelsfile = open(trainingfilenames['labels'], 'rb')

train_imagesfile.seek(0)
magic = st.unpack('>4B', train_imagesfile.read(4))
if (magic[0] and magic[1]) or (magic[2] not in data_types):
    raise ValueError("File Format not correct")
^ ValueError Traceback (most recent call last)
I am facing the same Unicode error. I tried every method I found online, but none of them works for me. Please point out where I am making a mistake. I tried the following: putting r at the start of the path to make it a raw string, using // and /// in the path, and passing delimiter = ',', encoding = "utf-8".
Code is:
import pandas as pd
data = pd.read_csv(r'C:\Users\AnuAbhi\PycharmProjects\First.py\sample.csv', delimiter=',', encoding="utf-8")
print(data)
Error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 1: invalid start byte
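The byte values in these "invalid start byte" messages are a useful hint. The sketch below looks up the three bytes reported so far in this thread (0xa3, 0x80, 0x96): each falls in the range that UTF-8 reserves for continuation bytes, and each is an ordinary printable character in windows-1252, which suggests (but does not prove) that the files are windows-1252.

```python
reported = [0xA3, 0x80, 0x96]  # byte values taken from the tracebacks above

for byte in reported:
    # UTF-8 continuation bytes are 0x80..0xBF; they can never start a character.
    is_continuation = 0x80 <= byte <= 0xBF
    as_cp1252 = bytes([byte]).decode("cp1252")
    print(f"0x{byte:02x}: continuation byte={is_continuation}, cp1252={as_cp1252!a}")

# The same lookup as a dict, keyed by the hex form used in the error messages.
lookup = {hex(b): bytes([b]).decode("cp1252") for b in reported}
```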
I met the same problem:
Use tf.gfile.GFile.
Traceback (most recent call last):
File "/usr/local/bin/freeze_graph", line 11, in
Please try this; it worked for me:
data = pd.read_csv('airdata.csv', delimiter = ',' , encoding = 'unicode_escape')
I had the same issue and fixed it by opening the file in Notepad and, when saving it, selecting UTF-8 as the encoding.
UwU
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position someNumber: invalid start byte
Facing this error?
Whenever I come to GitHub for a solution, I see only garbage comments, and the status shows the problem as closed. WTH...
Here is a solution that works when the file is actually encoded as windows-1252 (byte 0x96 is an en dash in that encoding):
data = pd.read_csv("your.csv", encoding='cp1252')
I am facing the same issue, but my files are MIDI files, so the resolution below does not work for me:
data = pd.read_csv("your.csv", encoding='cp1252')
I also tried converting the files to UTF-8 in Notepad++ and re-saving them, but the files get corrupted.
Please suggest a resolution. I have tried everything, but the problem is still not resolved.
Error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 10: invalid start byte
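MIDI files are binary, not text, so no text encoding will fit them, and re-saving them in a text editor is what corrupts them. They need to be opened in 'rb' mode (or with a MIDI library). As a rough sketch, assuming Standard MIDI Files and with a made-up function name, the header can be read like this; it does not reproduce whatever loader the poster was using.

```python
import struct


def read_midi_header(path):
    """Return (format, track_count, division) from a Standard MIDI File header."""
    with open(path, "rb") as f:  # binary mode: no decoding is attempted
        chunk_id, length = struct.unpack(">4sI", f.read(8))
        if chunk_id != b"MThd":
            raise ValueError("not a Standard MIDI File")
        # The header chunk body holds three big-endian 16-bit fields.
        fmt, ntracks, division = struct.unpack(">HHH", f.read(6))
    return fmt, ntracks, division
```

If this raises ValueError on a file, the file was probably already damaged by an earlier re-save and should be restored from the original.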
You can try this one ...
data = pd.read_csv("your.csv", encoding='latin1')
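Rather than trying encodings one at a time by hand, the candidates suggested in this thread can be tested in order. This is only a sketch with a hypothetical helper name; decoding without an error does not prove the guess is right, and latin-1 accepts every byte, so it acts as a last resort that never fails.

```python
def guess_encoding(path, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes the whole file, else None."""
    with open(path, "rb") as f:
        data = f.read()
    for encoding in candidates:
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            continue  # this candidate rejected some byte; try the next one
        return encoding
    return None


# Hypothetical usage with pandas:
# data = pd.read_csv("your.csv", encoding=guess_encoding("your.csv"))
```

The order matters: UTF-8 is strict and goes first, while the single-byte codecs accept almost anything and go last.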
You can always safely read in binary mode and decode it in utf8 with ignore mode. In Python something like this:
with open(filename, 'rb') as f:
    lines = [l.decode('utf8', 'ignore') for l in f.readlines()]
This worked for me. Thank you :)
This is the traceback:
Each time I tried to run
python execute.py
I got a different position N in the traceback. After 5 attempted runs, something finally fixed itself, and I can now run it as expected. Has anyone else encountered this?