rkern / line_profiler

(OLD REPO) Line-by-line profiling for Python - Current repo ->
https://github.com/pyutils/line_profiler

UnicodeDecodeError: 'gbk' codec can't decode byte 0xaa in position 553: illegal multibyte sequence #37

Closed xgdgsc closed 9 years ago

xgdgsc commented 9 years ago

On Windows, if the Python script is encoded as utf-8 while the system default encoding is gbk, running kernprof -l throws an error like:

Traceback (most recent call last):
  File "C:\Anaconda3\lib\runpy.py", line 170, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Anaconda3\Scripts\kernprof.exe\__main__.py", line 9, in <module>
  File "C:\Anaconda3\lib\site-packages\kernprof.py", line 221, in main
    execfile(script_file, ns, ns)
  File "C:\Anaconda3\lib\site-packages\kernprof.py", line 34, in execfile
    exec_(compile(f.read(), filename, 'exec'), globals, locals)
UnicodeDecodeError: 'gbk' codec can't decode byte 0xaa in position 553: illegal multibyte sequence

I currently work around this by converting the Python file to gbk first and then running kernprof. If I then try to view the FILE.lprof file with python -m line_profiler FILE.lprof, it also gives an encoding error, so I have to convert the script back to utf-8 before running python -m line_profiler FILE.lprof to see the results. Is there a better way?
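For context, a minimal sketch of the failure mode, assuming a Windows locale whose preferred encoding is cp936/gbk and a hypothetical script name:

# script.py is saved as UTF-8 and contains non-ASCII comments.
# With no explicit encoding, open() falls back to the locale's preferred encoding (gbk here),
# so reading the UTF-8 bytes raises UnicodeDecodeError.
with open('script.py') as f:   # equivalent to encoding=locale.getpreferredencoding(False)
    source = f.read()          # fails at the first byte sequence that is not valid gbk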

rkern commented 9 years ago

What "system default encoding" are you referring to?

xgdgsc commented 9 years ago

That is the encoding that Windows defaults to when you choose your locale.

rkern commented 9 years ago

What does sys.getdefaultencoding() say?

xgdgsc commented 9 years ago

'utf-8'

rkern commented 9 years ago

Can you post the full filename, provided it doesn't leak personal information? i.e. print(repr(filename)) just before the line that fails. I am wondering if byte 553 in it is what's causing the problem, because the filename is the only thing that should be getting implicitly decoded as gbk.

rkern commented 9 years ago

The other thing to try for getting more information would be to take that execfile() function and try it on its own. Rewrite it to break down each step independently on its own line; e.g. first f.read(), then compile(), then exec(). The line that is throwing the exception is doing several things at once, and it's hard to know which one is actually throwing the error. Double-check which bytes object actually has the 0xaa byte.
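A sketch of that breakdown, reusing the names from the traceback above (filename and the ns namespace dict); this is a debugging aid, not the actual kernprof source:

with open(filename) as f:                  # step 1: open in text mode (locale default encoding)
    source = f.read()                      # if UnicodeDecodeError is raised here, decoding the file is the culprit
code = compile(source, filename, 'exec')   # step 2: compile the decoded source
exec(code, ns, ns)                         # step 3: execute the compiled code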

xgdgsc commented 9 years ago

Actually, if I remove all Chinese characters from the comments and keep only English characters in the Python script, the error doesn't occur. So the filename doesn't matter; the content does.

rkern commented 9 years ago

Okay. I really don't know why Python would try to decode the file content as gbk unless it declares itself so at the top.

xgdgsc commented 9 years ago

According to Processing Text Files in Python 3 (Nick Coghlan's Python Notes), Python uses the result of:

locale.getpreferredencoding()

to read files, which is 'cp936' in my case.
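A quick way to confirm this on the affected machine (a sketch; the commented output values are what a Chinese-locale Windows install would be expected to report):

import locale
import sys

print(sys.getdefaultencoding())        # 'utf-8' on Python 3
print(locale.getpreferredencoding())   # 'cp936' on a Chinese-locale Windows system
# open() with no encoding argument uses locale.getpreferredencoding(False),
# which is why the script's UTF-8 bytes end up being decoded as gbk/cp936.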

xgdgsc commented 9 years ago

chardet seems to be able to recognize the file encoding correctly.
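For illustration, a sketch of how chardet could be used to detect the encoding before decoding (assumes chardet is installed; the file name is hypothetical):

import chardet

with open('script.py', 'rb') as f:     # read raw bytes, no decoding yet
    raw = f.read()

guess = chardet.detect(raw)            # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
text = raw.decode(guess['encoding'])   # decode with the detected encoding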

rkern commented 9 years ago

If you change open(filename) to open(filename, 'rb'), does that help?

I'm still not sure where a UnicodeDecodeError referring to gbk would come from.

xgdgsc commented 9 years ago

with open(filename, 'rb') as f:

works!
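For reference, a sketch of why the binary read works: compile() accepts bytes and applies Python's own source-encoding rules (UTF-8 by default, or a PEP 263 coding declaration), so the locale encoding is never consulted. This illustrates the approach rather than the exact kernprof patch:

def execfile(filename, globals=None, locals=None):
    # Read the script as raw bytes and let compile() determine the source encoding.
    with open(filename, 'rb') as f:
        exec(compile(f.read(), filename, 'exec'), globals, locals)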

yaoweihu commented 7 years ago

OK

gtaifu commented 6 years ago

According to the Python 3 Unicode document, you could specify the encoding while opening the file using the following line:

with open(filename, encoding='utf-8', mode='r') as f:
    for line in f:
        print(repr(line))

In this way, each original character is still treated as a single character, and you can operate on parts of the file without worrying about splitting the bytes of a single character.

lovechang1986 commented 6 years ago

GBK is not the newest encoding in that family, so some special characters can't be decoded. You can try GB18030.
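For what it's worth, GB18030 is a superset of GBK that covers all of Unicode, so bytes that fail to decode as gbk may still decode as gb18030. A small, hand-picked illustration:

data = b'\x81\x30\x81\x30'   # a four-byte GB18030 sequence
data.decode('gb18030')       # decodes successfully: GB18030 defines four-byte codes
data.decode('gbk')           # raises UnicodeDecodeError: gbk has only one- and two-byte codes

This helps when the data really is in the GBK family; it does not recover text from a UTF-8 file.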

suifengtec commented 6 years ago

The following method fixed it for me; it may work for you.

open(path, 'r', encoding='utf-8')

It definitely works!

mystical1226 commented 5 years ago

The following method fixed it for me; it may work for you.

open(path, 'r', encoding='utf-8')

It definitely works!

It works.

qingchenwuhou commented 5 years ago

When reading from the stdin stream:

import io
import sys

# Decode stdin as gb18030 (a superset of gbk); bytes that still fail to decode are ignored.
input_stream = io.TextIOWrapper(sys.stdin.buffer, encoding='gb18030', errors='ignore')
for line in input_stream:
    ...

wxqsql commented 5 years ago

We can try this: with open("filename", encoding='ascii', errors='ignore') as f:
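Note that errors='ignore' silently drops every byte that is not valid in the chosen encoding, so with encoding='ascii' all non-ASCII text simply disappears. A small illustration:

raw = '# 注释 comment'.encode('utf-8')   # UTF-8 bytes containing Chinese characters
raw.decode('ascii', errors='ignore')     # '#  comment': the Chinese characters are gone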

dangshanli commented 5 years ago

According to the Python 3 Unicode document, you could specify the encoding while opening the file using the following line:

with open(filename, encoding='utf-8', mode='r') as f:
    for line in f:
        print(repr(line))

In this way, each original character is still treated as a single character, and you can operate on parts of the file without worrying about splitting the bytes of a single character.

On Windows, files are parsed as gbk by default; once the encoding is specified explicitly, the problem indeed goes away.

lishaofeng commented 4 years ago

with open(file_path, 'rb') as f: works for me, while with open(file_path, 'r', encoding='utf-8') doesn't. However, using 'rb' means my contents come back with the b'' prefix (as bytes), so sad! Does anyone have a good idea? Thanks!
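If binary mode is the only thing that works, the b'' prefix just means the data comes back as bytes rather than str; decoding explicitly after the read restores normal strings. A sketch, assuming the file really is UTF-8:

with open(file_path, 'rb') as f:
    text = f.read().decode('utf-8')   # bytes -> str, so no b'' prefix
for line in text.splitlines():
    print(line)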