oracle / graalpython

A Python 3 implementation built on GraalVM

"SyntaxError: Non-UTF-8 code" but works fine in CPython #332

Closed by The-Alchemist 1 year ago

The-Alchemist commented 1 year ago

Offending File

The file is https://github.com/prompt-toolkit/python-prompt-toolkit/blob/master/examples/full-screen/ansi-art-and-textarea.py#L53

GraalPy error

File "../pptk/python-prompt-toolkit/examples/full-screen/ansi-art-and-textarea.py", line 58
SyntaxError: Non-UTF-8 code starting with '\xe2' in file ../play_with_pptk/python-prompt-toolkit/examples/full-screen/ansi-art-and-textarea.py on line 59, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

Workaround

Add the following line as line 2:

# encoding=UTF-8

Details

Oddly, I can't find \xe2, the character it's complaining about, anywhere in the file.

> grep 'xe2' ../pptk/python-prompt-toolkit/examples/full-screen/ansi-art-and-textarea.py
>

Is the locale not read correctly?

> locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=

File to Reproduce

I've created a small file to reproduce the issue. Just rename to .py:

graalpython_issue_332.py.txt

msimacek commented 1 year ago

I cannot reproduce it with either the attached file or the prompt-toolkit example. Are you sure you really ran it with GraalPy? I couldn't find any part of that error message anywhere in our source except in some test files copied over from CPython (and we fail those tests).

There are a lot of \xe2 bytes in those files. The \x notation displays the hexadecimal value of a byte, so \xe2 is a single byte with value 226, and you cannot easily find it with grep. If you open the file in Python with mode 'rb', you'll see them. Those bytes are not valid UTF-8 characters on their own; they are part of a sequence that forms a valid character. For example, try b'\xe2\x96\x80'.decode('utf-8'): it's some sort of square-looking character. But just b'\xe2'.decode('utf-8') would fail.

It looks like your file somehow got cut off in the middle or the read got interrupted. But I still don't see how you could get that error with GraalPy. When I manually cut your file after \xe2 and run it with CPython, I get exactly the message you got. But when I run it with GraalPy, I get SyntaxError: unterminated triple-quoted string literal (detected at line 2) instead.
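
For anyone who wants to check the byte-level explanation, here is a minimal sketch. It is not from the original thread: the helper names count_byte and decodes_as_utf8 are hypothetical, and the sample file is a small stand-in written to a temporary path rather than the actual prompt-toolkit example.

```python
import os
import tempfile


def count_byte(path, byte_value=0xE2):
    """Count occurrences of one byte value in a file opened in 'rb' mode."""
    with open(path, "rb") as f:
        return f.read().count(bytes([byte_value]))


def decodes_as_utf8(data):
    """Return True if the bytes form complete, valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# U+2580 is the "square-looking" character; in UTF-8 it is three bytes.
encoded = "\u2580".encode("utf-8")  # b'\xe2\x96\x80'

# Stand-in source file (an assumption, not the real example script).
with tempfile.NamedTemporaryFile("wb", suffix=".py", delete=False) as tmp:
    tmp.write(b'art = """' + encoded * 3 + b'"""\n')
    sample_path = tmp.name

try:
    # Each copy of the character contributes one 0xE2 lead byte.
    print("0xE2 bytes in file:", count_byte(sample_path))
    # The full three-byte sequence decodes; the lead byte alone does not.
    print("full sequence valid:", decodes_as_utf8(encoded))
    print("lead byte alone valid:", decodes_as_utf8(encoded[:1]))
finally:
    os.remove(sample_path)
```

A text search for the two letters "xe2" finds nothing because the file contains the raw byte 0xE2, not the four-character escape that the error message prints.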

The-Alchemist commented 1 year ago

Sorry for wasting your time, @msimacek. I think I was accidentally using Python 3.7 instead. 🤦