nat-n / poethepoet

A task runner that works well with poetry.
https://poethepoet.natn.io/
MIT License

Open file with encoding="utf-8" specified #19

Closed LussacZheng closed 3 years ago

LussacZheng commented 3 years ago

Issue

If there are Chinese characters in pyproject.toml, poe will fail to run tasks.
I am not sure if other special characters might cause the same problem.

Platform

pyproject.toml

For context, the Chinese word 你好 means "hello", and 作者, used in a later section, means "author". I chose these two words because they reliably reproduce the issue.

[tool.poetry]
name = "test"
version = "0.1.0"
description = ""
authors = ["Author"]

[tool.poetry.dependencies]
python = "^3.7"

[tool.poetry.dev-dependencies]
poethepoet = "^0.9.0"

[tool.poe.tasks]
t1 = { shell = "echo Hello" }
t2 = { shell = "echo 你好" }

[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"

The pyproject.toml was saved with UTF-8 encoding, and LF line-endings.
Inside poetry shell, poe t1 runs correctly, but poe t2 prints garbled text:

$ poetry shell
$ poe t1
Poe => echo Hello
Hello

$ poe t2
Poe => echo 浣犲ソ
浣犲ソ

It should print 你好 ("hello" in Chinese) rather than 浣犲ソ (meaningless mojibake). This is a typical symptom of UTF-8 bytes being decoded with the wrong encoding.

Reason

According to Python Standard Library documentation about the built-in function open():

encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any text encoding supported by Python can be used. See the codecs module for the list of supported encodings.

The default encoding seems to be "gbk" in a Windows 10 Simplified Chinese operating system.

$ python
# Python 3.7.9 (tags/v3.7.9:13c94747c7, Aug 17 2020, 18:58:18) [MSC v.1900 64 bit (AMD64)] on win32
# Type "help", "copyright", "credits" or "license" for more information.

>>> "你好".encode("utf-8")
b'\xe4\xbd\xa0\xe5\xa5\xbd'
>>> "浣犲ソ".encode("gbk")
b'\xe4\xbd\xa0\xe5\xa5\xbd'
>>> "你好".encode("utf-8").decode("gbk")
'浣犲ソ'
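
The same effect can be reproduced at the file level on any platform. The following is only a minimal sketch: a temporary file stands in for pyproject.toml, and the helper name read_with is made up for illustration (it is not part of poethepoet).

```python
import locale
import tempfile
from pathlib import Path

# The encoding open() falls back to when none is given (platform dependent).
print("default encoding:", locale.getpreferredencoding(False))


def read_with(path: Path, encoding: str) -> str:
    """Hypothetical helper: read a text file with an explicit encoding."""
    with path.open(encoding=encoding) as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a pyproject.toml saved as UTF-8.
    sample = Path(tmp) / "sample.toml"
    sample.write_text('t2 = { shell = "echo 你好" }\n', encoding="utf-8")

    as_utf8 = read_with(sample, "utf-8")  # correct decoding
    as_gbk = read_with(sample, "gbk")     # what a GBK default locale would do

print(as_utf8)
print(as_gbk)
```

Reading the UTF-8 file as GBK does not raise here because the six bytes of 你好 happen to pair up into valid GBK sequences, so the task silently receives the wrong text.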

After converting my pyproject.toml to GBK encoding, the poe task works.

$ poe -v t2
Poe => echo 你好
你好

Test

Another case supports this explanation (it is actually how I found the issue):

Change the 5th line of pyproject.toml from authors = ["Author"] to authors = ["作者"]
(作者 means "author" in Chinese):

[tool.poetry]
name = "test"
version = "0.1.0"
description = ""
authors = ["作者"]

[tool.poetry.dependencies]
# ...

Then Poe will crash when running tasks:

$ poe -v t1
Poe the Poet - A task runner that works well with poetry.
version 0.9.0

Error: Couldn't open file at D:\workspace\poe\pyproject.toml

USAGE
  poe [-h] [-v | -q] [--root PATH] [--ansi | --no-ansi] task [task arguments]

# ...

NO TASKS CONFIGURED

Inject the debug code into poethepoet/config.py:

https://github.com/nat-n/poethepoet/blob/996c75b9505e6923873a4ede976bc36863f32398/poethepoet/config.py#L160-L167

    def _read_pyproject(path: Path) -> Mapping[str, Any]:
        import traceback
        try:
            with path.open() as pyproj:
                return tomlkit.parse(pyproj.read())
        except tomlkit.exceptions.TOMLKitError as error:
            raise PoeException(f"Couldn't parse toml file at {path}", error) from error
        except Exception as error:
            traceback.print_exc()
            raise PoeException(f"Couldn't open file at {path}") from error

$ poe -v t1
Traceback (most recent call last):
  File "c:\users\lussac\appdata\local\pypoetry\cache\virtualenvs\test-q7sigs7x-py3.7\lib\site-packages\poethepoet\config.py", line 149, in _read_pyproject
    return tomlkit.parse(pyproj.read())
UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 83: illegal multibyte sequence

Poe the Poet - A task runner that works well with poetry.
# ...

$ python

>>> "作者".encode("utf-8").decode("gbk")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 4: illegal multibyte sequence
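
The change proposed in the issue title amounts to passing encoding="utf-8" when opening the file. Below is a minimal sketch of that idea, not the actual patch: read_pyproject_text is a hypothetical stand-in for the file-reading part of _read_pyproject, and the tomlkit parsing step is left out so the example runs with the standard library only.

```python
import tempfile
from pathlib import Path


def read_pyproject_text(path: Path) -> str:
    """Hypothetical stand-in for the file-reading part of _read_pyproject."""
    # Explicit encoding, so the result does not depend on
    # locale.getpreferredencoding().
    with path.open(encoding="utf-8") as pyproj:
        return pyproj.read()


with tempfile.TemporaryDirectory() as tmp:
    pyproject = Path(tmp) / "pyproject.toml"
    pyproject.write_text('authors = ["作者"]\n', encoding="utf-8")

    text = read_pyproject_text(pyproject)

    # Decoding the same bytes as GBK fails, matching the traceback above.
    try:
        pyproject.read_bytes().decode("gbk")
        gbk_failed = False
    except UnicodeDecodeError:
        gbk_failed = True

print(text)
print("GBK decode failed:", gbk_failed)
```

With the encoding fixed at the call site, the file reads the same way on a GBK-locale Windows machine as on a UTF-8 system.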

Please let me know if there are any mistakes above.


References

  1. codecs — Codec registry and base classes — Python documentation
  2. utf 8 - python 3.0 open() default encoding - Stack Overflow
  3. Python 3 Default Encoding cp1252 - Stack Overflow

LussacZheng commented 3 years ago

There are several open() calls in tests/conftest.py. I am not sure whether I should make the same change to them.

nat-n commented 3 years ago

@LussacZheng thanks for investigating (and explaining) this issue. I've learned something.

It wouldn't hurt to make the open calls in conftest.py explicit as well, although it shouldn't really matter since these will only refer to files within the project.

If you rebase onto master then the CI tests should work again.