Mangled sys.argv when command line arguments are non-ASCII #294

Open zmwangx opened 4 years ago

CPython on *ix systems decode command line arguments

with filesystem encoding and “surrogateescape” error handler

according to docs.

However, pyoxidize seems to attempt to set sys.argv to the raw bytes while sys.argv is eventually a list of strs, not bytes, so anything non-ASCII is mangled.

Consider this basically default pyoxidizer config, generating a Python REPL:

pyoxidizer.bzl

```bzl def make_dist(): return default_python_distribution() def make_exe(dist): policy = dist.make_python_packaging_policy() python_config = dist.make_python_interpreter_config() python_config.run_mode = "repl" exe = dist.to_python_executable( name = "python", packaging_policy = policy, config = python_config, ) return exe def make_embedded_resources(exe): return exe.to_embedded_resources() def make_install(exe): files = FileManifest() files.add_python_resource(".", exe) return files register_target("dist", make_dist) register_target("exe", make_exe, depends = ["dist"]) register_target("resources", make_embedded_resources, depends = ["exe"], default_build_script = True) register_target("install", make_install, depends = ["exe"], default = True) resolve_targets() # END OF COMMON USER-ADJUSTED SETTINGS. PYOXIDIZER_VERSION = "0.8.0" PYOXIDIZER_COMMIT = "UNKNOWN" ```

Now let's pass a Unicode argument. macOS:

$ ./build/x86_64-apple-darwin/release/install/python 中文  # 中文 is e4b8 ade6 9687 in UTF-8 encoding
Python 3.8.6 (default, Oct  3 2020, 13:58:55)
[Clang 10.0.1 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getfilesystemencoding(), sys.getdefaultencoding()
('utf-8', 'utf-8')
>>> sys.argv
['./build/x86_64-apple-darwin/release/install/python', '\xe4\xb8\xad\xe6\x96\x87']

On Linux it's even worse, I get an extra 0xdc inserted before each byte:

$ build/x86_64-unknown-linux-gnu/release/install/googler 中文
Python 3.8.6 (default, Oct  3 2020, 20:48:20)
[Clang 10.0.1 ] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getfilesystemencoding(), sys.getdefaultencoding()
('ascii', 'utf-8')
>>> sys.argv
['build/x86_64-unknown-linux-gnu/release/install/googler', '\udce4\udcb8\udcad\udce6\udc96\udc87']

(I can turn on ~~configure_locale~~ or utf8_mode to coerce sys.getfilesystemencoding() to utf-8, but sys.argv ends up being the same. Edit: configure_locale actually does work.)

Ultimately though whether an extra byte is inserted is irrelevant; as I said sys.argv is a str-based API, so I could hardly even code around this limitation, making pyoxidizer rather useless for anything that might involve non-ASCII command line arguments.

Reading #10 and https://github.com/indygreg/PyOxidizer/blob/7a222ac6fe12bd667869e2d47e75606f4717ebbc/pyembed/src/interpreter.rs#L429-L436 (but I haven't read the actual implementation) seems to suggest to me that valid Unicode arguments are supposed to be supported. Am I missing something obvious?

Yikes. This is a bad bug. Thank you for reporting!

The bug is almost certainly inside the pyembed crate as part of the linked code. That call to PySys_SetObject() to define sys.argv is very suspect. Especially in a Python 3.8 world where we should be using https://docs.python.org/3/c-api/init_config.html#c.PyConfig_SetArgv and https://docs.python.org/3/c-api/init_config.html#c.PyConfig_SetBytesArgv instead, as these will perform the string normalization that Python does.

A mistake in my original report: using configure_locale actually does work, unlike utf8_mode.

Now that I think about it, the mangled version on macOS is UTF-8 encoded string decoded using Latin-1 or something:

>>> '中文'.encode('utf-8').decode('latin-1', 'surrogateescape')
'ä¸\xadæ\x96\x87'

and the mangled version on Linux is UTF-8 encoded string decoded as ascii with surrogateescape:

>>> '中文'.encode('utf-8').decode('ascii', 'surrogateescape')
'\udce4\udcb8\udcad\udce6\udc96\udc87'

So the problem is osstr_to_pyobject tries to use the locale encoding (through PyUnicode_DecodeLocaleAndSize), but sys.argv is supposed to be decoded "with filesystem encoding and “surrogateescape” error handler", and there could be a mismatch here, which is manifest when on macOS or using UTF-8 mode elsewhere, and without locale (as is the default with pyoxidizer if I understand correctly).

Using PyConfig_SetBytesArgv with a raw argv pointer seems to be the safer bet here. Or maybe use Py_DecodeLocale coupled with PyUnicode_FromWideChar? Seeing that PyConfig_SetBytesArgv calls Py_DecodeLocale under the hood.

I just pushed a few commits to lightly refactor how argument initialization is handled. The added tests demonstrate some... odd behavior that still don't agree with python out of the box. I'm still wrapping my head around things. I'm wondering if we should set configure_locale or utf8_mode by default so built binaries behave more like python. I'll continue to poke at this. But any random insight others may have will probably be useful!

I'm particularly perplexed by what appears to be an endian mismatch when round tripping your example 中文 value on Windows! The UTF-16 u16 values are the same: just with different byte order!

Looking at Python/preconfig.c from CPython's code base:

# _PyPreConfig_InitCompatConfig() (called before the functions below)
    config->configure_locale = 1;

    /* bpo-36443: C locale coercion (PEP 538) and UTF-8 Mode (PEP 540)
       are disabled by default using the Compat configuration.

       Py_UTF8Mode=1 enables the UTF-8 mode. PYTHONUTF8 environment variable
       is ignored (even if use_environment=1). */
    config->utf8_mode = 0;
    config->coerce_c_locale = 0;
    config->coerce_c_locale_warn = 0;

# PyPreConfig_InitPythonConfig()
    /* Set to -1 to enable C locale coercion (PEP 538) and UTF-8 Mode (PEP 540)
       depending on the LC_CTYPE locale, PYTHONUTF8 and PYTHONCOERCECLOCALE
       environment variables. */
    config->coerce_c_locale = -1;
    config->coerce_c_locale_warn = -1;
    config->utf8_mode = -1;

# PyPreConfig_InitIsolatedConfig
    config->configure_locale = 0;
    config->utf8_mode = 0;

So the effective settings are:

Python:
    config->configure_locale = 1;
    config->utf8_mode = -1;
    config->coerce_c_locale = -1;
    config->coerce_c_locale_warn = -1;

Isolated:
    config->configure_locale = 0;
    config->utf8_mode = 0;
    config->coerce_c_locale = 0;
    config->coerce_c_locale_warn = 0;

That probably explains the differences, since PyOxidizer uses isolated mode by default.

@yuja: if you have any wisdom to share, I'd welcome your input!

PEP 587 has a nice comparison of Python and isolated configurations:

PyPreConfig	Python	Isolated
coerce_c_locale_warn	-1	0
coerce_c_locale	-1	0
configure_locale	1	0
dev_mode	-1	0
isolated	0	1
legacy_windows_fs_encoding	-1	0
use_environment	0	0
parse_argv	1	0
utf8_mode	-1	0

However, the isolated config is meant for embedding, enabling very precise control; the Python interpreter itself never seems to call PyPreConfig_InitIsolatedConfig, even though it has an "isolated mode" enabled by the -I option. If we look at https://github.com/python/cpython/blob/b67cbbda3a022cec5e2ad929f0531162166e7c8d/Python/pylifecycle.c#L808-L823, the interpreter calls _PyPreConfig_InitFromPreConfig which in turn invokes PyPreConfig_InitPythonConfig unconditionally. PEP 587 also documents -I (as well as Py_IsolatedFlag) as setting the isolated field to 1, while not explicitly affecting any other preconfig fields.

I patched CPython a bit to check exactly what -I affects. The patch:

diff --git a/Python/pylifecycle.c b/Python/pylifecycle.c
index 75d57805c0..0ac3a73661 100644
--- a/Python/pylifecycle.c
+++ b/Python/pylifecycle.c
@@ -817,6 +817,17 @@ _Py_PreInitializeFromPyArgv(const PyPreConfig *src_config, const _PyArgv *args)
         return status;
     }

+    printf("\n");
+    printf("coerce_c_locale_warn = %i\n", config.coerce_c_locale_warn);
+    printf("coerce_c_locale = %i\n", config.coerce_c_locale);
+    printf("configure_locale = %i\n", config.configure_locale);
+    printf("dev_mode = %i\n", config.dev_mode);
+    printf("isolated = %i\n", config.isolated);
+    printf("use_environment = %i\n", config.use_environment);
+    printf("parse_argv = %i\n", config.parse_argv);
+    printf("utf8_mode = %i\n", config.utf8_mode);
+    printf("\n");
+
     status = _PyPreConfig_Write(&config);
     if (_PyStatus_EXCEPTION(status)) {
         return status;

Now, on macOS, running the patched Python:

$ ./python.exe

coerce_c_locale_warn = 0
coerce_c_locale = 0
configure_locale = 1
dev_mode = 0
isolated = 0
use_environment = 1
parse_argv = 1
utf8_mode = 0

...

./python.exe -I

coerce_c_locale_warn = 0
coerce_c_locale = 0
configure_locale = 1
dev_mode = 0
isolated = 1
use_environment = 0
parse_argv = 1
utf8_mode = 0

...

so the isolated mode of the python executable seems to have configure_locale enabled anyway. I think that's probably a saner default.

UTF-8 mode on the other hand reduces a lot of I/O encoding woes so I personally like it, but I'm neutral on whether to make it default.

Whatever the eventual solution, IMO it's a good idea to handle non-ASCII (and non-Latin-1, I guess) arguments properly by default. Too many developers don't bother to test them, so if they're not handled by default, the problems tend to only be caught by users after the programs are shipped; and sometimes the problems manifest in totally non-obvious ways, causing massive confusion and wasting significant troubleshooting time. I only happened upon this thanks to an abundance of caution which I don't always practice...

AFAIK, Py_DecodeLocale() is the function to convert char **argv to Python sys.argv. Using any other encoding function would break the assumption.

https://www.mercurial-scm.org/repo/hg/rev/f2de8f31cb59

I think the value of configure_locale is a different issue, but enabling it would make sense because PyOxidizer is the tool to build a Python executable, not a library to be embedded in another application.

Thank you for the great comments @zmwangx and @yuja!

While I was going further down this rabbit hole, I realized that the existing Rust tests were all executing in the same process, using multiple threads. And since CPython uses global variables and calls functions that define global state (such as setlocale()), tests were effectively interfering with each other. I'm changing these low-level configuration tests to execute in separate processes so there is process isolation. This revealed a few issues with the code in main today.

I pushed a change to enable configure_locale by default. I've also added more tests demonstrating the wonky behavior without configure_locale. https://github.com/indygreg/PyOxidizer/blob/525f8b3c9e8e80591112c3abf9276e2939e22493/pyembed/src/test/interpreter_config.rs#L255 is probably the most interesting in terms of demonstrating bad behavior.

After enabling configure_locale, I can confirm that a test binary is able to decode a non-ascii argument:

$ ~/tmp/myapp/build/x86_64-unknown-linux-gnu/debug/install/myapp 中文
Python 3.8.6 (default, Oct  3 2020, 20:48:20)
[Clang 10.0.1 ] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.argv
['/home/gps/tmp/myapp/build/x86_64-unknown-linux-gnu/debug/install/myapp', '中文']

As part of writing the tests, I realized the cpython Rust crate will panic if the call to PyUnicode_AsUTF8AndSize() fails: https://github.com/dgrunwald/rust-cpython/blob/d77f80caaa36e323b74d904d9e2e03c2a7643d5a/src/objects/string.rs#L287 Yikes. I did run into this when writing tests.

Also, the cpython crate does have a mechanism to reflect the underlying storage type of PyString. However, it doesn't actually support materializing variants outside of UTF-8. This might be because PyUnicode_KIND(), PyUnicode_DATA(), and friends are macros. It isn't impossible to reimplement these macros in Rust. But it is unfortunate these only exist as macros and AFAICT there's no exported C function in Python's C API for accessing the underlying storage kind/data. Macro-only bindings because it difficult to write FFI bindings.

CC @vstinner in case you care about the Python C API feedback. You may also find the commentary about isolated mode interesting.

The table and the results of the python -I experiment in https://github.com/indygreg/PyOxidizer/issues/294#issuecomment-708969628 would be helpful to find in the config docs. Thanks, @zmwangx!

Hi,

I deeply rewrote the documentation of the Python initialization and encodings. Reviews are welcomed :-)

PEP 587 has a nice comparison of Python and isolated configurations:

I completed https://docs.python.org/dev/c-api/init_config.html#pyconfig documentation to document the default value of every single and PyPreConfig and PyConfig member.

I rewrote the documentation on how Python selects the filesystem encoding and error handlers:

https://docs.python.org/dev/c-api/init_config.html#c.PyConfig.filesystem_encoding
https://docs.python.org/dev/c-api/init_config.html#c.PyConfig.filesystem_errors
It's now documented at a single place to avoid the risk of having outdated copies of the documentation.

And the stdio encoding and error handler:

https://docs.python.org/dev/c-api/init_config.html#c.PyConfig.stdio_encoding

I documented the Python UTF-8 Mode:

https://docs.python.org/dev/library/os.html#utf8-mode

And added locale encoding and filesystem encoding and error handler entries to the glossary:

I added a warning to discourage users to use Py_DecodeLocale. PyConfig_SetBytesString() should be preferred since it ensures that Python is preinitialization:

The Python preinitialization can change the LC_CTYPE locale and enable the UTF-8 which impact the filesystem encoding and error handler. You must not call Py_DecodeLocale before Python is preinitialized.

Once Python is preinitialized, Python will stick to the filesystem encoding and error handler which cannot be changed at runtime on purpose.

So the problem is osstr_to_pyobject tries to use the locale encoding (through PyUnicode_DecodeLocaleAndSize), but sys.argv is supposed to be decoded "with filesystem encoding and “surrogateescape” error handler", and there could be a mismatch here, which is manifest when on macOS or using UTF-8 mode elsewhere, and without locale (as is the default with pyoxidizer if I understand correctly).

Low-level Py_DecodeLocale() and Py_EncodeLocale() functions must be use to decode from and encode to the filesystem encoding and error handlers:

These functions must not be used before Python is preinitialized: https://docs.python.org/dev/c-api/init_config.html#c-preinit

The filesystem encoding can be different than the the locale encoding in multiple cases: https://docs.python.org/dev/c-api/init_config.html#c.PyConfig.filesystem_encoding

Also, PyConfig_SetBytesString() is now recommended, instead of Py_DecodeLocale(), to decode byte strings. But it depends on your use case.

printf("coerce_c_locale_warn = %i\n", config.coerce_c_locale_warn);

import pprint, _testinternalcapi; pprint.pprint(_testinternalcapi.get_configs()['config']) can be used for debugging purpose to dump the current PyConfig configuration.

I think the value of configure_locale is a different issue, but enabling it would make sense because PyOxidizer is the tool to build a Python executable, not a library to be embedded in another application.

If you want to mimick Python behavior, I suggest you to use the "Python Configuration": https://docs.python.org/dev/c-api/init_config.html#python-configuration

Rather than the Isolated Configuration. But since there is no silver bullet, PyConfig gives you a full access to all configuration parameters to tune exactly Python as you wish ;-)

Anyway, good luck to initialize Python and use the correct encodings. These are two complex problems ;-)

@vstinner the new docs are amazing! Thank you for all your work to improve understanding about this complex and nuanced aspect of Python!

There is more. I'm working on a new API to be able to run Python code to configure Python (between the "core" and "main" initialization phases): https://bugs.python.org/issue42260

I'm rewriting Modules/getpath.c in Python to give control on how the Path Configuration is computed.

I pushed the first part, the bare minimum C API to do that in C. Dummy example just to modify the bytes_warning parameter:


static int tune_config(void)
{
    PyConfig config;
    PyConfig_InitPythonConfig(&config);
    if (_PyInterpreterState_GetConfigCopy(&config) < 0) {
        PyConfig_Clear(&config);
        PyErr_Print();
        return -1;
    }

    config.bytes_warning = 2;

    if (_PyInterpreterState_SetConfig(&config) < 0) {
        PyConfig_Clear(&config);
        return -1;
    }
    PyConfig_Clear(&config);
    return 0;
}

indygreg / PyOxidizer

Mangled sys.argv when command line arguments are non-ASCII #294