2c0f8f82-20db-49df-9b06-50e1d6b36c8f commented 8 years ago

BPO	28180
Nosy	@malemburg, @warsaw, @ronaldoussoren, @ncoghlan, @vstinner, @ned-deily, @mcepl, @ezio-melotti, @bitdancer, @methane, @4kir4, @xdegaye, @yan12125, @Vgr255
PRs	python/cpython#659 python/cpython#2130 python/cpython#2155 python/cpython#2208 python/cpython#4334
Dependencies	bpo-30565: PEP 538: silence locale coercion and compatibility warnings by default?; bpo-30635: Leak in test_c_locale_coercion; bpo-30647: CODESET error on AMD64 FreeBSD 10.x Shared 3.x caused by the PEP 538
Files	fedora-cpython-force-c-utf-8.diff: Downstream patch currently proposed for Fedora 26; fedora-cpython-PYTHONALLOWCLOCALE.diff: Draft Fedora 26 patch as at 2016-12-18; pep538_coerce_legacy_c_locale.diff: Initial patch for PEP 538 (targeting 3.7); pep538_coerce_legacy_c_locale_v2.diff: Add test cases for handling of unknown locales; pep538-check-click.sh: Utility script to check click's behaviour in a PEP 538 patched CPython; pep538_coerce_legacy_c_locale_v3.diff: Refactor PEP 538 test cases to cover no locale setting, C locale, POSIX locale and unknown locale; android_setlocale.patch; pep538_coerce_legacy_c_locale.patch: Unfinished attempt to port this patch to Python 3.4

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

GitHub fields:

```python
assignee = 'https://github.com/ncoghlan'
closed_at =
created_at =
labels = ['type-bug', '3.7', 'expert-unicode']
title = 'Implementation of the PEP 538: coerce C locale to C.utf-8'
updated_at =
user = 'https://bugs.python.org/JanNiklasHasse'
```

bugs.python.org fields:

```python
activity =
actor = 'mcepl'
assignee = 'ncoghlan'
closed = True
closed_date =
closer = 'ncoghlan'
components = ['Unicode']
creation =
creator = 'Jan Niklas Hasse'
dependencies = ['30565', '30635', '30647']
files = ['45907', '45951', '46059', '46121', '46190', '46205', '46329', '48991']
hgrepos = []
issue_num = 28180
keywords = ['patch']
message_count = 89.0
messages = ['276693', '276694', '276707', '276709', '276722', '276729', '277273', '277274', '282964', '282965', '282970', '282971', '282972', '282977', '282978', '282984', '283244', '283408', '283409', '283469', '283471', '283482', '283495', '283515', '283543', '283732', '284150', '284170', '284176', '284537', '284605', '284620', '284621', '284631', '284641', '284647', '284697', '284716', '284718', '284719', '284720', '284722', '284725', '284729', '284736', '284742', '284747', '284764', '284782', '284794', '284795', '284799', '284882', '284884', '284886', '284887', '284900', '284908', '284943', '284952', '285735', '286001', '289002', '289534', '295121', '295683', '295688', '295698', '295710', '295713', '295722', '295871', '295872', '295875', '295885', '295913', '295914', '296064', '296075', '296077', '305850', '306108', '314627', '364740', '364760', '364767', '364770', '364804', '364810']
nosy_count = 16.0
nosy_names = ['lemburg', 'barry', 'ronaldoussoren', 'ncoghlan', 'vstinner', 'ned.deily', 'mcepl', 'ezio.melotti', 'r.david.murray', 'methane', 'akira', 'Sworddragon', 'xdegaye', 'yan12125', 'abarry', 'Jan Niklas Hasse']
pr_nums = ['659', '2130', '2155', '2208', '4334']
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue28180'
versions = ['Python 3.7']
```

2c0f8f82-20db-49df-9b06-50e1d6b36c8f commented 8 years ago

Working with Docker I often end up with an environment where the locale isn't correctly set. In these cases it would be great if sys.getfilesystemencoding() could default to 'utf-8' instead of 'ascii', as it's the encoding of the future and ascii is a subset of it anyway.
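
To see what the interpreter actually picked in a given container, a small check along these lines can help. It is only a sketch: the helper name `describe_text_encodings` is made up for illustration, and the values it reports depend entirely on the environment it runs in.

```python
import locale
import sys


def describe_text_encodings():
    """Collect the encodings this interpreter chose for OS-facing text."""
    return {
        # Used for file names, command line arguments and environment variables.
        "filesystem": sys.getfilesystemencoding(),
        # Default for open() when no encoding is given; False means
        # "do not call setlocale() as a side effect of asking".
        "preferred": locale.getpreferredencoding(False),
        # stdout may have been replaced by an object without an encoding.
        "stdout": getattr(sys.stdout, "encoding", None),
    }


for name, value in describe_text_encodings().items():
    print(f"{name}: {value}")
```

Running it once with the container's default environment and once with `LANG=C.UTF-8` set shows whether the locale is what drives the result.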

cd293b3e-6d38-412b-8370-a46a9aaee518 commented 8 years ago

This is a duplicate of bpo-27781.

vstinner commented 8 years ago

This is a duplicate of bpo-27781.

bpo-27781 is specific to Windows. I'm not sure that's the case in this issue, so I'm reopening it.

@Jan Niklas Hasse: What is your OS?

I proposed to add "-X utf8" command line option for UNIX to force utf8 encoding. Would it work for you?

2c0f8f82-20db-49df-9b06-50e1d6b36c8f commented 8 years ago

Unfortunately no, as this would mean I'd have to change all the python invocations in my scripts, and it wouldn't work for executable files with

#!/usr/bin/env python3

would it?

bitdancer commented 8 years ago

I thought we "fixed" this by using surrogate escape when the locale was ASCII? We certainly have discussed changing the default on posix and so far have decided not to (someday that will change...is this someday already?)

vstinner commented 8 years ago

is this someday already?)

Not yet :-)

2c0f8f82-20db-49df-9b06-50e1d6b36c8f commented 8 years ago

Why not?

methane commented 8 years ago

I want a locale-free Python which behaves as if it were running in the C.UTF-8 locale. (stdio encoding, preferred encoding, weekday in _strptime._strptime, and maybe more)

But Python 3.6 is already in feature freeze >_<;;

ncoghlan commented 7 years ago

I think we're genuinely getting to the point now where the majority of "LANG=C" cases are misconfigurations rather than intended behaviour. We're also to the point where:

on Mac OS X, binary system interfaces have been handled as UTF-8 by default since 3.0
on Windows, as of 3.6, the OS native binary system interfaces are now bypassed entirely in favour of transcoding from UTF-8 to UTF-16-LE

So I think for Python 3.7 it makes sense to do the following on other *nix systems:

very early in CPython startup (even before argument processing), if the detected locale is "C", force it to "C.UTF-8" if possible, and print a warning either way
add a PYTHONKEEPASCIILOCALE environment variable to turn that behaviour off

I do think we actually want to *change* the C level locale in the process though, as otherwise we can expect to see weird interactions where CPython and extension modules disagree about the default text encoding.

ncoghlan commented 7 years ago

Note also that if we say we're going to do this for 3.7, *and* go ahead and implement it, then distros may be more inclined to incorporate the same behavioural changes into distro-provided releases of 3.6, providing real world testing of the concept before we make it the default behaviour.

2c0f8f82-20db-49df-9b06-50e1d6b36c8f commented 7 years ago

Actually in a new Docker container, the LANG variable isn't set at all. Defaulting to UTF-8 in that case should be easier to reason about, shouldn't it?

ncoghlan commented 7 years ago

From CPython's point of view, glibc behaves the same way (i.e. reporting ascii as the preferred encoding for operating system interfaces) regardless of whether the cause is the locale not being set at all, or due to it being explicitly set to the legacy POSIX locale via LANG=C.

2c0f8f82-20db-49df-9b06-50e1d6b36c8f commented 7 years ago

https://sourceware.org/glibc/wiki/Proposals/C.UTF-8#Defaults mentions that C.UTF-8 should be glibc's default.

This bug report also mentions Python: https://sourceware.org/bugzilla/show_bug.cgi?id=17318 It hasn't been fixed yet, though :/

malemburg commented 7 years ago

If we just restrict this to the file system encoding (and not the whole LANG setting), how about:

default the file system encoding to 'utf-8' and use the surrogate escape handler as default error handler
add a PYTHONFSENCODING env var to set the file system encoding to something else (*)

(*) I believe we discussed this at some point already, but don't remember the outcome.

Regarding the questions of defaulting to LANG=C.UTF-8: I think this needs some more thought, since it would also affect many C locale aware functions. To make this work, Python would have to call setlocale() early on in the startup phase to adjust the C lib accordingly.

methane commented 7 years ago

Sorry for the confusion. I didn't mean defaulting to LANG=C.UTF-8.

I meant using UTF-8 as the default fsencoding and stdio encoding regardless of the locale, and having locale.getpreferredencoding() return 'utf-8' when LC_CTYPE is ascii.

ncoghlan commented 7 years ago

The challenge that arises in being selective about this is that "sys.getfilesystemencoding()" is actually a misnomer, and some of the things we use it for (like decoding command line arguments and environment variables) necessarily happen *really* early in the interpreter bootstrapping process. The bugs that arise from being internally inconsistent are then even harder to debug than those that arise from believing the OS when it says the right encoding to use is ASCII - the latter at least don't tend to be subtle, and are amenable to being resolved via "LC_ALL=C.UTF-8" and "LANG=C.UTF-8".

I believe Victor put quite a bit of time into trying to get more selective approaches to work reliably and eventually gave up.

For Fedora 26, I'm going to explore the feasibility of patching our system 3.6 installation such that the python3 command itself (rather than the shared library) checks for "LC_CTYPE=C" as almost the first thing it does, and forcibly sets LANG and LC_ALL to C.UTF-8 if it gets an answer it doesn't like. If we're able to do that successfully in the more constrained environment of a specific recent Fedora release, then I think it will bode well for doing something similar by default in CPython 3.7

ncoghlan commented 7 years ago

Downstream Fedora issue proposing the above idea for F26: https://bugzilla.redhat.com/show_bug.cgi?id=1404918

I've also attached the patch from that issue here.

vstinner commented 7 years ago

Victor>> I proposed to add "-X utf8" command line option for UNIX to force utf8 encoding. Would it work for you?

Jan Niklas Hasse> Unfortunately no, as this would mean I'll have to change all my python invocations in my scripts and it wouldn't work for executable files with "#!/usr/bin/env python3" would it?

Usually, when a new option is added to Python, we add a command line option (-X utf8) but also an environment variable: I propose PYTHONUTF8=1.

Use your favorite method to define the env var "system wide" in your docker containers.

Note: Technically, I'm not sure that it's possible to support -E option with PYTHONUTF8, since -E comes from the command line, and we first need to decode command line arguments with an encoding to parse these options.... Chicken-and-egg issue ;-)
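
A rough way to compare the two proposed switches is to start a child interpreter each way and ask it whether UTF-8 mode is on. This is a sketch under assumptions: it presumes the interpreter running it is Python 3.7 or later, where `-X utf8`, `PYTHONUTF8` and `sys.flags.utf_mode` exist in the form that was eventually released, and `child_utf_mode` is a hypothetical helper name.

```python
import os
import subprocess
import sys

# The child only reports whether it considers UTF-8 mode enabled.
PROBE = "import sys; print(sys.flags.utf_mode)"


def child_utf_mode(extra_args=(), extra_env=None):
    """Run a child interpreter and return its sys.flags.utf_mode as text."""
    env = dict(os.environ)
    env.pop("PYTHONUTF8", None)  # start from a known state
    env.update(extra_env or {})
    result = subprocess.run(
        [sys.executable, *extra_args, "-c", PROBE],
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.strip()


print("-X utf8      ->", child_utf_mode(extra_args=("-X", "utf8")))
print("PYTHONUTF8=1 ->", child_utf_mode(extra_env={"PYTHONUTF8": "1"}))
```

The environment variable form is the one that survives a `#!/usr/bin/env python3` shebang, since nothing has to be added to the command line.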

vstinner commented 7 years ago

I believe Victor put quite a bit of time into trying to get more selective approaches to work reliably and eventually gave up.

Yeah, it just doesn't work to use more than one encoding per process. You should use the same encoding for the whole lifetime of a process.

If you decode early data from an encoding A and later encode it back to encoding B, you get mojibake. The problem is simple.

Using more than one encoding per process means starting to make assumptions about how data is used. For example, consider that environment variables use the encoding A, but filenames should use the encoding B. But then what if an environment variable contains a filename? Similar issues arise for command line arguments, subprocess pipes, standard streams (sys.std*), etc.
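
The mojibake case is easy to reproduce in isolation. In this sketch the two encodings (UTF-8 as the real one, Latin-1 as the wrongly assumed one) and the sample string are chosen purely for illustration:

```python
original = "naïve café"

# The bytes as they would arrive from the OS, really encoded as UTF-8.
raw = original.encode("utf-8")

# Early in the process the data is decoded with encoding A (Latin-1 here).
garbled = raw.decode("latin-1")
print(garbled)  # naÃ¯ve cafÃ©

# Later it is encoded with encoding B (UTF-8): the bytes no longer match.
written_back = garbled.encode("utf-8")
print(written_back == raw)  # False

# Sticking to one encoding for the whole round trip preserves the bytes,
# even though the intermediate text was wrong.
print(garbled.encode("latin-1") == raw)  # True
```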

ncoghlan commented 7 years ago

We've been discussing this further downstream in the Fedora Python SIG, and we have a draft approach that we're pretty sure will work for us (based in turn on the approach Armin Ronacher came up with for click), and we think it should work for other distros as well (as long as they already ship the C.UTF-8 locale, and if they don't, they should fix that limitation anyway).

So I'm assigning this to myself as I think the next step will be to write a PEP that both proposes the specific idea as the default behaviour in 3.7, and also encourages distros to opt-in to trialling it as a downstream patch for 3.6.

ncoghlan commented 7 years ago

Making an explicit note of this so I remember to mention it in the draft PEP: one of the biggest problems that arises in any attempt at a Python-only solution to overriding the locale is that we can end up disagreeing with C/C++ extensions, and this is *especially* a problem when sharing a process with GUI frameworks like Tcl/Tk, Qt, and GTK (since they tend to read the process-wide settings, rather than querying anything that CPython configures during normal operation).

So the approach I'm proposing is to implement a C->C.UTF-8 locale override in the *actual python CLI executable*, and then in the dynamically linked library we only emit a warning if we detect the C locale, we don't actually do anything to change it.

malemburg commented 7 years ago

On 17.12.2016 08:56, Nick Coghlan wrote:

Making an explicit note of this so I remember to mention it in the draft PEP: one of the biggest problems that arises in any attempt at a Python-only solution to overriding the locale is that we can end up disagreeing with C/C++ extensions, and this is *especially* a problem when sharing a process with GUI frameworks like Tcl/Tk, Qt, and GTK (since they tend to read the process-wide settings, rather than querying anything that CPython configures during normal operation).

Another use case to consider is embedding the Python interpreter in another application. In such situations, the C locale will usually already be set by the main application and it may conflict with the LANG or other locale env var settings, since the user may have chosen to use a different locale in the context of the application.

ncoghlan commented 7 years ago

On 17 December 2016 at 20:15, Marc-Andre Lemburg <report@bugs.python.org> wrote:

Another use case to consider is embedding the Python interpreter in another application. In such situations, the C locale will usually already be set by the main application and it may conflict with the LANG or other locale env var settings, since the user may have chosen to use a different locale in the context of the application.

Aye, that's the origin of the split proposal to only emit a warning in the shared library (since CPython might only be a piece of a larger application), but implement actual locale coercion (by overriding LANG and LC_ALL in the process environment) in the command line app's main() function (as in that case we know CPython *is* the application).

The hard part of writing the PEP isn't really going to be explaining the proposal itself (I expect it to be around a 20 line patch to the C code) - it's going to be explaining why all the other possibilities we've considered over the years don't work, and why we (as in the Fedora Python SIG) think this one actually stands a chance of working properly :)

2c0f8f82-20db-49df-9b06-50e1d6b36c8f commented 7 years ago

Usually, when a new option is added to Python, we add a command line option (-X utf8) but also an environment variable: I propose PYTHONUTF8=1.

Use your favorite method to define the env var "system wide" in your docker containers.

This doesn't help me, as I already set LANG to C.utf-8.

I'm rather thinking about new people trying out Python in Docker who don't know about this.

Furthermore I think that UTF-8 is the future and the use of ASCII should be discouraged.

ncoghlan commented 7 years ago

For folks not following the Fedora BZ issue directly, I've also attached the latest draft downstream patch here, which gives the following behaviour:

==========================

$ ./python -c "import sys; print(sys.getfilesystemencoding())"
utf-8

$ LANG=C.UTF-8 ./python -c "import sys; print(sys.getfilesystemencoding())"
utf-8

$ LANG=C ./python -c "import sys; print(sys.getfilesystemencoding())"
Python detected LC_CTYPE=C, forcing LC_ALL & LANG to C.UTF-8 (set PYTHONALLOWCLOCALE to disable this behaviour).
utf-8

$ PYTHONALLOWCLOCALE=1 LANG=C ./python -c "import sys; print(sys.getfilesystemencoding())"
Python detected LC_CTYPE=C, but PYTHONALLOWCLOCALE is set. Some libraries, applications, and operating system interfaces may not work correctly.
Py_Initialize detected LC_CTYPE=C, which limits Unicode compatibility. Some libraries and operating system interfaces may not work correctly. Use `PYTHONALLOWCLOCALE=1 LC_CTYPE=C python3` to configure a similar environment when running Python directly.
ascii

==========================

(The double warning in the last example is likely to go away by skipping the CLI level warning in that case)

The Python tests checking for the expected behaviour are significantly longer than the C level changes needed to implement it :)
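
For readers who want the decision logic without reading the C patch, the behaviour shown above can be modelled roughly as follows. This is a Python approximation, not the patch itself: `effective_lc_ctype` and `coerce_c_locale` are hypothetical names, the model treats an unset or empty variable as "not configured", and it ignores the unknown-locale case where `setlocale()` fails.

```python
LEGACY_LOCALES = {"C", "POSIX"}


def effective_lc_ctype(env):
    """Return the locale name that would govern LC_CTYPE for this environment.

    LC_ALL takes precedence over LC_CTYPE, which takes precedence over LANG;
    with none of them set, the C library reports the "C" locale.
    """
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"


def coerce_c_locale(env, target="C.UTF-8"):
    """Return (new_env, warning) following the rules in the examples above."""
    new_env = dict(env)
    if effective_lc_ctype(env) not in LEGACY_LOCALES:
        return new_env, None  # a real locale is configured: leave it alone
    if env.get("PYTHONALLOWCLOCALE"):
        return new_env, "LC_CTYPE=C kept because PYTHONALLOWCLOCALE is set"
    new_env["LC_ALL"] = target
    new_env["LANG"] = target
    return new_env, f"detected LC_CTYPE=C, forcing LC_ALL & LANG to {target}"


for sample in ({}, {"LANG": "C.UTF-8"}, {"LANG": "C"},
               {"LANG": "C", "PYTHONALLOWCLOCALE": "1"}):
    print(sample, "->", coerce_c_locale(sample))
```

The empty-environment case is the fresh Docker container scenario: nothing is set, so the effective locale is "C" and the override applies.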

vstinner commented 7 years ago

Previous related work:

changeset: 89836:bc06f67234d0
user: Victor Stinner <victor.stinner@gmail.com>
date: Tue Mar 18 01:18:21 2014 +0100
files: Doc/whatsnew/3.5.rst Lib/test/test_sys.py Misc/NEWS Python/pythonru
description: Issue bpo-19977: When the ``LC_TYPE`` locale is the POSIX locale (``C`` locale), :py:data:`sys.stdin` and :py:data:`sys.stdout` are now using the ``surrogateescape`` error handler, instead of the ``strict`` error handler.
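
As a reminder of what the ``surrogateescape`` handler buys, here is a minimal round trip. The byte string is an invented example of a name that is not valid UTF-8:

```python
# A byte string that is not valid UTF-8 (0xff can never appear in UTF-8).
raw = b"report-\xff.txt"

# surrogateescape smuggles each undecodable byte through as a lone
# surrogate code point in the range U+DC80..U+DCFF.
text = raw.decode("utf-8", errors="surrogateescape")
print(ascii(text))  # 'report-\udcff.txt'

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", errors="surrogateescape")
print(restored == raw)  # True

# The strict handler refuses to encode the lone surrogate.
try:
    text.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
print(strict_failed)  # True
```

This is why the handler suits data that is only passed through, and why it can surprise code that later encodes the text strictly.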

ncoghlan commented 7 years ago

I've now written this up as a PEP: https://github.com/python/peps/blob/master/pep-0538.txt

The latest attached patch implements the specific design proposed in the PEP. Relative to the last Fedora specific patch, this tweaks the warning message wording slightly, and only emits the library level warning when PYTHONALLOWCLOCALE is set:

======================

$ LANG=C ./python -c "import sys; print(sys.getfilesystemencoding())"
Python detected LC_CTYPE=C, forcing LC_ALL & LANG to C.UTF-8 (set PYTHONALLOWCLOCALE to disable this locale coercion behaviour).
utf-8

======================
$ PYTHONALLOWCLOCALE=1 LANG=C ./python -c "import sys; print(sys.getfilesystemencoding())"
Py_Initialize detected LC_CTYPE=C, which limits Unicode compatibility. Some libraries and operating system interfaces may not work correctly. Set `PYTHONALLOWCLOCALE=1 LC_CTYPE=C` to configure a similar environment when running Python directly.
ascii

2c0f8f82-20db-49df-9b06-50e1d6b36c8f commented 7 years ago

The only important case for me: what happens when LANG is unset?

ncoghlan commented 7 years ago

If nothing is configured (i.e. none of LC_ALL, LC_CTYPE or LANG are set in the environment), then C reports the locale as "C". It's probably worthwhile for me to add a Background section to the PEP that explains the behaviour of setlocale at the C level, as that's the source of the majority of the problems, as well as the key mechanism used to implement the locale coercion.

ncoghlan commented 7 years ago

Updated patch adds some tests showing that this change should also help with cases where SSH environment forwarding results in an unknown locale being requested in the server environment.

methane commented 7 years ago

I read PEP-538, but I can't understand why just using UTF-8 when the locale is C, as on macOS, is a bad idea.

ncoghlan commented 7 years ago

On Mac OS X, the XCode libc already ignores the locale settings and just uses UTF-8 as the default text encoding, so the hardcoding in CPython aligns with that behaviour.

That isn't the case on other *nix systems - there, we need CPython to be consistent with the configured C/C++ locale, *and* we need it to be using something other than ASCII as the default encoding.

Answer: coerce the default locale from C to C.UTF-8 (if available), or to en_US.UTF-8 (for older distros that don't provide C.UTF-8). (The latter aspect isn't in the PEP yet, it's an improvement that came up in the linux-sig discussions: https://github.com/python/peps/issues/171 )

methane commented 7 years ago

That isn't the case on other *nix systems - there, we need CPython to be consistent with the configured C/C++ locale, *and* we need it to be using something other than ASCII as the default encoding.

Isn't using UTF-8 as filesystem encoding and stdin/stdout encoding consistent with C or POSIX locale?

Don't "modern" programming environments (Rust, Go, node.js) use UTF-8 even if locale is C or POSIX?

methane commented 7 years ago

I'm sorry. I should search the old discussions about why we can't simply use utf-8 for fsencoding in the C locale, instead of asking here.

ncoghlan commented 7 years ago

The default encoding in the C/POSIX locale is ASCII (which is the entire source of the problem).

The initial version of the PEP I uploaded didn't explain that background, but I added a section about it in the update earlier this week: https://www.python.org/dev/peps/pep-0538/#background

vstinner commented 7 years ago

The default encoding in the C/POSIX locale is ASCII (which is the entire source of the problem).

The reality is more complex than that :-) It depends on the OS.

Some OSes use Latin1 for the POSIX locale. Some OSes announce Latin1 for the POSIX locale, but use ASCII in practice :-) On these lying OSes, Python decodes bytes 0x80..0xff using mbstowcs() to check if we get ASCII or Latin1: see the check_force_ascii() function.

/* Workaround FreeBSD and OpenIndiana locale encoding issue with the C locale. On these operating systems, nl_langinfo(CODESET) announces an alias of the ASCII encoding, whereas mbstowcs() and wcstombs() functions use the ISO-8859-1 encoding. The problem is that os.fsencode() and os.fsdecode() use locale.getpreferredencoding() codec. For example, if command line arguments are decoded by mbstowcs() and encoded back by os.fsencode(), we get a UnicodeEncodeError instead of retrieving the original byte string.

The workaround is enabled if setlocale(LC_CTYPE, NULL) returns "C", nl_langinfo(CODESET) announces "ascii" (or an alias to ASCII), and at least one byte in range 0x80-0xff can be decoded from the locale encoding. The workaround is also enabled on error, for example if getting the locale failed.

(...) */

methane commented 7 years ago

On Linux, I think most people want UTF-8:surrogateescape by default, without fighting against locale and environment variables.

There is already an #if defined(__APPLE__) || defined(__ANDROID__) path for it. How about adding a configure option to use the same logic? (say --with-encoding=(locale|utf-8), with the preferred encoding changed in the same way).

It may help the many people who build Python themselves without the root privileges needed to generate the C.UTF-8 locale.

ncoghlan commented 7 years ago

Anything purely on the Python side of things doesn't work in a traditional C environment - CPython relies on the C lib to do conversions during startup, so we need the C locale to be set correctly. We can do things differently on Mac OS X and iOS because Apple ensure that *C* behaves differently on Mac OS X and iOS (and apparently Google do something similar for Android, so I'll update the PEP to mention that as well).

methane commented 7 years ago

Anything purely on the Python side of things doesn't work in a traditional C environment - CPython relies on the C lib to do conversions during startup, so we need the C locale to be set correctly.

What I propose is not to use mbstowcs, as is done for __ANDROID__:

wchar_t*
Py_DecodeLocale(const char* arg, size_t *size)
{
#if defined(__APPLE__) || defined(__ANDROID__)
    wchar_t *wstr;
    wstr = _Py_DecodeUTF8_surrogateescape(arg, strlen(arg));

On Linux, command line arguments and file paths are just byte sequences. So using UTF-8:surrogateescape from startup onwards should work fine.

Am I wrong?

malemburg commented 7 years ago

On 05.01.2017 10:26, Nick Coghlan wrote:

Anything purely on the Python side of things doesn't work in a traditional C environment - CPython relies on the C lib to do conversions during startup, so we need the C locale to be set correctly. We can do things differently on Mac OS X and iOS because Apple ensure that *C* behaves differently on Mac OS X and iOS (and apparently Google do something similar for Android, so I'll update the PEP to mention that as well).

I believe INADA-san (hope that's the right way to address him) raised a good point though: what if a system doesn't come with the C.UTF-8 locale set up?

The C lib would then error out when trying to use setlocale() on such an environment.

Now, Python's main() function doesn't look at any such errors (and neither do the other places which use it such as frozenmain.c and readline.c), so it wouldn't even notice.

The setlocale() man-page doesn't mention how such a failure would affect the current locale settings. My guess is that the locale remains set to what it was before, which in case of a fresh C application start is the "C" locale.

So in the implementation of the PEP, there should be a test to see whether "C.UTF-8" does result in a successful call to setlocale(). If it doesn't, there would have to be some work-around to still make Python's FS encoding happy while leaving the C lib locale set at "C".
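
The kind of availability test described here can be tried from Python before touching any C code. The following is a sketch only: `first_available_locale` is a hypothetical helper, and the candidate list simply collects names mentioned in this thread rather than anything the PEP mandates.

```python
import locale


def first_available_locale(candidates=("C.UTF-8", "C.utf8", "en_US.UTF-8")):
    """Return the first candidate that setlocale() accepts, or None.

    The LC_CTYPE setting of the current process is restored afterwards,
    so the probe has no lasting effect.
    """
    saved = locale.setlocale(locale.LC_CTYPE)  # query only, no change
    try:
        for name in candidates:
            try:
                locale.setlocale(locale.LC_CTYPE, name)
            except locale.Error:
                continue  # this name is not installed here; try the next one
            return name
        return None
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)


print("usable UTF-8 locale:", first_available_locale())
```

A result of `None` is the situation raised above: the C library stays in the "C" locale and some other work-around is needed for the file system encoding.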

methane commented 7 years ago

Why I want to add a configure option to ignore the locale:

C.UTF-8 is not supported by RHEL7 (https://bugzilla.redhat.com/show_bug.cgi?id=1361965)

RHEL7 will be used for a long time, and many people use a newer Python instead of the distro's Python, via pyenv or pythonz. I feel deprecating the C locale in Python 3.7 is a bit aggressive.

Many admins like the C locale.

Locale settings can cause unintended side effects, so many admins dislike xx_XX.UTF-8 locales. For example (from https://fumiyas.github.io/2016/12/25/dislike.sh-advent-calendar.html ):

$ mkdir tmp
$ cd tmp
$ touch a b c x y z A B C X Y Z
$ LC_ALL=C /bin/bash --noprofile --norc -c 'echo [A-Z]'
A B C X Y Z
$ LC_ALL=en_US.UTF-8 /bin/bash --noprofile --norc -c 'echo [A-Z]'
A b B c C x X y Y z Z

Many other languages can use UTF-8 even in the C locale

node.js, Ruby, Rust, Go can use UTF-8 on Linux.
People don't want to learn how to configure the locale properly only for Python.

ncoghlan commented 7 years ago

No, requesting a locale that doesn't exist doesn't error out, because we don't check the return code - it just keeps working the same way it does now (i.e. falling back to the legacy C locale).

However, it would be entirely reasonable to put together a competing PEP proposing to eliminate the reliance on the problematic libc APIs, and instead use locale independent replacements. I'm simply not offering to implement or champion such a PEP myself, as I think ignoring the locale settings rather than coercing them to something more sensible will break integration with C/C++ GUI toolkits like Tcl/Tk, Gtk, and Qt, and it's reasonable for us to expect OS providers to offer at least one of C.UTF-8 or en_US.UTF-8 (see https://github.com/python/peps/issues/171 for more on that).

ncoghlan commented 7 years ago

The PEP already explains how other runtimes achieve UTF-8 and UTF-16-LE everywhere: by ignoring the C/C++ locale entirely. While this breaks integration with other C/C++ components, the developers of those languages and runtimes simply don't care, as they never supported integrating with those components in the first place.

CPython doesn't have that luxury, since it is used extensively in locale aware desktop applications.

vstinner commented 7 years ago

Sorry, I still didn't have enough time to read carefully the PEP-538. But since the discussion already started on this issue, I will add my comments:

I'm sure that many Linux, UNIX and BSD systems don't have the "C.UTF-8" locale. For example, HP-UX has "C.utf8" which is not exactly "C.UTF-8".
Setting the locale has an impact on all libraries running in the Python process. At this point, I'm not sure that it is what we want.
I'm not sure that it's ok in 2017 to always force the UTF-8 encoding if the user locale uses a different encoding. I had the same concern with the PEP-528 (Change Windows console encoding to UTF-8) and PEP-529 (Change Windows filesystem encoding to UTF-8) on Windows, but these PEPs were approved and merged into Python 3.6. My fear is obviously mojibake with the other applications using the other encoding, the locale encoding. Other applications are not impacted by setlocale() in the Python process.
I proposed an opt-in option to force UTF-8: -X utf8 command line option and PYTHONUTF8=1 env var. Opt-in will obviously reduce the risk of backward compatibility issues. With an opt-in option, users are better prepared for mojibake issues.
I dislike "Backporting to earlier Python 3 releases". In my experience, changes on how Python handles text (encodings, codecs, etc.) always have subtle issues, and users dislike getting backward incompatible changes in minor releases. *Maybe* if the option is an opt-in, the risk is lower and acceptable?
I dislike that Fedora has such a downstream change. I would prefer to decide upstream how to slowly make UTF-8 a first-class citizen in Python. Otherwise, Fedora would behave differently than other Linux distributions and it can be painful to write applications having the same behaviour on all Linux distributions. But I also understand that Fedora sometimes has to move faster than the slow CPython project :-) Fedora can also be seen as a toy to experiment with changes quickly, which helps to provide wide feedback upstream to take better decisions.
Using the strict or surrogateescape error handler is a very important choice which has a wide impact. If we use utf8 by default (PEP-538), people will probably complain less if Python magically passes undecoded bytes through thanks to surrogateescape. If the option is an opt-in, strict may make sense. But surrogateescape is maybe still more "convenient". I don't know at this point.

Nick: it seems like you have a well defined plan. But I dislike it on multiple points. I don't know if it's better to try to convince you to change your PEP, or write a different PEP.

I planned to write such "UTF-8" PEP since 2015, but I never started because the scope is so large that I fear all tiny but annoying corner cases...

malemburg commented 7 years ago

While going for the full locale setting may be a good option, perhaps just focusing on the FS encoding for now is a better way forward (and also more in line with the ticket title).

So essentially go for the PEP-529 approach on Unix as well (except that we use 'ascii' as fallback in legacy mode):

https://www.python.org/dev/peps/pep-0529/

The PEP also includes a section on affected modules, which we could double check (even though the term "FS encoding" implies that only file system relevant APIs are touched by such a change, the encoding is used in several other places as well):

https://www.python.org/dev/peps/pep-0529/#id14

For Windows, a couple of modules such as pwd and nis are not used, so those may need some extra attention.

ncoghlan commented 7 years ago

The trade-offs here are incredibly complex (and are mainly a matter of deciding whose code and configurations we want to break in 3.7+), so I think competing PEPs are going to be better than attempting to create a combined PEP that tries to cover all the options.

That way each PEP can argue as strongly as it can for the respective authors preferred approach to tackling the default C locale problem, even if they point to a common background section in one of the PEPs (similar to the way PEPs 522 and 524 shared a common problem definition, even though they proposed different ways of handling it).

warsaw commented 7 years ago

On Jan 05, 2017, at 11:11 AM, STINNER Victor wrote:

I'm sure that many Linux, UNIX and BSD systems don't have the "C.UTF-8" locale. For example, HP-UX has "C.utf8" which is not exactly "C.UTF-8".

I'm not sure that it's ok in 2017 to always force the UTF-8 encoding if the user locale uses a different encoding.

It's not just any different encoding, it's specifically C (implicitly, C.ASCII).

I proposed an opt-in option to force UTF-8: -X utf8 command line option and PYTHONUTF8=1 env var. Opt-in will obviously reduce the risk of backward compatibility issues. With an opt-in option, users are better prepared for mojibake issues.

If this is true, then I would like a configuration option to default this on. As mentioned, Debian and Ubuntu already have C.UTF-8 and most environments (although not all, see my sbuild/schroot comment earlier) will at least be C.UTF-8. Perhaps it doesn't matter then, but what I really want is that for those few odd outliers (e.g. schroot), Python would act the same inside and out those environments. I really don't want people to have to add that envar or switch (or even export LC_ALL) to get proper build behavior.

vstinner commented 7 years ago

That way each PEP can argue as strongly as it can for the respective authors preferred approach to tackling the default C locale problem, even if they point to a common background section in one of the PEPs (similar to the way PEPs 522 and 524 shared a common problem definition, even though they proposed different ways of handling it).

Ok, same players play again: as PEP 522/524 with Nick and me, I just wrote the PEP-540 "Add a new UTF-8 mode" and Nick wrote the PEP-538 :-D

I started a thread to discuss the PEP on python-ideas: https://mail.python.org/pipermail/python-ideas/2017-January/044089.html

IMHO the PEP-538 should discuss the usage of the surrogateescape error handler: see my second mail in the thread for the details.

I proposed a change in my 3rd mail which would move my PEP closer to Nick's PEP-538: enable "automatically" the UTF-8 mode when the locale is POSIX.

vstinner commented 7 years ago

Working with Docker I often end up with an environment where the locale isn't correctly set.

The locale encoding is controlled by 3 environment variables: LC_ALL, LC_CTYPE and LANG. https://www.python.org/dev/peps/pep-0540/#the-posix-locale-and-its-encoding

Can you please tell me if these variables are set and if yes, give me their value?

I would like to know if it would be possible to change the behaviour of Python when the (LC_CTYPE) locale is POSIX (aka the famous "C" locale).

ncoghlan commented 7 years ago

Docker containers don't have a locale set by default - the approach proposed in PEP-538 actually comes from the way I configure Docker images (which in turn comes from Armin Ronacher's recommendations in click for Python 3 locale handling).

In the Dockerfile for Fedora based containers I add:

ENV LC_ALL=C.UTF-8
ENV LANG=C.UTF-8

while in CentOS 7 based containers I add:

ENV LC_ALL=en_US.UTF-8
ENV LANG=en_US.UTF-8

And with those settings, Python 3 based containers just work (my laptop is running en_AU.UTF-8 locally)

Implementation of the PEP 538: coerce C locale to C.utf-8 #72367