python / cpython

The Python programming language
https://www.python.org

Many regtest failures on Windows with non-ASCII account name #89339

Open a8734683-1b66-4dfd-ba9e-5d023a9cd806 opened 3 years ago

a8734683-1b66-4dfd-ba9e-5d023a9cd806 commented 3 years ago
BPO 45176
Nosy @pfmoore, @vstinner, @tjguk, @ezio-melotti, @zware, @serhiy-storchaka, @eryksun, @zooba

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:
```python
assignee = 'https://github.com/serhiy-storchaka'
closed_at = None
created_at =
labels = ['type-bug', 'OS-windows', '3.10', '3.11', 'tests', 'expert-unicode', '3.9']
title = 'Many regtest failures on Windows with non-ASCII account name'
updated_at =
user = 'https://bugs.python.org/minghua'
```

bugs.python.org fields:
```python
activity =
actor = 'steve.dower'
assignee = 'serhiy.storchaka'
closed = False
closed_date = None
closer = None
components = ['Tests', 'Unicode', 'Windows']
creation =
creator = 'minghua'
dependencies = []
files = []
hgrepos = []
issue_num = 45176
keywords = []
message_count = 9.0
messages = ['401659', '401663', '401665', '401667', '402250', '402288', '402305', '402314', '402324']
nosy_count = 9.0
nosy_names = ['paul.moore', 'vstinner', 'tim.golden', 'ezio.melotti', 'zach.ware', 'serhiy.storchaka', 'eryksun', 'steve.dower', 'minghua']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue45176'
versions = ['Python 3.9', 'Python 3.10', 'Python 3.11']
```

a8734683-1b66-4dfd-ba9e-5d023a9cd806 commented 3 years ago

Background: Since at least Windows 8, it is possible to invoke the input method engine (IME) when installing Windows and creating accounts. So at least among simplified Chinese users, it's not uncommon to have a Chinese account name.

Issue: After successful installation using the 64-bit .exe installer for Windows, just to be paranoid (and to get familiar with Python's test framework), I decided to run the bundled regression tests. To my surprise I got many failures. The following is the summary of "python.exe -m test" with 3.8 some months ago (likely 3.8.6):

371 tests OK.

11 tests failed: test_cmd_line_script test_compileall test_distutils test_doctest test_locale test_mimetypes test_py_compile test_tabnanny test_urllib test_venv test_zipimport_support

43 tests skipped: test_asdl_parser test_check_c_globals test_clinic test_curses test_dbm_gnu test_dbm_ndbm test_devpoll test_epoll test_fcntl test_fork1 test_gdb test_grp test_ioctl test_kqueue test_multiprocessing_fork test_multiprocessing_forkserver test_nis test_openpty test_ossaudiodev test_pipes test_poll test_posix test_pty test_pwd test_readline test_resource test_smtpnet test_socketserver test_spwd test_syslog test_threadsignals test_timeout test_tix test_tk test_ttk_guionly test_urllib2net test_urllibnet test_wait3 test_wait4 test_winsound test_xmlrpc_net test_xxtestfuzz test_zipfile64

Total duration: 59 min 49 sec
Tests result: FAILURE

The failures all look similar, though. It seems Python on Windows assumes that the user's home directory, "C:\Users\<username>\", is encoded in either ASCII or UTF-8, while it is actually in the native Windows code page, in my case cp936 for Simplified Chinese (zh-CN).

To take a couple of examples (these are from recent testing with 3.10.0 rc2):

python.exe -m test -W test_cmd_line_script
0:00:03 Run tests sequentially
0:00:03 [1/1] test_cmd_line_script
[...]
test_consistent_sys_path_for_direct_execution (test.test_cmd_line_script.CmdLineTest) ... ERROR
[...]
test_directory_error (test.test_cmd_line_script.CmdLineTest) ... FAIL
[...]
ERROR: test_consistent_sys_path_for_direct_execution (test.test_cmd_line_script.CmdLineTest)
----------------------------------------------------------------------

Traceback (most recent call last):
File "C:\Programs\Python\python310\lib\test\test_cmd_line_script.py", line 677, in test_consistent_sys_path_for_direct_execution
out_by_name = kill_python(p).decode().splitlines()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 9: invalid start byte
[...]
FAIL: test_directory_error (test.test_cmd_line_script.CmdLineTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\Programs\Python\python310\lib\test\test_cmd_line_script.py", line 268, in test_directory_error
self._check_import_error(script_dir, msg)
File "C:\Programs\Python\python310\lib\test\test_cmd_line_script.py", line 151, in _check_import_error
self.assertIn(expected_msg.encode('utf-8'), err)
AssertionError: b"can't find '__main__' module in 'C:\\\\Users\\\\\xe5<5 bytes redacted>\\\\AppData\\\\Local\\\\Temp\\\\tmpcwkfn9ct'" not found in b"C:\\Programs\\Python\\python310\\python.exe: can't find '__main__' module in 'C:\\\\Users\\\\\xbb<3 bytes redacted>\\\\AppData\\\\Local\\\\Temp\\\\tmpcwkfn9ct'\r\n"
[...]

Ran 44 tests in 29.769s

FAILED (failures=2, errors=5)
test test_cmd_line_script failed
test_cmd_line_script failed (5 errors, 2 failures) in 30.4 sec

== Tests result: FAILURE ==

In the test_directory_error AssertionError message above, I redacted part of the path because my account name is my real name. I hope the issue is clear enough despite the redaction: the "\xe5<5 bytes redacted>" part is 6 bytes and apparently UTF-8 (for two Chinese characters), while the "\xbb<3 bytes redacted>" part is 4 bytes and apparently cp936.
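The byte counts described above can be reproduced with a small sketch. The two-character name used here is a hypothetical stand-in (the real account name is redacted in the report), and the `decodes_as_utf8` helper is only illustrative:

```python
# Hypothetical two-character account name standing in for the redacted one.
name = "\u6d4b\u8bd5"

utf8_bytes = name.encode("utf-8")    # what the test expects to find
cp936_bytes = name.encode("cp936")   # what a child using the ANSI code page emits

print(len(utf8_bytes), len(cp936_bytes))  # 6 4

# The same path encoded two ways: a bytes substring check cannot match,
# which mirrors the assertIn() failure in test_directory_error.
expected = f"C:\\Users\\{name}".encode("utf-8")
actual = f"C:\\Users\\{name}".encode("cp936")
print(expected in actual)  # False


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data is valid UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Decoding the cp936 bytes as UTF-8 fails, which mirrors the
# UnicodeDecodeError raised by kill_python(p).decode() in the report.
print(decodes_as_utf8(cp936_bytes))  # False
```

With this stand-in name, both failure modes from the log appear: the substring assertion cannot succeed, and a blind `.decode()` raises.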

Postscript: As I've said above, I discovered this issue some time ago, but only now have time to report it. I believe I've seen these failures in 3.8.2/6, 3.9.7, and 3.10.0 rc2. It shouldn't be hard to reproduce for people who have a way to create an account with a non-ASCII name on Windows. If reproducing turns out to be difficult though, I'm happy to provide more information and/or run more tests.

eryksun commented 3 years ago

In Windows, the standard I/O encoding of the spawn_python() child defaults to the process active code page, i.e. GetACP(). In Windows 10, the active code page can be set to UTF-8 at the system or application level, but most systems and applications still use a legacy code page. Python's default can be overridden to UTF-8 for standard I/O via PYTHONIOENCODING, or for all I/O via PYTHONUTF8 or "-X utf8=1". I would recommend using one of these UTF-8 options instead of trying to make a test work with the legacy code page. There is no guarantee, and should be no guarantee, that a filesystem path, which is Unicode, can be encoded using a legacy code page.
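As a rough illustration of the overrides mentioned above, the following sketch asks a child interpreter which encoding it uses for stdout. The helper name `child_stdout_encoding` and its parameters are made up for this example; only `PYTHONIOENCODING`, `PYTHONUTF8`, and `-X utf8` come from the comment:

```python
import codecs
import os
import subprocess
import sys


def child_stdout_encoding(options=(), env_overrides=None):
    """Return the normalized name of the child's sys.stdout encoding."""
    env = dict(os.environ)
    # Drop inherited settings so each call shows only the override under test.
    env.pop("PYTHONIOENCODING", None)
    env.pop("PYTHONUTF8", None)
    env.update(env_overrides or {})
    cmd = [sys.executable, *options, "-c", "import sys; print(sys.stdout.encoding)"]
    out = subprocess.run(cmd, env=env, capture_output=True, check=True).stdout
    # The encoding name itself is ASCII; codecs.lookup() normalizes spelling.
    return codecs.lookup(out.decode("ascii").strip()).name


if __name__ == "__main__":
    print("default:         ", child_stdout_encoding())
    print("PYTHONIOENCODING:", child_stdout_encoding(env_overrides={"PYTHONIOENCODING": "utf-8"}))
    print("PYTHONUTF8=1:    ", child_stdout_encoding(env_overrides={"PYTHONUTF8": "1"}))
    print("-X utf8:         ", child_stdout_encoding(options=["-X", "utf8"]))
```

On a Windows system with a legacy code page, the first line would be expected to differ from the other three; on systems that already default to UTF-8 all four may agree.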

a8734683-1b66-4dfd-ba9e-5d023a9cd806 commented 3 years ago

Eryk Sun (eryksun) posted:

> Python's default can be overridden to UTF-8 for standard I/O via PYTHONIOENCODING, or for all I/O via PYTHONUTF8 or "-X utf8=1".

FWIW, I did test with "-X utf8" option and it wasn't any better. Just tested "python.exe -X utf8=1 -m test -W test_cmd_line_script" with 3.10.0 rc2 again, and got 6 errors and 2 failures this way (1 more error than without "-X utf8=1"). There is also this new error message:

0:00:01 [1/1] test_cmd_line_script
Warning -- Uncaught thread exception: UnicodeDecodeError
Exception in thread Thread-60 (_readerthread):
Traceback (most recent call last):
  File "C:\Programs\Python\python310\lib\threading.py", line 1009, in _bootstrap_inner
    self.run()
  File "C:\Programs\Python\python310\lib\threading.py", line 946, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Programs\Python\python310\lib\subprocess.py", line 1494, in _readerthread
    buffer.append(fh.read())
  File "C:\Programs\Python\python310\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 69: invalid start byte

eryksun commented 3 years ago

> FWIW, I did test with "-X utf8" option

I was suggesting to modify the tests to use the UTF-8 mode option in the spawn_python() command line. It doesn't help to run the parent process in UTF-8 mode since it isn't inherited. It could be inherited via PYTHONUTF8, but it turns out that environment variables won't help in this case due to the use of the -E and -I command-line options.
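The difference between the environment variable and the command-line option can be checked with a sketch like the one below. The helper `child_utf8_mode` is hypothetical; it reports `sys.flags.utf8_mode` from a child interpreter started with the given options:

```python
import os
import subprocess
import sys


def child_utf8_mode(options=(), env_overrides=None):
    """Return sys.flags.utf8_mode as seen by a child interpreter."""
    env = dict(os.environ)
    env.pop("PYTHONUTF8", None)  # start from a known state
    env.update(env_overrides or {})
    cmd = [sys.executable, *options, "-c", "import sys; print(sys.flags.utf8_mode)"]
    out = subprocess.run(cmd, env=env, capture_output=True, check=True).stdout
    return int(out.decode("ascii").strip())


if __name__ == "__main__":
    # PYTHONUTF8 is honored when the child reads its environment...
    print(child_utf8_mode(env_overrides={"PYTHONUTF8": "1"}))
    # ...but -I (like -E) makes the child ignore PYTHON* variables, so the
    # result here falls back to whatever the platform default is.
    print(child_utf8_mode(options=["-I"], env_overrides={"PYTHONUTF8": "1"}))
    # The command-line option still applies in isolated mode.
    print(child_utf8_mode(options=["-I", "-X", "utf8"]))
```

The second call is the interesting one: its result depends on the platform default rather than on the variable, which is why the option has to go on the child's command line.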

zooba commented 3 years ago

I'd guess that these tests are assuming that sys.executable contains only ASCII characters. All the tests run in a non-ASCII working directory, so it's only the runtime that is not tested properly here.

The easiest way for Ming Hua to test this is to install for all users (into Program Files), and run tests with the same user account.

If that's the case, we probably have to just go through the tests and make them Unicode-aware.

a8734683-1b66-4dfd-ba9e-5d023a9cd806 commented 3 years ago

Steve Dower (steve.dower) posted:

> I'd guess that these tests are assuming that sys.executable contains only ASCII characters. All the tests run in a non-ASCII working directory, so it's only the runtime that is not tested properly here.
>
> The easiest way for Ming Hua to test this is to install for all users (into Program Files), and run tests with the same user account.

I've already installed for all users, just not into the default "C:\Program Files\", but instead "C:\Programs\Python\". I don't think it's the executable's path that is problematic, but the temporary directory where the tests are run (%LOCALAPPDATA%\Temp\tmpcwkfn9ct, where %LOCALAPPDATA% is C:\Users\<account name>\AppData\Local and therefore contains non-ASCII characters).

Both of these paths are shown in the error/failure logs posted in the first message.

I doubt installing into "C:\Program Files\" would make a difference.
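To see which of these locations is non-ASCII on a given machine, a quick check along these lines may help. The set of candidate paths is my own selection for illustration and is not taken from the thread:

```python
import os
import sys
import tempfile


def non_ascii_paths():
    """Return the candidate paths that contain non-ASCII characters."""
    candidates = {
        "sys.executable": sys.executable or "",
        "temp dir": tempfile.gettempdir(),
        "current dir": os.getcwd(),
        "home dir": os.path.expanduser("~"),
    }
    return {label: path for label, path in candidates.items() if not path.isascii()}


if __name__ == "__main__":
    found = non_ascii_paths()
    if found:
        for label, path in found.items():
            print(f"{label}: {path!a}")  # !a keeps the output ASCII-safe
    else:
        print("all candidate paths are ASCII")
```

In the situation described here, the executable path would be expected to come out clean while the temporary directory would be reported.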

serhiy-storchaka commented 3 years ago

Not only sys.executable. Sources of non-ASCII paths:

The last one is the most common in these failures.

Tests fail when a non-ASCII path is written to the stdout or a file with the default encoding (which differs from the filesystem encoding) and then read with implying:

Fixing tests is not enough, because it is often an issue of scripts which write paths to the stdout. This problem does not have a simple and general solution.

eryksun commented 3 years ago

I see no problem with changing a test -- such as test_consistent_sys_path_for_direct_execution() -- to spawn the child interpreter with -X utf8 when the I/O encoding itself is irrelevant to the test -- except for forcing a common Unicode encoding to ensure the integrity of test data (i.e. no mojibake) and prevent encoding/decoding failures.
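A minimal sketch of that approach, assuming nothing about the real test helpers: the function `echo_via_utf8_child` below is hypothetical and uses `subprocess` directly rather than `spawn_python()`. It forces UTF-8 in the child and decodes with the same encoding in the parent:

```python
import subprocess
import sys


def echo_via_utf8_child(text):
    """Have a child interpreter print text and return what the parent reads back."""
    # ascii() yields an escaped literal, so the command line itself stays ASCII.
    code = f"print({ascii(text)})"
    proc = subprocess.run(
        [sys.executable, "-X", "utf8", "-c", code],
        capture_output=True,
        check=True,
    )
    # The child was forced into UTF-8 mode, so decoding as UTF-8 is safe
    # regardless of the active code page.
    return proc.stdout.decode("utf-8").rstrip("\r\n")


if __name__ == "__main__":
    # Hypothetical path with a non-ASCII component.
    path = "C:\\Users\\\u6d4b\u8bd5\\AppData\\Local\\Temp"
    print(echo_via_utf8_child(path) == path)
```

Because both sides agree on the encoding, the comparison no longer depends on the active code page.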

zooba commented 3 years ago

> I've already installed for all users, just not into the default "C:\Program Files\", but instead "C:\Programs\Python\"

Ah yes, that indeed rules out my first suspicion.

> Fixing tests is not enough, because it is often an issue of scripts which write paths to the stdout.

Sure, but the ones currently failing here are ours, so we are the ones who need to fix them :) And they all seem to be in our test suite.

Fixing the tests doesn't make all the problems go away, just the specific ones we are responsible for on this issue.

vstinner commented 12 months ago

I fixed a few issues recently. What's the status of this issue on Python 3.13 (main branch)?

vstinner commented 12 months ago

See also issue GH-69368.