python / cpython

The Python programming language
https://www.python.org

windows console doesn't print or input Unicode #45943

Closed a09e2537-b6b9-4978-9d1d-78b1db358cbc closed 8 years ago

a09e2537-b6b9-4978-9d1d-78b1db358cbc commented 16 years ago
BPO 1602
Nosy @malemburg, @mhammond, @terryjreedy, @pfmoore, @amauryfa, @ncoghlan, @pitrou, @giampaolo, @tjguk, @mark-summerfield, @ned-deily, @ezio-melotti, @florentx, @4kir4, @lilydjwg, @berkerpeksag, @vadmium, @eryksun, @zooba, @davispuh
Superseder
  • bpo-28217: Add interactive console tests
  • Files
  • sys_write_stdout.patch
  • unicode2.py
  • doc-patch.diff: Proposed changes to user-visible documentation
  • unicode3.py
  • win_console.patch
  • test_win_console.py
  • streams.py
  • wincontest.py: Example io.TextIOWrapper subclass using WideCharToMultiByte
  • winconsoleio.diff
  • 1602_2.patch
  • 1602_3.patch
  • 1602_4.patch
  • 1602_5.patch
  • 1602_6.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields: ```python assignee = 'https://github.com/zooba' closed_at = created_at = labels = ['type-bug', 'expert-unicode', 'OS-windows'] title = "windows console doesn't print or input Unicode" updated_at = user = 'https://github.com/mark-summerfield' ``` bugs.python.org fields: ```python activity = actor = 'THRlWiTi' assignee = 'steve.dower' closed = True closed_date = closer = 'steve.dower' components = ['Unicode', 'Windows'] creation = creator = 'mark' dependencies = [] files = ['19493', '20320', '20363', '23461', '23470', '23471', '36120', '40990', '44094', '44290', '44379', '44409', '44449', '44452'] hgrepos = [] issue_num = 1602 keywords = ['patch'] message_count = 148.0 messages = ['58487', '58621', '58651', '87086', '88059', '88077', '92854', '94445', '94480', '94483', '94496', '108173', '108228', '116801', '120414', '120415', '120416', '120700', '125823', '125824', '125826', '125833', '125852', '125877', '125889', '125890', '125898', '125899', '125938', '125942', '125947', '125956', '126286', '126288', '126303', '126304', '126308', '126319', '127782', '131657', '131854', '132060', '132061', '132062', '132064', '132065', '132067', '132184', '132191', '132208', '132266', '132268', '145898', '145899', '145963', '145964', '146471', '148990', '157569', '160812', '160813', '160897', '161151', '161153', '161308', '161651', '164572', '164578', '164580', '164618', '164619', '170899', '170915', '170999', '185135', '197700', '197751', '197752', '197773', '221175', '221178', '223403', '223404', '223507', '223509', '223945', '223946', '223947', '223948', '223949', '223951', '223952', '224019', '224086', '224095', '224596', '224605', '224690', '227329', '227330', '227332', '227333', '227337', '227338', '227347', '227354', '227373', '227374', '227441', '227450', '228191', '228208', '228210', '233347', '233350', '233916', '233937', '234019', '234020', '234096', '234371', '242884', '254405', '254407', '272596', '272605', '272645', '272662', '272675', '272716', 
'272718', '272720', '273999', '274449', '274673', '274884', '274906', '274912', '274939', '275003', '275004', '275005', '275157', '275362', '275510', '277047', '277048', '277050'] nosy_count = 38.0 nosy_names = ['lemburg', 'mhammond', 'terry.reedy', 'paul.moore', 'tzot', 'amaury.forgeotdarc', 'ncoghlan', 'pitrou', 'giampaolo.rodola', 'tim.golden', 'mark', 'ned.deily', 'christoph', 'ezio.melotti', 'v+python', 'hippietrail', 'flox', 'THRlWiTi', 'davidsarah', 'santoso.wijaya', 'akira', 'David.Sankel', 'python-dev', 'smerlin', 'lilydjwg', 'berker.peksag', 'martin.panter', 'piotr.dobrogost', 'eryksun', 'Drekin', 'steve.dower', 'wiz21', 'stijn', 'Jonitis', 'gurnec', 'escapewindow', 'dead1ne', 'davispuh'] pr_nums = [] priority = 'high' resolution = 'fixed' stage = 'resolved' status = 'closed' superseder = '28217' type = 'behavior' url = 'https://bugs.python.org/issue1602' versions = ['Python 3.6'] ```

    a09e2537-b6b9-4978-9d1d-78b1db358cbc commented 16 years ago

    I am not sure if this is a Python bug or simply a limitation of cmd.exe.

    I am using Windows XP Home. I run cmd.exe with the /u option and I have set my console font to "Lucida Console" (the only TrueType font offered), and I run chcp 65001 to set the utf8 code page. When I run the following program:

    for x in range(32, 2000):
        print("{0:5X} {0:c}".format(x))

    one blank line is output.

    But if I do chcp 1252 the program prints up to 7F before hitting a unicode encoding error.

    This is different behaviour from Python 2.5.1 which (with a suitably modified print line) after chcp 65001 prints up to 7F and then fails with "IOError: [Errno 0] Error".
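
The cut-off at 7F under code page 1252 can be reproduced without a console, since it comes from the codec rather than from cmd.exe. A minimal sketch, assuming Python 3; the helper name `first_unencodable` is hypothetical and not part of the original report:

```python
def first_unencodable(encoding, start=32, stop=2000):
    """Return the first code point in [start, stop) that `encoding` rejects, or None."""
    for code_point in range(start, stop):
        try:
            chr(code_point).encode(encoding)
        except UnicodeEncodeError:
            return code_point
    return None


if __name__ == "__main__":
    # cp1252 has no mapping for U+0080, so the loop stops right after 7F.
    for name in ("cp1252", "utf-8"):
        print(name, first_unencodable(name))
```

With a UTF-8 output encoding nothing in this range is rejected, which is why the remaining failures in this thread point at the console layer rather than at the codec.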

    a09e2537-b6b9-4978-9d1d-78b1db358cbc commented 16 years ago

    I've looked into this a bit more, and from what I can see, code page 65001 just doesn't work---so it is a Windows problem not a Python problem. A possible solution might be to read/write UTF16 which "managed" Windows applications can do.

    tiran commented 16 years ago

    We are aware of multiple Windows-related problems. We are planning to rewrite parts of the Windows-specific API to use the widechar variants. Maybe that will help.

    pitrou commented 15 years ago

    Yes, it is a Windows problem. There simply doesn't seem to be a true Unicode codepage for command-line apps. Recommend closing.

    67e002fe-4fda-4dd4-b591-0911d0f7ec88 commented 15 years ago

    Just in case it helps, this behaviour is on Win XP Pro, Python 2.5.1:

    First, I added an alias for 'cp65001' to 'utf_8' in Lib/encodings/aliases.py .

    Then, I opened a command prompt with a bitmap font.

    c:\windows\system32>python
    Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit
    (Intel)] on
    win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print u"\N{EM DASH}"
    —

    I switched the font to Lucida Console, and retried (without exiting the python interpreter, although the behaviour is the same when exiting and entering again: )

    >>> print u"\N{EM DASH}"
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    IOError: [Errno 13] Permission denied

    Then I tried (by pressing Alt+0233 for é, which is invalid in my normal cp1253 codepage):

    >>> print u"née"

    and the interpreter exits without any message. The same happens for:

    >>> a=u"née"

    Then I created a UTF-8 text file named 'test65001.py':

    # -*- coding: utf_8 -*-
    a=u"néeα"
    print a

    and tried to run it directly from the command line:

    c:\windows\system32>python d:\src\PYTHON\test65001.py
    néeαTraceback (most recent call last):
      File "d:\src\PYTHON\test65001.py", line 4, in <module>
        print a
    IOError: [Errno 2] No such file or directory

    You see? It printed all the characters before failing.

    Also the following works:

    c:\windows\system32>echo heéε heéε

    and

    c:\windows\system32>echo heéε >D:\src\PYTHON\dummy.txt

    creates successfully a UTF-8 file (without any UTF-8 BOM marks at the beginning).

    So it's possible that it is a python bug, or at least something can be done about it.

    amauryfa commented 15 years ago

    an immediate thing to do is to declare cp65001 as an encoding:

    Index: Lib/encodings/aliases.py
    ===================================================================

    --- Lib/encodings/aliases.py    (revision 72757)
    +++ Lib/encodings/aliases.py    (working copy)
    @@ -511,6 +511,7 @@
         'utf8'               : 'utf_8',
         'utf8_ucs2'          : 'utf_8',
         'utf8_ucs4'          : 'utf_8',
    +    'cp65001'            : 'utf_8',
     ## uu_codec codec
     #'uu'                 : 'uu_codec',

    This is not enough unfortunately, because the win32 API function WriteFile() returns the number of characters written, not the number of (utf8) bytes:

    >>> print("\u0124\u0102" + 'abc')
    ĤĂabc
    c
    [44420 refs]
    >>>
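
The stray `c` line in the transcript above is consistent with a write loop that treats a character count as a byte count. A minimal simulation, assuming a hypothetical `fake_console_write` that stands in for the misbehaving write call (it is not a real API and does not touch the console):

```python
def fake_console_write(chunk, sink):
    """Hypothetical stand-in: accepts every byte but reports decoded characters."""
    sink.append(chunk)
    return len(chunk.decode("utf-8", errors="replace"))


def write_all(data, sink):
    """A typical retry loop that trusts the return value to be a byte count."""
    offset = 0
    while offset < len(data):
        offset += fake_console_write(data[offset:], sink)


sink = []
# 8 bytes, 6 characters: the loop believes 2 bytes are still pending.
write_all("\u0124\u0102abc\n".encode("utf-8"), sink)
shown = b"".join(sink).decode("utf-8")
```

In this simulation `shown` contains the full line followed by a second line holding the last `c`, matching the duplicated tail seen in the console.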

    Additionally, there is a bug in ReadFile: it returns an empty string (and no error) when a non-ASCII character is entered, which is exactly how an EOF condition looks to the caller...

    Maybe the solution is to use the win32 console API directly...

    67e002fe-4fda-4dd4-b591-0911d0f7ec88 commented 15 years ago

    Another note: if one creates a dummy Stream object (having a softspace attribute and a write method that writes using os.write, as in http://stackoverflow.com/questions/878972/windows-cmd-encoding-change-causes-python-crash/1432462#1432462 ) to replace sys.stdout and sys.stderr, then writes occur correctly, without issues. Prerequisites: chcp 65001, the Lucida Console font, and cp65001 as an alias for UTF-8 in encodings/aliases.py. This is Python 2.5.4 on Windows.
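
A rough Python 3 rendering of that idea, as a sketch only: the class name `RawUTF8Stream` is hypothetical, the `softspace` attribute is omitted because Python 3's print() does not use it, and whether the bytes display correctly still depends on the console code page and font.

```python
import os


class RawUTF8Stream:
    """Minimal stream that bypasses the buffered layers and calls os.write directly."""

    def __init__(self, fd):
        self.fd = fd

    def write(self, text):
        data = text.encode("utf-8")
        while data:
            # os.write may accept fewer bytes than offered, so loop until done.
            written = os.write(self.fd, data)
            data = data[written:]
        return len(text)

    def flush(self):
        # Nothing is buffered here, so there is nothing to flush.
        pass
```

It would be installed with something like `sys.stdout = RawUTF8Stream(1)`; that replacement is left out of the block so that it has no side effects when pasted.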

    b5a9ce10-d67f-478f-ab78-b08d099eb753 commented 14 years ago

    With Python 3.1.1, the following batch file seems to be necessary to use UTF-8 successfully from an XP console:

    set PYTHONIOENCODING=UTF-8
    cmd /u /k chcp 65001
    set PYTHONIOENCODING=
    exit

    the cmd line seems to be necessary because of Windows having compatibility issues, but it seems that Python should notice the cp65001 and not need the PYTHONIOENCODING stuff.
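
The effect of PYTHONIOENCODING can be checked from Python itself by starting a child interpreter with and without the variable. A small sketch, assuming Python 3; the helper name `child_stdout_encoding` is made up for illustration:

```python
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding=None):
    """Start a child interpreter and report the encoding it picked for sys.stdout."""
    env = dict(os.environ)
    env.pop("PYTHONIOENCODING", None)
    if io_encoding is not None:
        env["PYTHONIOENCODING"] = io_encoding
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print("default:", child_stdout_encoding())
    print("forced: ", child_stdout_encoding("utf-8"))
```

The child's stdout is a pipe here, so the "default" value shows the redirected-output case, not what an interactive console would report.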

    a09e2537-b6b9-4978-9d1d-78b1db358cbc commented 14 years ago

    Glenn Linderman's fix pretty well works for me on XP Home. I can print every Unicode character up to and including U+D7FF (although most just come out as rectangles, at least I don't get encoding errors).

    It fails at U+D800 with message:

    UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 17: surrogates not allowed

    I also tried U+D801 and got the same error.

    Nonetheless, this is *much* better than before.

    malemburg commented 14 years ago

    Mark Summerfield wrote:

    Mark Summerfield <mark@qtrac.eu> added the comment:

    Glenn Linderman's fix pretty well works for me on XP Home. I can print every Unicode character up to and including U+D7FF (although most just come out as rectangles, at least I don't get encoding errors).

    It fails at U+D800 with message:

    UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 17: surrogates not allowed

    I also tried U+D801 and got the same error.

    That's normal and expected: D800 is the start of the surrogate range. Surrogates are only meaningful in pairs (as used by UTF-16), so a lone surrogate cannot be encoded in UTF-8.
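
The boundary can be checked directly. A minimal sketch, assuming Python 3; `utf8_encodable` is a hypothetical helper name:

```python
def utf8_encodable(ch):
    """Return True if the string can be encoded with the strict UTF-8 codec."""
    try:
        ch.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


# A high/low surrogate pair, as it would appear in UTF-16 data.
pair = "\ud83d\ude00"
# Round-tripping through UTF-16 joins the pair into one code point.
combined = pair.encode("utf-16-le", "surrogatepass").decode("utf-16-le")

if __name__ == "__main__":
    print(utf8_encodable("\ud7ff"), utf8_encodable("\ud800"), hex(ord(combined)))
```

U+D7FF is the last code point before the surrogate block, which is why the loop in the report runs cleanly up to that point.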

    b5a9ce10-d67f-478f-ab78-b08d099eb753 commented 14 years ago

    Choosing the Lucida Console or the Consolas font cures most of the rectangle problems. Those are just a limitation of the font selected for the console window.

    576fdecd-6e0f-4bb1-b761-7653a4759cf1 commented 14 years ago

    Will this bug be tackled for Python 2.7?

    And is there a way to get hold of the access denied error?

    Here are my steps to reproduce:

    I started the console with "cmd /u /k chcp 65001"


    Aktive Codepage: 65001.

    C:\Dokumente und Einstellungen\root>set PYTHONIOENCODING=UTF-8

    C:\Dokumente und Einstellungen\root>d:

    D:\>cd Python31

    D:\Python31>python
    Python 3.1.2 (r312:79149, Mar 21 2010, 00:41:52) [MSC v.1500 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print("\u573a")
    场
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    IOError: [Errno 13] Permission denied
    >>>
    _______________________________________________________________________

    I see a rectangle on screen but obviously c&p works.

    vstinner commented 14 years ago

    Maybe the solution is to use the win32 console API directly...

    Yes, it is the best solution because it avoids the horrible mbcs encoding.

    About cp65001: it is not *exactly* the same encoding as utf-8 and so it cannot be used as an alias to utf-8: see issue bpo-6058.

    83d2e70e-e599-4a04-b820-3814bbdb9bef commented 14 years ago

    @Brian/Tim what's your take on this?

    vstinner commented 13 years ago

    I wrote a small function to call WriteConsoleOutputA() and WriteConsoleOutputW() in Python to do some tests. It works correctly, except if I change the code page using the chcp command. It looks like the problem is that the chcp command changes both the console code page and the ANSI code page, but it should only change the ANSI code page (and not the console code page).

    chcp command
    ============

    The chcp command changes the console code page, but in practice, the console still expects the OEM code page (eg. cp850 on my french setup). Example:

    C:\...> python.exe -c "import sys; print(sys.stdout.encoding)"
    cp850
    C:\...> chcp 65001
    C:\...> python.exe
    Fatal Python error: Py_Initialize: can't initialize sys standard streams
    LookupError: unknown encoding: cp65001
    C:\...> SET PYTHONIOENCODING=utf-8
    C:\...> python.exe
    >>> import sys
    >>> sys.stdout.write("\xe9\n")
    é
    2
    >>> sys.stdout.buffer.write("\xe9\n".encode("utf8"))
    é
    3
    >>> sys.stdout.buffer.write("\xe9\n".encode("cp850"))
    é
    2

    os.device_encoding(1) uses GetConsoleOutputCP() which gives 65001. It should maybe use GetOEMCP() instead? Or chcp command should be fixed?
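
The values being compared here can be printed side by side. A sketch, assuming Python 3 with ctypes; the function name `console_code_pages` and the dictionary keys are made up, while GetConsoleOutputCP, GetConsoleCP, GetOEMCP and GetACP are the kernel32 functions in question. On other platforms only `os.device_encoding` is reported.

```python
import os
import sys


def console_code_pages():
    """Collect the encodings and code pages that the discussion compares."""
    info = {"device_encoding": os.device_encoding(1)}  # None if fd 1 is not a terminal
    if sys.platform == "win32":
        import ctypes

        kernel32 = ctypes.windll.kernel32
        info["console_output_cp"] = kernel32.GetConsoleOutputCP()
        info["console_input_cp"] = kernel32.GetConsoleCP()
        info["oem_cp"] = kernel32.GetOEMCP()
        info["ansi_cp"] = kernel32.GetACP()
    return info


if __name__ == "__main__":
    for key, value in sorted(console_code_pages().items()):
        print(key, value)
```

Running it before and after `chcp 65001` shows which of the values the command actually changes.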

    Setting the console code page looks to be a bad idea, because if I type "é" using my keyboard, a random character (eg. U+0002) is displayed instead...

    WriteConsoleOutputA() and WriteConsoleOutputW()
    ===============================================

    Without touching the code page
    ------------------------------

    If the character can be rendered by the current font (eg. U+00E9): WriteConsoleOutputA() and WriteConsoleOutputW() work correctly.

    If the character cannot be rendered by the current font, but there is a replacement character (eg. U+0141 replaced by U+0041): WriteConsoleOutputA() cannot be used (U+0141 cannot be encoded to the code page), WriteConsoleOutputW() writes U+0141 but the console contains U+0041 (I checked using ReadConsoleOutputW()) and U+0041 is displayed. It works like the mbcs encoding, the behaviour looks correct.

    If the character cannot be rendered by the current font and there is no replacement character (eg. U+042D): WriteConsoleOutputA() cannot be used (U+042D cannot be encoded to the code page), WriteConsoleOutputW() writes U+042D but U+003F (?) is displayed instead. The behaviour looks correct.

    chcp 65001
    ----------

    Using "chcp 65001" command (+ "set PYTHONIOENCODING=utf-8" to avoid the fatal error), it becomes worse: the result depends on the font...

    Using raster font:

    Using Lucida (TrueType font):

    vstinner commented 13 years ago

    sys_write_stdout.patch: Create sys.write_stdout() test function to call WriteConsoleOutputA() or WriteConsoleOutputW() depending on the input types (bytes or str).

    67e002fe-4fda-4dd4-b591-0911d0f7ec88 commented 13 years ago

    http://blogs.msdn.com/b/michkap/archive/2008/03/18/8306597.aspx

    If you want any kind of Unicode output in the console, the font must be an “official” MS console TTF (“official” as defined by the Windows version); I believe only Lucida Console and Consolas are the ones with all MS private settings turned on inside the font file.

    vstinner commented 13 years ago

    I don't understand exactly the goal of this issue. Different people described various bugs of the Windows console, but I don't see any problem with Python here. It looks like it's just not possible to display correctly unicode with the Windows console (the whole unicode charset, not the current code page subset).

    To me, there is nothing to do, and so I close the bug.

    If you would like to fix a particular Python bug, open a new specific issue. If you consider that I'm wrong, Python should fix this issue and you know how, please reopen it.

    64b73bf9-3de7-49cd-8c73-5a4198eb3429 commented 13 years ago

    It is certainly possible to write Unicode to the console successfully using WriteConsoleW. This works regardless of the console code page, including 65001. The code at http://tahoe-lafs.org/trac/tahoe-lafs/browser/src/allmydata/windows/fixups.py does so (it's for Python 2.x, but you'd be calling WriteConsoleW from C anyway).

    WriteConsoleW has one bug that I know of, which is that it fails when writing more than 26608 characters at once (see http://tahoe-lafs.org/trac/tahoe-lafs/ticket/1232 ). That's easy to work around by limiting the amount of data passed in a single call.

    Fonts are not Python's problem, but encoding is. It doesn't make sense to fail to output the right characters just because some users might not have selected fonts that can display those characters. This bug should be reopened.

    (For completeness, it is possible to display Unicode on the console using fonts other than Lucida Console and Consolas, but it requires a registry hack: http://stackoverflow.com/questions/878972/windows-cmd-encoding-change-causes-python-crash/3259271#3259271 .)
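
A minimal ctypes sketch of the chunked WriteConsoleW approach described in this comment. It is an illustration under stated assumptions, not the code from fixups.py: the names `utf16_chunks`, `write_console` and `SAFE_UNITS` are hypothetical, the 10000-unit limit is an arbitrary conservative choice, and the sketch neither checks that stdout really is a console nor retries partial writes. On non-Windows platforms it simply falls back to sys.stdout.

```python
import sys

SAFE_UNITS = 10000  # assumed safe chunk size, well below the reported failure point
STD_OUTPUT_HANDLE = 0xFFFFFFF5  # (DWORD)-11


def utf16_chunks(text, limit=SAFE_UNITS):
    """Yield pieces of at most `limit` UTF-16 code units, never splitting a character."""
    piece, units = [], 0
    for ch in text:
        width = 2 if ord(ch) > 0xFFFF else 1  # non-BMP characters need a surrogate pair
        if piece and units + width > limit:
            yield "".join(piece)
            piece, units = [], 0
        piece.append(ch)
        units += width
    if piece:
        yield "".join(piece)


def write_console(text):
    """Write text with WriteConsoleW on Windows; plain sys.stdout elsewhere."""
    if sys.platform != "win32":
        sys.stdout.write(text)
        return
    import ctypes
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetStdHandle.argtypes = [wintypes.DWORD]
    kernel32.GetStdHandle.restype = wintypes.HANDLE
    kernel32.WriteConsoleW.argtypes = [
        wintypes.HANDLE, wintypes.LPCWSTR, wintypes.DWORD,
        wintypes.LPDWORD, wintypes.LPVOID,
    ]
    kernel32.WriteConsoleW.restype = wintypes.BOOL
    handle = kernel32.GetStdHandle(STD_OUTPUT_HANDLE)
    written = wintypes.DWORD()
    for chunk in utf16_chunks(text):
        # The length argument is counted in UTF-16 code units, not Python characters.
        units = len(chunk.encode("utf-16-le", "surrogatepass")) // 2
        if not kernel32.WriteConsoleW(handle, chunk, units, ctypes.byref(written), None):
            raise ctypes.WinError(ctypes.get_last_error())
```

Counting UTF-16 code units rather than Python characters matters because WriteConsoleW's length argument is expressed in wide characters.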

    b5a9ce10-d67f-478f-ab78-b08d099eb753 commented 13 years ago

    Interesting!

    I was able to tweak David-Sarah's code to work with Python 3.x, mostly doing things that 2to3 would probably do: changing unicode() to str(), dropping u from u'...', etc.

    I skipped the unmangling of command-line arguments, because it produced an error I didn't understand, about needing a buffer protocol. But I'll attach David-Sarah's code + tweaks + a test case showing output of the Cyrillic alphabet to a console with code page 437 (at least, on my Win7-64 box, that is what it is).

    Nice work, David-Sarah. I'm quite sure this is not in a form usable inside Python 3, but it shows exactly what could be done inside Python 3 to make things work... and gives us a workaround if Python 3 is not fixed.

    64b73bf9-3de7-49cd-8c73-5a4198eb3429 commented 13 years ago

    Glenn Linderman wrote:

    I skipped the unmangling of command-line arguments, because it produced an error I didn't understand, about needing a buffer protocol.

    If I understand correctly, that part isn't needed on Python 3 because bpo-2128 is already fixed there.

    vstinner commented 13 years ago

    It is certainly possible to write Unicode to the console successfully using WriteConsoleW

    Did you try with characters not encodable to the code page and with characters that cannot be rendered by the font?

    See msg120414 for my tests with WriteConsoleOutputW.

    64b73bf9-3de7-49cd-8c73-5a4198eb3429 commented 13 years ago

    haypo wrote:

    davidsarah wrote:

    > It is certainly possible to write Unicode to the console successfully using WriteConsoleW

    Did you try with characters not encodable to the code page and with characters that cannot be rendered by the font?

    Yes, characters not encodable to the code page do work (as confirmed by Glenn Linderman, since code page 437 does not include Cyrillic).

    Characters that cannot be rendered by the font print as missing-glyph boxes, as expected. They don't cause any other problem, and they can be cut-and-pasted to other Unicode-aware applications, showing up as the original characters.

    See msg120414 for my tests with WriteConsoleOutputW

    Even if it handled encoding correctly, WriteConsoleOutputW (http://msdn.microsoft.com/en-us/library/ms687404%28v=vs.85%29.aspx) would not be the right API to use in any case, because it prints to a rectangle of characters without scrolling. WriteConsoleW does scroll in the same way that printing to a console output stream normally would. (Redirection to a non-console stream can be detected and handled differently, as the code in unicode2.py does.)

    b5a9ce10-d67f-478f-ab78-b08d099eb753 commented 13 years ago

    I would certainly be delighted if someone would reopen this issue, and figure out how to translate unicode2.py to Python internals so that Python's console I/O on Windows would support Unicode "out of the box".

    Otherwise, I'll have to include the equivalent of unicode2.py in all my Python programs, because right now, I'm including instructions for the user to (1) choose the Lucida Console or Consolas font if they can't figure out any other font that gets rid of the square boxes, (2) run chcp 65001, and (3) set PYTHONIOENCODING=UTF-8.

    Having this capability inside Python (or my programs) will enable me to eliminate two-thirds of the geeky instructions for my users. But it seems like a very appropriate capability to have within Python, especially Python 3.x with its preference for and support of Unicode in so many other ways.

    64b73bf9-3de7-49cd-8c73-5a4198eb3429 commented 13 years ago

    I'll have a look at the Py3k I/O internals and see what I can do. (Reopening a bug appears to need Coordinator permissions.)

    tjguk commented 13 years ago

    Reopening as there seems to be some possibility of progress

    amauryfa commented 13 years ago

    The script unicode2.py uses the console STD_OUTPUT_HANDLE iff sys.stdout.fileno()==1. But is it always the case? What about pythonw.exe? Also some applications may redirect fd=1: I'm sure that py.test does this http://pytest.org/capture.html#setting-capturing-methods-or-disabling-capturing and IIRC Apache also redirects file descriptors.

    vstinner commented 13 years ago

    amaury> The script unicode2.py uses the console STD_OUTPUT_HANDLE iff sys.stdout.fileno()==1

    Interesting article about the Windows console: http://blogs.msdn.com/b/michkap/archive/2008/03/18/8306597.aspx

    There is an example which has many tests to check that stdout is the windows console (and not a pipe or something else).

    64b73bf9-3de7-49cd-8c73-5a4198eb3429 commented 13 years ago

    The script unicode2.py uses the console STD_OUTPUT_HANDLE iff sys.stdout.fileno()==1.

    You may have missed "if not_a_console(hStdout): real_stdout = False". not_a_console uses GetFileType and GetConsoleMode to check whether that handle is directed to something other than a console.

    But is it always the case?

    The technique used here for detecting a console is almost the same as the code for IsConsoleRedirected at http://blogs.msdn.com/b/michkap/archive/2010/05/07/10008232.aspx , or in WriteLineRight at http://blogs.msdn.com/b/michkap/archive/2010/04/07/9989346.aspx (I got it from that blog, can't remember exactly which page).

    [This code will give a false positive in the strange corner case that stdout/stderr is redirected to a console *input* handle. It might be better to use GetConsoleScreenBufferInfo instead of GetConsoleMode, as suggested by http://stackoverflow.com/questions/3648711/detect-nul-file-descriptor-isatty-is-bogus/3650507#3650507 .]
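
The GetFileType plus GetConsoleMode check can be sketched with ctypes as follows. This is an illustration, not the code from unicode2.py: the function name `fd_is_console` is hypothetical, and taking the handle from a file descriptor via `msvcrt.get_osfhandle` (instead of GetStdHandle) is a choice made for this example. Any failure is treated as "not a console", matching the conservative fallback described above. On non-Windows platforms it defers to `os.isatty`.

```python
import os
import sys

FILE_TYPE_CHAR = 0x0002
FILE_TYPE_REMOTE = 0x8000


def fd_is_console(fd):
    """Best-effort test for whether a file descriptor refers to a console."""
    if sys.platform != "win32":
        return os.isatty(fd)
    import ctypes
    import msvcrt
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetFileType.argtypes = [wintypes.HANDLE]
    kernel32.GetFileType.restype = wintypes.DWORD
    kernel32.GetConsoleMode.argtypes = [wintypes.HANDLE, wintypes.LPDWORD]
    kernel32.GetConsoleMode.restype = wintypes.BOOL
    try:
        handle = msvcrt.get_osfhandle(fd)
    except OSError:
        return False
    # Pipes and disk files are rejected before asking for a console mode.
    if (kernel32.GetFileType(handle) & ~FILE_TYPE_REMOTE) != FILE_TYPE_CHAR:
        return False
    mode = wintypes.DWORD()
    return bool(kernel32.GetConsoleMode(handle, ctypes.byref(mode)))
```

As the bracketed note says, GetConsoleMode also succeeds for console input handles, so this check inherits the same corner case.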

    What about pythonw.exe?

    I just tested that, using pythonw run from cmd.exe with stdout redirected to a file; it works as intended. It also works (for both console and non-console cases) when the handles are inherited from a parent process.

    Incidentally, what's the earliest supported Windows version for Py3k? I see that http://www.python.org/download/windows/ mentions Windows ME. I can fairly easily make it fall back to never using WriteConsoleW on Windows ME, if that's necessary.

    64b73bf9-3de7-49cd-8c73-5a4198eb3429 commented 13 years ago

    Note: Michael Kaplan's code checks whether GetConsoleMode failed due to ERROR_INVALID_HANDLE. My code intentionally doesn't do that, because it is correct and conservative to fall back to the non-console behaviour when there is *any* error from GetConsoleMode. (It could also fail due to not having the GENERIC_READ right on the handle, for example.)

    amauryfa commented 13 years ago

    Even if python.exe starts normally, py.test for example uses os.dup2() to redirect the file descriptors 1 and 2 to temporary files. sys.stdout.fileno() is still 1, the STD_OUTPUT_HANDLE did not change, but normal print() now goes to a file; but the proposed script won't detect this and will write to the console... Somehow we should extract the file handle from the file descriptor, with a call to _get_osfhandle() for example.
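
The redirection problem can be shown without touching the real stdout. In this sketch a pipe's write end plays the role of fd 1 and a saved duplicate plays the role of a handle cached at startup; all names are hypothetical, and comparing `os.fstat` results is assumed to be meaningful on the platform (it is on POSIX, less reliably so for pipes on Windows).

```python
import os
import tempfile


def same_target(fd_a, fd_b):
    """Return True if both descriptors refer to the same underlying file object."""
    a, b = os.fstat(fd_a), os.fstat(fd_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)


def demo_redirect():
    read_end, fake_stdout = os.pipe()   # fake_stdout stands in for fd 1
    cached = os.dup(fake_stdout)        # what a handle cached at startup still points to
    try:
        with tempfile.TemporaryFile() as capture:
            # Same trick py.test uses: the fd number stays, the target changes.
            os.dup2(capture.fileno(), fake_stdout)
            os.write(fake_stdout, b"redirected")
            result = (same_target(fake_stdout, cached),
                      same_target(fake_stdout, capture.fileno()))
            capture.seek(0)
            captured = capture.read()
    finally:
        for fd in (read_end, fake_stdout, cached):
            os.close(fd)
    return result, captured
```

After the dup2 call the descriptor number is unchanged, yet the cached duplicate and the descriptor no longer agree, which is why the console check has to start from the descriptor's current target.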

    64b73bf9-3de7-49cd-8c73-5a4198eb3429 commented 13 years ago

    "... os.dup2() ..."

    Good point, thanks.

    It would work to change os.dup2 so that if its second argument is 0, 1, or 2, it calls _get_osfhandle to get the Windows handle for that fd, and then reruns the console-detection logic. That would even allow Unicode output to work after redirection to a different console.

    Programs that directly called the CRT dup2 or SetStdHandle would bypass this. Can we consider such programs to be broken? Methinks a documentation patch for os.dup2 would be sufficient, something like:

    "When fd1 refers to the standard input, output, or error handles (0, 1 and 2 respectively), this function also ensures that state associated with Python's initial sys.{stdin,stdout,stderr} streams is correctly updated if needed. It should therefore be used in preference to calling the C library's dup2, or similar APIs such as SetStdHandle on Windows."

    terryjreedy commented 13 years ago

    http://www.python.org/dev/peps/pep-0011/ says

    Name: Win9x, WinME, NT4
    Unsupported in: Python 2.6 (warning in 2.5 installer)
    Code removed in: Python 2.6

    Only xp+ now. email sent to webmaster@...

    Even if the best fix only applies to win7, please include it.

    briancurtin commented 13 years ago

    I think we even agreed to drop 2000, although the PEP hasn't been updated and I couldn't find the supposed email where this was said.

    For implementing functionality that isn't supported on all Windows versions or architectures, you can look at PC/winreg.c for a few examples. DisableReflectionKey is a good example off the top of my head.

    vstinner commented 13 years ago

    Here are some results of my test of unicode2.py. I'm testing py3k on Windows XP, OEM: cp850, ANSI: cp1252.

    Raster fonts
    ------------

    With a fresh console, unicode2.py displays "?????????????????". input() accepts characters encodable to the OEM code page.

    If I set the code page to 65001 (chcp program+set PYTHONIOENCODING=utf-8; or SetConsoleCP() + SetConsoleOutputCP()), it displays weird characters. input() accepts ASCII characters, but non-ASCII characters (encodable to the console and OEM code pages) display weird characters (smileys! control characters?).

    Lucida console
    --------------

    With my system code page (OEM: cp850), characters not encodable to the code pages are displayed correctly. I can type some non-ASCII characters (encodable to the code page). If I copy/paste characters not encodable to the code page, they are replaced by a similar glyph (eg. Ł => L) or by ? (€ => ?).

    If I set the code page to 65001, all characters are still correctly displayed. But I cannot type non-ASCII characters anymore: input() fails with EOFError (I suppose that Python gets control characters).

    Redirect output to a pipe
    -------------------------

    I patched unicode2.py to use sys.stdout.buffer instead of sys.stdout for UnicodeOutput stream. I also patched UnicodeOutput to replace \n by \r\n.

    It works correctly with any character. No UTF-8 BOM is written. But "Here 1" is written at the end. I suppose that sys.stdout should be flushed before the creation of UnicodeOutput.
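
The misplaced "Here 1" is what one would expect when text is still pending in the old text layer while a second writer goes straight to the underlying byte stream. A small reproduction with in-memory objects, assuming CPython's io module; `interleave` is a hypothetical name and a BytesIO stands in for the pipe:

```python
import io


def interleave(flush_first):
    """Write through a text wrapper, then write to its underlying stream directly."""
    raw = io.BytesIO()
    text = io.TextIOWrapper(raw, encoding="utf-8", newline="\n")
    text.write("Here 1\n")       # held in the wrapper's pending buffer
    if flush_first:
        text.flush()             # push pending text before anyone else writes
    raw.write(b"direct\n")       # a second writer bypassing the wrapper
    text.flush()
    result = raw.getvalue()
    text.detach()                # keep the wrapper from closing `raw` on cleanup
    return result


if __name__ == "__main__":
    print(interleave(False))
    print(interleave(True))
```

Flushing the existing stream before installing a replacement keeps the output in program order.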

    But it always uses UTF-8. I don't know if UTF-8 is well supported by every application on Windows.

    Without unicode2.py, only characters encodable to OEM code page are supported, and \n is used as end of line string.

    Let's try to summarize
    ----------------------

    Tests:

    d1) Display characters encodable to the console code page
    t1) Type characters encodable to the console code page
    d2) Display characters not encodable to any code page
    t2) Type characters not encodable to any code page

    I'm using Windows with OEM=cp850 and ANSI=cp1252. For test (t2), I copy €-Ł and paste it to the console (right click on the window title > Edit > Paste).

    Raster fonts, console=cp850:

    d1) ok
    t1) ok
    d2) FAIL: €-Ł is displayed ?-L
    t2) FAIL: €-Ł is read as ?-L

    Raster fonts, console=cp65001:

    d1) FAIL: é is displayed as 2 strange glyphs
    t1) FAIL: EOFError
    d2) FAIL: only display unreadable glyphs
    t2) FAIL: EOFError

    Lucida console, console=cp850:

    d1) ok
    t1) ok
    d2) ok
    t2) FAIL: €-Ł is read as ?-L

    Lucida console, console=cp65001:

    d1) ok
    t1) FAIL: EOFError
    d2) ok
    t2) FAIL: EOFError

    So, setting the console code page to 65001 doesn't solve any issue, but it breaks the input (input with the keyboard or pasting text).

    With Raster fonts or Lucida console, it's possible to display characters encodable to the code page. But that is not new; it's already possible with Python 3. For characters not encodable to the code page, however, it works with unicode2.py and Lucida console, which is something new :-)

    For the input, I suppose that we need also to use a Windows console function, to support unencodable characters.

    vstinner commented 13 years ago

    ..., because right now, I'm including instructions for the user to (1) choose Lucida or Consolas font if they can't figure out any other font that gets rid of the square boxes (2) chcp 65001 (3) set PYTHONIOENCODING=UTF-8

    Why do you set the code page to 65001? In all my tests (on Windows XP), it always breaks the standard input.

    b5a9ce10-d67f-478f-ab78-b08d099eb753 commented 13 years ago

    Victor said: Why do you set the code page to 65001? In all my tests (on Windows XP), it always breaks the standard input.

    My response: Because when I searched Windows for Unicode and/or UTF-8 stuff, I found 65001, and it seems like it might help, and it does a bit. And then I find PYTHONIOENCODING, and that helps some. And that got me something that works better enough than what I had before, so I quit searching.

    You did a better job of analyzing and testing all the cases. I will have to go subtract the 65001 part and confirm your results; maybe it is useless now that other pieces of the puzzle are in place. Certainly with David-Sarah's code it seems not to be needed. Whether it was a necessary part of the previous workaround I am not sure, because of the limited number of cases I tried (I was trying to find something that worked well enough, but had neither enough knowledge to find David-Sarah's solution nor a good enough testing methodology to try the pieces independently).

    Thank you for your interest in this issue.

    0d272f2d-ac69-44ce-900d-8b7d0114cb9d commented 13 years ago

    Remember that cp65001 cannot be set on Windows. Also please read http://blogs.msdn.com/b/michkap/archive/2010/10/07/10072032.aspx and contact the author, Michael Kaplan from Microsoft, if you have more questions. I'm sure he will be glad to help.

    64b73bf9-3de7-49cd-8c73-5a4198eb3429 commented 13 years ago

    Feedback from Julie Solon of Microsoft:

    These console functions share a per-process heap that is 64K. There is some overhead, the heap can get fragmented, and calls from multiple threads all affect how much is available for this buffer.

    I am working to update the documentation for this function [WriteConsoleW] and other affected functions with information along these lines, and will post it within the next week or two.

    I replied thanking her and asking for clarification:

    When you say that the heap can get fragmented, is this true only when there are concurrent calls to the console functions, or can it occur even with single-threaded use? I'm trying to determine whether acquiring a process-global lock while calling these functions would be sufficient to ensure that the available heap space will not be unexpectedly low. (This assumes that the functions are not used outside the lock by other libraries in the same process.)

    ReadConsoleW seems also to be affected, incidentally.

    I've asked for clarification about whether acquiring a process-global lock when using these functions ... Julie
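
The process-global lock idea can be sketched as below. Whether serializing calls is actually sufficient is exactly the open question in this exchange, so this only shows the shape of the approach; `locked_write`, `_console_lock` and the 10000-character limit are hypothetical, and `write_chunk` stands for whatever function performs the real console call.

```python
import threading

_console_lock = threading.Lock()  # one lock shared by every caller in the process


def locked_write(write_chunk, text, limit=10000):
    """Feed `text` to `write_chunk` in bounded pieces while holding the global lock."""
    with _console_lock:
        for start in range(0, len(text), limit):
            write_chunk(text[start:start + limit])


if __name__ == "__main__":
    collected = []
    workers = [
        threading.Thread(target=locked_write, args=(collected.append, ch * 25, 10))
        for ch in "abcd"
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    print("".join(collected))
```

Holding the lock for the whole message, rather than per chunk, also keeps output from different threads from interleaving.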

    vstinner commented 13 years ago

    I did some tests with WriteConsoleW():

    Now I agree that WriteConsoleW() is the best solution to fix this issue.

    My test code (added to Python/sysmodule.c):

    static PyObject *
    sys_write_stdout(PyObject *self, PyObject *args)
    {
        PyObject *textobj;
        wchar_t *text;
        DWORD written, total;
        Py_ssize_t len, chunk;
        HANDLE console;
        BOOL ok;
        if (!PyArg_ParseTuple(args, "U:write_stdout", &textobj))
            return NULL;
        console = GetStdHandle(STD_OUTPUT_HANDLE);
        if (console == INVALID_HANDLE_VALUE) {
            PyErr_SetFromWindowsErr(GetLastError());
            return NULL;
        }
    
        text = PyUnicode_AS_UNICODE(textobj);
        len = PyUnicode_GET_SIZE(textobj);
        total = 0;
        while (len != 0) {
            if (len > 10000)
                /* WriteConsoleW() is limited to 64 KB (32,768 UTF-16 units), but
                   this limit depends on the heap usage. Use a safe limit of 10,000
                   UTF-16 units.
                   http://tahoe-lafs.org/trac/tahoe-lafs/ticket/1232 */
                chunk = 10000;
            else
                chunk = len;
            ok = WriteConsoleW(console, text, chunk, &written, NULL);
            if (!ok) 
                break;
            text += written;
            len -= written;
            total += written;
        }
        return PyLong_FromUnsignedLong(total);
    }
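
    The same chunking loop can be sketched in Python for experimentation. This is not a WriteConsoleW binding: `write_units` is a hypothetical callable that takes UTF-16-LE bytes and returns how many UTF-16 code units it consumed, and the 10,000-unit limit is simply the safe value used in the C code above.

```python
def write_chunked(text, write_units, max_units=10000):
    """Write text in chunks of at most max_units UTF-16 code units.

    write_units is a hypothetical stand-in for WriteConsoleW: it receives
    UTF-16-LE bytes and returns the number of code units it wrote.
    Returns the total number of code units written.
    """
    # surrogatepass keeps lone surrogates instead of raising.
    data = text.encode("utf-16-le", "surrogatepass")
    pos = 0      # byte offset into data
    total = 0    # code units written so far
    while pos < len(data):
        chunk = data[pos:pos + 2 * max_units]
        written = write_units(chunk)
        if written <= 0:
            # Treat "nothing written" as failure and stop, like the C loop.
            break
        pos += 2 * written
        total += written
    return total


def demo():
    """Feed 25,000 characters through a writer that accepts everything."""
    sizes = []

    def fake_write(chunk):
        sizes.append(len(chunk) // 2)
        return len(chunk) // 2

    return write_chunked("a" * 25000, fake_write), sizes
```

    Like the C version, the loop counts UTF-16 code units rather than characters, so a chunk boundary can fall between the two halves of a surrogate pair.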

    The question is now how to integrate WriteConsoleW() into Python without breaking the API, for example:

    64b73bf9-3de7-49cd-8c73-5a4198eb3429 commented 13 years ago

    (For anyone wondering about the hold-up on this bug, I ended up switching to Ubuntu. Not to worry, I now have Python 3 building in XP under VirtualBox -- which is further than I ever got with my broken Vista install :-/ It seems to behave identically to native XP as far as this bug is concerned.)

    Victor STINNER wrote:

    The question is now how to integrate WriteConsoleW() into Python without breaking the API, for example:

    • Should sys.stdout be a TextIOWrapper or not?

    It pretty much has to be a TextIOWrapper for compatibility. Also it's easier to implement it that way, because the text stream object has to be able to fall back to using the buffer if the fd is redirected.
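
    A minimal sketch of that fallback, under stated assumptions: `make_text_writer` and `console_write` are hypothetical names, and `isatty()` is used only as an approximation of "is attached to a console" (a real implementation would need to ask the console API).

```python
import io


def make_text_writer(stream, console_write):
    """Return a callable that writes a str to the right place.

    If the stream looks interactive, use console_write (a hypothetical
    stand-in for a WriteConsoleW-based writer). Otherwise fall back to
    the stream's own write path, as for a redirected fd.
    """
    if stream.isatty():
        return console_write
    return stream.write


def demo():
    """Show the fallback path with an in-memory (non-tty) stream."""
    redirected = io.StringIO()
    console_calls = []
    write = make_text_writer(redirected, console_calls.append)
    write("hello")
    return redirected.getvalue(), console_calls
```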

    • Should sys.stdout.fileno() return 1 or raise an error?

    Return sys.stdout.buffer.fileno(), which is 1 unless redirected.

    This is the Right Thing because in Windows, fds are an abstraction of the C runtime library, and the C runtime allows an fd to be associated with a console. In that case, from the application's point of view it is still writing to the same fd. In fact, we'd be implementing this by calling the WriteConsoleW win32 API directly in order to avoid bugs in the CRT's Unicode support, but that's an implementation detail.
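
    For reference, the existing io classes already behave this way: a TextIOWrapper reports the descriptor of its underlying buffer. The sketch below checks that on an ordinary file (os.devnull) rather than a console; `text_and_buffer_fileno` is a name made up for this example.

```python
import io
import os


def text_and_buffer_fileno(path=os.devnull):
    """Return (text.fileno(), buffer.fileno()) for a wrapped binary file."""
    with open(path, "wb") as buffer:
        text = io.TextIOWrapper(buffer, encoding="utf-8")
        result = (text.fileno(), buffer.fileno())
        # Detach so that closing is left to the with-block.
        text.detach()
    return result
```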

    • What about sys.stdout.buffer: should sys.stdout.buffer.write() call WriteConsoleA(), or should sys.stdout not have a buffer attribute?

    I was thinking that sys.std{out,err}.buffer would still be set up exactly as they are now. Then if an app writes to that buffer, it will get interleaved with any writes via the text stream. (The writes to the buffer go to the underlying fd, which probably ends up calling WriteFile at the win32 level.)

    I think that many modules and programs now rely on sys.stdout.buffer to write bytes directly to stdout. There is at least python -m base64.

    That would just work. The only caveat would be that if you write a partial line to the buffer object (or if you set the buffer object to be fully buffered and write to it), and then write to the text stream, the buffer wouldn't be flushed before the text is written. I think that is fine as long as it is documented.

    If an app sets the .buffer attribute of sys.std{out,err}, it would fall back to using that buffer in the same way as when the fd is redirected.

    • Should we use ReadConsoleW() for stdin?

    Yes. I'll probably start with a patch that just handles std{out,err}, though.

    64b73bf9-3de7-49cd-8c73-5a4198eb3429 commented 13 years ago

    I wrote:

    The only caveat would be that if you write a partial line to the buffer object (or if you set the buffer object to be fully buffered and write to it), and then write to the text stream, the buffer wouldn't be flushed before the text is written.

    Actually it looks like that already happens (because the sys.std{out,err} TextIOWrappers are line-buffered separately to their underlying buffers), so it would not be an incompatibility:

    $ python3 -c 'import sys; sys.stdout.write("foo"); sys.stdout.buffer.write(b"bar"); sys.stdout.write("baz\n")'
    barfoobaz

    64b73bf9-3de7-49cd-8c73-5a4198eb3429 commented 13 years ago

    I wrote:

    $ python3 -c 'import sys; sys.stdout.write("foo"); sys.stdout.buffer.write(b"bar"); sys.stdout.write("baz\n")'
    barfoobaz

    Hmm, the behaviour actually would differ here: the proposed implementation would print

    foobaz
    bar

    (the "foobaz\n" is written by a call to WriteConsoleW and then the "bar" gets flushed to stdout when the process exits).

    But since the naive expectation is "foobarbaz\n" and you already have to flush after each call in order to get that, I think this change in behaviour would be unlikely to affect correct applications.

    b5a9ce10-d67f-478f-ab78-b08d099eb753 commented 13 years ago

    Presently, a correct application only needs to flush between a sequence of writes and a sequence of buffer.writes.

    Don't assume the flush happens after every write, for a correct application.
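
    The current ordering rule can be reproduced without a console by stacking the same io layers over an in-memory stream. This is only an illustration of the existing behaviour: `interleave` is a made-up helper, and newline="\n" is passed so that the result does not depend on the platform's line separator.

```python
import io


def interleave(flush_between):
    """Mix text writes and buffer writes on an in-memory stream."""
    raw = io.BytesIO()
    buffer = io.BufferedWriter(raw)
    text = io.TextIOWrapper(
        buffer, encoding="utf-8", newline="\n", line_buffering=True
    )

    text.write("foo")       # no newline yet, so it stays in the text layer
    if flush_between:
        text.flush()        # push "foo" down before switching to bytes
    buffer.write(b"bar")    # held in the BufferedWriter
    if flush_between:
        buffer.flush()      # push "bar" down before switching back to text
    text.write("baz\n")     # newline triggers the line-buffered flush
    text.flush()
    return raw.getvalue()
```

    On CPython, interleave(False) should give b"barfoobaz\n", matching the output quoted earlier in the thread, while interleave(True) gives b"foobarbaz\n".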

    64b73bf9-3de7-49cd-8c73-5a4198eb3429 commented 13 years ago

    Glenn Linderman wrote:

    Presently, a correct application only needs to flush between a sequence of writes and a sequence of buffer.writes.

    Right. The new requirement would be that a correct app also needs to flush between a sequence of buffer.writes (that end in an incomplete line, or always if PYTHONUNBUFFERED or python -u is used), and a sequence of writes.

    Don't assume the flush happens after every write, for a correct application.

    It's rather hard to implement this without any change in behaviour. Or rather, it isn't hard if the TextIOWrapper were to flush its underlying buffer before each time it writes to the console, but I'd be concerned about the extra overhead of that call. I'd prefer not to do that unless the new requirement above leads to incompatibilities in practice.
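
    The "flush the underlying buffer before each console write" option could look roughly like this. It is a sketch under stated assumptions, not the proposed patch: `ConsoleTextWriter` is a hypothetical class, and `console_write` stands in for a WriteConsoleW-based function.

```python
import io


class ConsoleTextWriter:
    """Hypothetical text layer that flushes its byte buffer before writing.

    buffer plays the role of sys.stdout.buffer; console_write stands in
    for a function that sends text straight to the console.
    """

    def __init__(self, buffer, console_write):
        self.buffer = buffer
        self._console_write = console_write

    def write(self, text):
        # Push pending bytes out first so the output order is preserved.
        self.buffer.flush()
        self._console_write(text)
        return len(text)


def demo():
    """Record the order in which bytes and text reach the 'device'."""
    events = []

    class RecordingRaw(io.RawIOBase):
        def writable(self):
            return True

        def write(self, data):
            events.append(("bytes", bytes(data)))
            return len(data)

    buffer = io.BufferedWriter(RecordingRaw())
    text = ConsoleTextWriter(buffer, lambda s: events.append(("text", s)))
    text.write("foo")
    buffer.write(b"bar")
    text.write("baz\n")
    return events
```

    The cost is one extra flush call per text write, which is the overhead being weighed above.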

    b5a9ce10-d67f-478f-ab78-b08d099eb753 commented 13 years ago

    Would it suffice if the new scheme internally flushed after every buffer.write? It wouldn't be needed after write, because the correct application would already do one there?

    Am I off-base in supposing that the performance of buffer.write is expected to include a flush (because it isn't expected to be buffered)?

    vstinner commented 13 years ago

    On Friday 25 March 2011 at 00:54 +0000, David-Sarah Hopwood wrote:

    David-Sarah Hopwood <david-sarah@jacaranda.org> added the comment:

    I wrote:

    $ python3 -c 'import sys; sys.stdout.write("foo"); sys.stdout.buffer.write(b"bar"); sys.stdout.write("baz\n")'
    barfoobaz

    Hmm, the behaviour actually would differ here: the proposed implementation would print

    foobaz
    bar

    (the "foobaz\n" is written by a call to WriteConsoleW and then the "bar" gets flushed to stdout when the process exits).

    But since the naive expectation is "foobarbaz\n" and you already have to flush after each call in order to get that, I think this change in behaviour would be unlikely to affect correct applications.

    I would not call this "naive". "foobaz\nbar" is really weird. I think that sys.stdout and sys.stdout.buffer will both have to flush after each write, or they may be desynchronized.

    Some developers already think that adding sys.stdout.flush() after print("Processing.. ", end='') is too hard (bpo-11633). So I cannot imagine how they would react if they had to do it explicitly after every print(), sys.stdout.write() and sys.stdout.buffer.write() call.

    64b73bf9-3de7-49cd-8c73-5a4198eb3429 commented 13 years ago

    First a minor correction:

    The new requirement would be that a correct app also needs to flush between a sequence of buffer.writes (that end in an incomplete line, or always if PYTHONUNBUFFERED or python -u is used), and a sequence of writes.

    That should be "and only if PYTHONUNBUFFERED or python -u is not used".

    I also said:

    If an app sets the .buffer attribute of sys.std{out,err}, it would fall back to using that buffer in the same way as when the fd is redirected.

    but the .buffer attribute is readonly, so this case can't occur.

    Glenn Linderman wrote:

    Would it suffice if the new scheme internally flushed after every buffer.write? It wouldn't be needed after write, because the correct application would already do one there?

    Yes, that would be sufficient.

    Am I off-base in supposing that the performance of buffer.write is expected to include a flush (because it isn't expected to be buffered)?

    It is expected to be line-buffered. So an app might expect that printing characters one-at-a-time will have reasonable performance.

    In any case, given that the buffer of the initial std{out,err} will always be a BufferedWriter object (since .buffer is readonly), it would be possible for the TextIOWrapper to test a dirty flag in the BufferedWriter, in order to check efficiently whether the buffer needs flushing on each write. I've looked at the implementation complexity cost of this, and it doesn't seem too bad.
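
    In pure Python the dirty-flag idea might be sketched as follows. The real check would live in the C implementation; here `DirtyTrackingWriter`, its `dirty` attribute and `write_text` are all hypothetical names introduced for illustration, and `console_write` again stands in for the console call.

```python
import io


class DirtyTrackingWriter(io.BufferedWriter):
    """BufferedWriter subclass that remembers whether it was written to.

    The dirty attribute is an assumption of this sketch, not part of the
    io module's API.
    """

    def __init__(self, raw, buffer_size=io.DEFAULT_BUFFER_SIZE):
        super().__init__(raw, buffer_size)
        self.dirty = False

    def write(self, data):
        written = super().write(data)
        self.dirty = True
        return written

    def flush(self):
        super().flush()
        self.dirty = False


def write_text(buffer, console_write, text):
    """Flush the byte buffer only if it is dirty, then write the text."""
    if buffer.dirty:
        buffer.flush()
    console_write(text)
```

    With this shape, a run of text writes with no intervening buffer.write() pays only for an attribute test per call.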

    A similar issue arises for stdin: to maintain strict compatibility, every read from a TextIOWrapper attached to an input console would have to drain the buffer of its buffer object, in case the app has read from it. This is a bit tricky because the bytes drained from the buffer have to be converted to Unicode, so what happens if they end part-way through a multibyte character? Ugh, I'll have to think about that one.
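
    For the "part-way through a multibyte character" case, the standard incremental decoders already hold incomplete bytes back until the rest arrives. The sketch below only shows that mechanism on UTF-8 byte chunks; it does not settle how a console stdin would combine drained bytes with ReadConsoleW input, and `drain_to_text` is a made-up helper name.

```python
import codecs


def drain_to_text(chunks, encoding="utf-8"):
    """Decode byte chunks that may end part-way through a character.

    Returns one decoded piece per chunk, plus a final piece produced when
    the decoder is told that no more input is coming.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    pieces = [decoder.decode(chunk) for chunk in chunks]
    # final=True raises UnicodeDecodeError if an incomplete character
    # is still pending at this point.
    pieces.append(decoder.decode(b"", final=True))
    return pieces


def demo():
    """Split the two-byte encoding of 'é' across two chunks."""
    data = "é!".encode("utf-8")
    return drain_to_text([data[:1], data[1:]])
```

    The first chunk decodes to an empty string because the decoder is still waiting for the second byte of the character.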

    Victor STINNER wrote:

    Some developers already think that adding sys.stdout.flush() after print("Processing.. ", end='') is too hard (bpo-11633).

    IIUC, that bug is about the behaviour of 'print', and didn't suggest changing the fact that sys.stdout is line-buffered.

    By the way, are these changes going to be in a major release? If I understand correctly, the layout of structs (for standard library types not prefixed with '_', such as 'buffered' in bufferedio.c or 'textio' in textio.c) can change with major releases but not with minor releases, correct?

    64b73bf9-3de7-49cd-8c73-5a4198eb3429 commented 13 years ago

    I wrote:

    A similar issue arises for stdin: to maintain strict compatibility, every read from a TextIOWrapper attached to an input console would have to drain the buffer of its buffer object, in case the app has read from it. This is a bit tricky because the bytes drained from the buffer have to be converted to Unicode, so what happens if they end part-way through a multibyte character? Ugh, I'll have to think about that one.

    It seems like there is no correct way for an app to read from both sys.stdin and sys.stdin.buffer (even without these console changes). It must choose one or the other.

    b5a9ce10-d67f-478f-ab78-b08d099eb753 commented 13 years ago

    David-Sarah said: In any case, given that the buffer of the initial std{out,err} will always be a BufferedWriter object (since .buffer is readonly), it would be possible for the TextIOWrapper to test a dirty flag in the BufferedWriter, in order to check efficiently whether the buffer needs flushing on each write. I've looked at the implementation complexity cost of this, and it doesn't seem too bad.

    So if flush checks that bit, maybe TextIOWrapper could just call buffer.flush, and it would be fast if clean and slow if dirty? It would be called at the beginning of a text-level write, that is, which would let the char-at-a-time calls to buffer.write be fast.

    And I totally agree with msg132191