python / cpython

The Python programming language
https://www.python.org
Other
63.46k stars 30.39k forks source link

windows console doesn't print or input Unicode #45943

Closed a09e2537-b6b9-4978-9d1d-78b1db358cbc closed 8 years ago

a09e2537-b6b9-4978-9d1d-78b1db358cbc commented 16 years ago
BPO 1602
Nosy @malemburg, @mhammond, @terryjreedy, @pfmoore, @amauryfa, @ncoghlan, @pitrou, @giampaolo, @tjguk, @mark-summerfield, @ned-deily, @ezio-melotti, @florentx, @4kir4, @lilydjwg, @berkerpeksag, @vadmium, @eryksun, @zooba, @davispuh
Superseder
  • bpo-28217: Add interactive console tests
  • Files
  • sys_write_stdout.patch
  • unicode2.py
  • doc-patch.diff: Proposed changes to user-visible documentation
  • unicode3.py
  • win_console.patch
  • test_win_console.py
  • streams.py
  • wincontest.py: Example io.TextIOWrapper sublcass using WideCharToMultiByte
  • winconsoleio.diff
  • 1602_2.patch
  • 1602_3.patch
  • 1602_4.patch
  • 1602_5.patch
  • 1602_6.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = 'https://github.com/zooba' closed_at = created_at = labels = ['type-bug', 'expert-unicode', 'OS-windows'] title = "windows console doesn't print or input Unicode" updated_at = user = 'https://github.com/mark-summerfield' ``` bugs.python.org fields: ```python activity = actor = 'THRlWiTi' assignee = 'steve.dower' closed = True closed_date = closer = 'steve.dower' components = ['Unicode', 'Windows'] creation = creator = 'mark' dependencies = [] files = ['19493', '20320', '20363', '23461', '23470', '23471', '36120', '40990', '44094', '44290', '44379', '44409', '44449', '44452'] hgrepos = [] issue_num = 1602 keywords = ['patch'] message_count = 148.0 messages = ['58487', '58621', '58651', '87086', '88059', '88077', '92854', '94445', '94480', '94483', '94496', '108173', '108228', '116801', '120414', '120415', '120416', '120700', '125823', '125824', '125826', '125833', '125852', '125877', '125889', '125890', '125898', '125899', '125938', '125942', '125947', '125956', '126286', '126288', '126303', '126304', '126308', '126319', '127782', '131657', '131854', '132060', '132061', '132062', '132064', '132065', '132067', '132184', '132191', '132208', '132266', '132268', '145898', '145899', '145963', '145964', '146471', '148990', '157569', '160812', '160813', '160897', '161151', '161153', '161308', '161651', '164572', '164578', '164580', '164618', '164619', '170899', '170915', '170999', '185135', '197700', '197751', '197752', '197773', '221175', '221178', '223403', '223404', '223507', '223509', '223945', '223946', '223947', '223948', '223949', '223951', '223952', '224019', '224086', '224095', '224596', '224605', '224690', '227329', '227330', '227332', '227333', '227337', '227338', '227347', '227354', '227373', '227374', '227441', '227450', '228191', '228208', '228210', '233347', '233350', '233916', '233937', '234019', '234020', '234096', '234371', '242884', '254405', '254407', '272596', '272605', '272645', '272662', '272675', '272716', '272718', '272720', '273999', '274449', '274673', '274884', '274906', '274912', '274939', '275003', '275004', '275005', '275157', '275362', '275510', '277047', '277048', '277050'] nosy_count = 38.0 nosy_names = ['lemburg', 'mhammond', 'terry.reedy', 'paul.moore', 'tzot', 'amaury.forgeotdarc', 'ncoghlan', 'pitrou', 'giampaolo.rodola', 'tim.golden', 'mark', 'ned.deily', 'christoph', 'ezio.melotti', 'v+python', 'hippietrail', 'flox', 'THRlWiTi', 'davidsarah', 'santoso.wijaya', 'akira', 'David.Sankel', 'python-dev', 'smerlin', 'lilydjwg', 'berker.peksag', 'martin.panter', 'piotr.dobrogost', 'eryksun', 'Drekin', 'steve.dower', 'wiz21', 'stijn', 'Jonitis', 'gurnec', 'escapewindow', 'dead1ne', 'davispuh'] pr_nums = [] priority = 'high' resolution = 'fixed' stage = 'resolved' status = 'closed' superseder = '28217' type = 'behavior' url = 'https://bugs.python.org/issue1602' versions = ['Python 3.6'] ```

    ncoghlan commented 10 years ago

    Drekin, it would be good to be able to incorporate some of your improvements for Python 3.5. Before we could do that, we'd need to review and agree to the PSF Contributor Agreement at https://www.python.org/psf/contrib/contrib-form/

    The underlying licensing situation for CPython is a little messy (albeit in a way that doesn't impact users or redistributors), so we use the contributor agreement to ensure we continue to have the right to distribute Python under its current license without making the history any messier, and to preserve the option of switching to a simpler standard license at some point in the future (if it ever becomes feasible to do so).

    90ff3b65-5c08-4f89-bb59-3f03c3cccbb5 commented 10 years ago

    Stefan Champailler:

    The crash you see is maybe not a crash at all. First it has nothing to do with printing, the problem is reading of your input line. That explains why Python exited even before printing the traceback of the SyntaxError. If you try to read input using sys.stdin.buffer.raw.read(100) and type Unicode characters, it returns just empty bytes b''. So maybe Python REPL then thinks the input just ended and so standardly exits the interpreter.

    Why are you using chcp 65001? As far as I know, it doesn't give you the ability to use Unicode in the console. It somehow helps with printing, but there are some issues. print("\N{euro sign}") prints the right character, but it prints additional blank line. sys.stdout.write("\N{euro sign}") and sys.stdout.buffer.write("\N{euro sign}".encode("cp65001")) does the same, but sys.stdout.buffer.raw.write("\N{euro sign}".encode("cp65001")) works as expected.

    If you want to enter and display Unicode in Python on Windows console, try my package win_unicode_console, which tries to solve the issues. See https://pypi.python.org/pypi/win_unicode_console.

    90ff3b65-5c08-4f89-bb59-3f03c3cccbb5 commented 10 years ago

    Nick Coghlan: Ok, done.

    ncoghlan commented 10 years ago

    Drekin: thanks! That should get processed by the PSF Secretary before too long, and the "*" to indicate you have signed it will appear by your name.

    9b76dcf1-c6e5-4017-9e13-b31dcebd7cc6 commented 10 years ago

    Dear Drekin,

    The crash you see is maybe not a crash at all. First it has nothing to do with printing, the problem is reading of your input line.

    I guessed that, but thanks for pointing out.

    So maybe Python REPL then thinks the input just ended and so standardly exits the interpreter.

    Yes. I have showed that because the line of code seemed perfectly valid and innocuous (I moved to Python3 because I *need* good unicode/encodings support). The answer from the REPL is, to me, very suprising. I would have expected a badly displayed character at least and a syntax error at worst. I consider myself quite aware of unicode issues but without any output from the repl, I'd have very hard times figuring out what went wrong, hence my bug report.

    So even though this might not qualify as the worse bug in Python, I'd say it is actually quite misleading. But see no complaint here, I'm very happy with Python in general. It's just that I thought I had to tell it to the dev team.

    Why are you using chcp 65001?

    I thought it'd help me with printing unicode (I tried CP437 but problem is the EURO sign is not there, and I *do* need eurosign :-)). But I'll readily admit I didn't read all the stuff about encoing issues on Windows console before trying.

    try my package win_unicode_console, which tries to solve the issues.

    I'll certainly do that.

    Thank you for your answer

    Stefan

    mhammond commented 10 years ago

    The crash you see is maybe not a crash at all.

    I'd call it a "crash" - the repl shouldn't exit. But it's not necessarily part of *this* bug.

    terryjreedy commented 10 years ago

    Stefan, the Idle Shell handles the BMP subset of Unicode quite well.

    >>> print('€')
    €
    >>>

    It is superior to the Windows console in other ways too. For instance, cut and paste work normally as for other Windows windows.

    (cp65001 is know to be buggy and essentially useless. Check the results in any search engine.)

    90ff3b65-5c08-4f89-bb59-3f03c3cccbb5 commented 10 years ago

    Idle shell handles Unicode characters well, but one cannot enter them using deadkey combinations. See http://bugs.python.org/issue22408.

    9b76dcf1-c6e5-4017-9e13-b31dcebd7cc6 commented 10 years ago

    Thank you all for your quick and good answers. This level of responsiveness is truly amazing.

    I've played a bit with IPython and it works just fine. I can type the eurosign drectly with "Alt Gr - E" (so I didn't enter a unicode code). So the bug is basically solved for me. But the python-repl behaviour still looks strange to me. So here's a successful IPython session :

    C:\PORT-STCA2\pl-PRIVATE\horse>chcp 65001 Active code page: 65001

    C:\PORT-STCA2\pl-PRIVATE\horse>ipython Python 3.4.1 (v3.4.1:c0e311e010fc, May 18 2014, 10:38:22) [MSC v.1600 32 bit (Intel)] Type "copyright", "credits" or "license" for more information.

    IPython 2.2.0 -- An enhanced Interactive Python. ? -> Introduction and overview of IPython's features. %quickref -> Quick reference. help -> Python's own help system. object? -> Details about 'object', use 'object??' for extra details.

    In [1]: print('€')

    In [2]:

    ncoghlan commented 10 years ago

    Aye, IPython has the advantage of running in a fully initialised browser, with the backend in a fully initialised Python environment.

    CPython's setting up the standard streams for the default REPL at a much lower level, and there are quite a few problems with the way we're currently doing it.

    I think Drekin's pointed the way towards substantially improving the situation for 3.5, though.

    1e31820c-81fc-44df-b6ce-a9e6bfd187c9 commented 10 years ago

    New here, but I think this is the correct issue to get info about this unicode problem. On the windows console:

    chcp Active code page: 437

    type utf.txt Привет

    chcp 65001 Active code page: 65001

    type utf.txt Привет

    python --version Python 3.5.0a0

    cat utf.py f = open('utf.txt') l = f.readline() print(l) print(len(l))

    python utf.py Привет �²ÐµÑ‚ �‚

    13

    cat utf_explicit.py import codecs f = codecs.open('utf.txt', encoding='utf-8', mode='r') l = f.readline() print(l) print(len(l))

    python utf_explicit.py Привет ет

    7

    I partly read through the page but these things are a bit above my head. Could anyone explain

    type utf2.txt aαbβcγdδ

    cat utf2.py import streams import codecs streams.enable() f = codecs.open('utf2.txt', encoding='utf-8', mode='r') print(f.read(1)) print(f.read(1)) print(f.read(2)) print(f.read(4))

    python utf2.py a α bβc γdδ

    90ff3b65-5c08-4f89-bb59-3f03c3cccbb5 commented 10 years ago

    stijn: You are mixing two issues here. One is reading text from a file. There is no problem with it. You just call open(path, encoding=the_encoding_of_the_file). Since the encoding of the file depends on the file, you should provide the information about it.

    Another issue is interactively entering and displaying Unicode characters in Python REPL in Windows console. That's what is this issue about. The streams code you use is outdated, for recent version see https://pypi.python.org/pypi/win_unicode_console and https://github.com/Drekin/win-unicode-console. It's an installable package which tries to solve the issue. The readme also contains a summary of the issue. Try the package and let me know if there is any problem.

    1e31820c-81fc-44df-b6ce-a9e6bfd187c9 commented 10 years ago

    Drekin: you're right for both input and output. Using encoding with plain open() works just fine and using the latest win-unicode-console does give correct output for the second example as well. Thanks!

    b5a9ce10-d67f-478f-ab78-b08d099eb753 commented 9 years ago

    Just to note that another side effect of this bug is that stepping through code where the source contains non-ASCII characters results in pdb producing an error when trying to print the source lines. This makes stepping through such source code impossible.

    I mention it, because it hasn't been mentioned before, and debuggers are mysterious and low-level enough, that solutions that might work for normal code, may not solve working with the debugger...

    90ff3b65-5c08-4f89-bb59-3f03c3cccbb5 commented 9 years ago

    I tried the following code:

    import pdb
    pdb.set_trace()
    print(1 + 2)
    print("αβγ∫")

    When run in vanilla Python it indeed ends with UnicodeEncodeError as soon as it hits the line with non-ASCII characters. However, the solution via win_unicode_console package seems to work correctly. There is just an issue when you keep calling 'next' even after the main program ended. It ends with a RuntimeError after a few iterations. I didn't know that pdb can continue debugging after the main program has ended.

    5816ef42-463e-4fc5-927b-6bbe33b746a8 commented 9 years ago

    Drekins module at https://github.com/Drekin/win-unicode-console is great, but there is small issue with it when running within debugger in Visual Studio (Python Tools for Visual Studio 2.1 installed). Debugger already wraps stdout and stderr inside the visualstudio_py_debugger._DebuggerOutput wrapper and it does not have the fileno() method which win-unicode-console stream.py check_stream() expects. I've created potential fix for it at https://github.com/Drekin/win-unicode-console/pull/4/commits that checks whether object has old_out and uses it to get to fileno. There might be much more robust ways to check for wrappers. I just wanted to make you aware, if this code will be used as basis for Python 3.5.

    zooba commented 9 years ago

    It sounds like the script should handle the case where someone has already changed stdout better. We wrap the streams in PTVS so we can forward the output into the IDE where Unicode will display properly anyway.

    Our wrapper missing fileno is a bug in our side, but finding the original one will break output forwarding.

    90ff3b65-5c08-4f89-bb59-3f03c3cccbb5 commented 9 years ago

    Note that win-unicode-console replaces the stdio streams rather than wraps them. So the desired state would be Unicode stream objects wrapped by PTVS. There would be no problem if win-unicode-console stream replacement occured before PTVS wraps them, which should be the case when Unicode streams for Windows are hadled by Python 3.5 itself. Is there any way to run custom Python code (like sitecustomize) before PTVS wraps the stdio streams?

    5816ef42-463e-4fc5-927b-6bbe33b746a8 commented 9 years ago

    Presumably Unicode streams would also fix file redirects. Currently, if you want to redirect stdout output to file it throws. For example PowerShell: C:\Python34\python.exe .\test.py | out-file -Encoding utf8 -FilePath 'test.txt'

    90ff3b65-5c08-4f89-bb59-3f03c3cccbb5 commented 9 years ago

    File redirection has nothing to do with win-unicode-console and this issue. When stdout is redirected, it is not a tty so win-unicode-console doesn't replace the stream object, which is the right thing to do. You got UnicodeEncodeError because Python creates sys.stdout with encoding based on your locale. In my case it is cp1250 which cannot encode whole Unicode. You can control the encoding used by setting PYTHONIOENCODING environment variable. For example, if you have a script producer.py, which prints some Unicode characters, and consumer.py, which just does print(input()), then "py producer.py | py consumer.py" shows that redirection works (when PYTHONIOENCODING is set e.g. to utf-8).

    mhammond commented 9 years ago

    File redirection has nothing to do with win-unicode-console

    Thank you, that comment is spot on - there are multiple issues being conflated here. This bug is purely about the tty/console behaviour.

    ncoghlan commented 9 years ago

    It sounds like fixing this properly requires fixing bpo-17620 first (so the interactive interpreter actually uses sys.stdin), so I've flagged that as a dependency.

    61b53079-cd21-406a-b489-d4746f6d2ec7 commented 9 years ago

    I've tried addressing the output problem by subclassing TextIOWrapper to use the windows functions GetConsoleOutputCP and WideCharToMultiByte.

    I've tested this as well as I can without figuring out how to install a better font for the windows console. It appears to work on both python 3.4 and 2.7 although there may be an issue with 2.7 and CJK Extension B and higher codepoints.

    Hopefully this is useful in finally resolving the issue. Also I think some maintenance patch for 2.7 is in order as currently it fails utterly if you set the console to 65001 since it doesn't recognize it. Had to wrap all print statements in try/except so it wouldn't fail before testing the wrapper.

    90ff3b65-5c08-4f89-bb59-3f03c3cccbb5 commented 9 years ago

    dead1ne: Hello, I'm maintaining a package that tries to solve this issue: https://github.com/Drekin/win-unicode-console . There are actually many related problems.

    zooba commented 8 years ago

    I'm now actively working on this for 3.6.

    I've attached my first pass at implementing an alternative raw IO stream that uses the *ConsoleW APIs instead of the CRT. It works fine for basic print() and input() (including handling redirection "properly", which is a separate issue to change the default encoding there, and not bpo-17620 yet).

    I expect there to be many *many* compatibility issues with this change, so we really need everyone interested to try it out and see what doesn't work. So far I haven't even tried looking at readline hooks or similar (though maybe all those issues fall under bpo-17620?).

    Any *specific, technical* information about compatibility issues would be appreciated (i.e. enough that I can fix the issue without having to completely reproduce your setup - I'll be working on doing those myself anyway, so simply saying "X is broken" isn't helpful yet).

    It doesn't look like this will be available in 3.6.0a4, but I think I should be able to land it by the first beta.

    90ff3b65-5c08-4f89-bb59-3f03c3cccbb5 commented 8 years ago

    Hello Steve, that's great you are working on this!

    I've ran through your patch and I have the following remarks:

    • Since wide chars have two bytes, there may be problem when someone wants to read or write odd number of bytes. If the number is > 1, it's ok since the code may read or write less bytes, but when the number is exactly 1, the code should maybe raise some exception.

    • WriteConsoleW always fails with ERROR_NOT_ENOUGH_MEMORY (8) if we try to write more than a certain number of bytes. For me, the number is something like 41000. Unfortunately, it depends on actual heap usage of the console process. I do len = min(len, 32767) in write. The the value chosen comes from bpo-11395 .

    • If someone types something like ^Zfoo, the standard sys.stdin returns '' -- it ignores everything after EOF if it is the first byte read. I reproduce the bahaviour in win_unicode_console to be compatible.

    • There may be some issue when someone hits Ctrl-C on input. It seems that in that case, ReadConsoleW fails with ERROR_OPERATION_ABORTED (995) and some signal is asynchronously fired. It may happen that the corresponding KeyboardInterrupt exception occurs later that it should. In my Python/ctypes situation I do an ugly hack – I detect ERROR_OPERATION_ABORTED and in that case I sleep for 0.1 seconds to wait for the exception. I understand that the situation may me different in C.

    vadmium commented 8 years ago

    For compatibility, I think it may be good to add custom implementations of the buffer attribute and detach() method to stdin/out. They should be able to at least read and write ASCII bytes. It might be easiest to keep them as the current BufferedReader/Writer objects. Probably also make stdin/out.fileno() defer to the buffer attribute.

    With the current patch that only allows reading and writing in UTF-16 pairs, I forsee a few problems:

    90ff3b65-5c08-4f89-bb59-3f03c3cccbb5 commented 8 years ago

    There is also the following consequence of (not) having the standard filenos: input() either considers the streams interactive or not. To consider them interactive, standard filenos and isatty are needed on sys.stdin and sys.stdout.

    If the streams are considered interactive, input() goes via readlinehook machinery, otherwise it just writes and reads an ordinary file.

    The latter means we don't have to touch readline machinery now, the downside is that custom rlcompleters like pyreadline won't work on input().

    zooba commented 8 years ago

    The current patch actually only affects the raw IO, so the concern would be one of the wrappers trying to work in bytes when it should be dealing in characters. This should be no different from reading a UTF16 file, so either both work or both are broken.

    The readline API is most annoying because it assumes strlen is valid for any encoded text (and at so many places it's near unfixable), but there's another issue for this part.

    Also, I don't have answers for most of the questions in the review on the patch because I copied all of those bits from fileio.c. Can certainly clean parts of them up for the console API, but I count compatibility with the FileIO class a useful goal where possible.

    zooba commented 8 years ago

    I'm fairly happy with where my current patch is at (not posted right now - too many different machines involved) and only one test is failing - test_cgi.

    The problem seems to be that urllib.parse.unquote() takes an encoding parameter to decode utf-8 encoded bytes with, and cgi infers this parameter from sys.stdin. I don't have the slightest idea why unquote/unquote_to_bytes unconditionally encodes with utf-8 and then allows decoding with an arbitrary encoding, but I guess it works okay for ASCII-compatible encodings?

    Unfortunately, utf-16-le is not ASCII compatible, and so this doesn't work. I'm not familiar enough with cgi or urllib.parse to know what to fix - any suggestions?

    zooba commented 8 years ago

    For more info here, cgi.parse has code like this:

    def parse(fp, ...):
        if fp is None:
            fp = sys.stdin
    
        encoding = getattr(fp, 'encoding', 'latin-1')
    # later on...
    
    return urllib.parse.parse_qs(a_str, encoding=encoding, ...)

    As an easy hack, I added this after assigning encoding:

        if len(' '.encode(encoding, errors='replace')) > 1:
            encoding = 'latin-1'

    I have no idea if this is a good idea or not. The current behaviour of mojibake in the parsed result is certainly worse, since the choice of utf-16-le is entirely contained within the parse() function.

    vadmium commented 8 years ago

    I think this CGI thing is a separate bug, just exacerbated by the stdin.encoding problem. :) The urllib.parse.parse_qs() function takes an encoding parameter to figure out what to do with percent-encoded values: "%A9" → b"\xA9".decode(...). This is different lower-level encoding: b"%A9".decode("ascii").

    Maybe the best solution is just to remove the encoding argument, and let it revert to UTF-8, as it did before r87998. Or maybe it really should use the locale encoding. (Is that ASCII-compatible on Windows?) It really depends on where the query string was generated (in a browser, pre-computed URL, etc).

    zooba commented 8 years ago

    New patch attached (1602_2.patch - hopefully the review will work this time too).

    I discovered while researching for the PEP that a decent amount of code expects to be able to write ASCII to sys.stdout.buffer (or sys.stdout.buffer.raw). As my first patch required utf-16-le at this point, it was going to cause havoc.

    Rather than break that compatibility, I decided that exposing utf-8 and doing the reencoding at the latest possible stage was better. This is also more consistent with how other encoding issues are likely to be resolved, and shouldn't be any less performant, given that previously we were decoding to utf-16 anyway.

    The downsides of this is that read(n) now can only read up to n/4 characters, and write(n) has a much more complicated time dealing with large buffers (as we need to cap the number of utf-16-le bytes but return the number of utf-8 bytes - it's not a direct relationship, so there's more work and a little bit of guessing in some cases).

    On the upside, the readline handling is simpler as utf-8 is compatible with the existing interface and now sys.stdin.encoding is accurate. I've rolled that fix into this patch (just the myreadline.c change) as they really ought to go in together.

    zooba commented 8 years ago

    Updated patch. This implements everything we've been discussing on python-dev

    zooba commented 8 years ago

    Latest patch is attached.

    PEP acceptance is sounding likely, so feel free to critically review.

    zooba commented 8 years ago

    Updated patch based on some suggestions from Eryk.

    The PEP has been accepted, so now I just need to land it in the next two days.

    Currently "normal" usage here is fine, and some edge cases match the Python 3.5 behaviour. I'm going to go through now and bulk out the tests to try and catch more problems, but modulo that I hope the implementation is nearly ready.

    zooba commented 8 years ago

    I can't actually come up with many useful tests for this... so far I can validate that we can open the console IO object from 0, 1, 2, "CON", "CONIN$" and "CONOUT$", get fileno(), check readable()/writable() and close (multiple times without crashing).

    Anything else requires a real console with a real person with a real keyboard.

    But I fixed a couple of issues in fd handling as a result of the tests, so it's not a complete waste.

    berkerpeksag commented 8 years ago

    I left some minor comments for Doc/whatsnew/3.6.rst on Rietveld.

    In Lib/test/test_winconsoleio.py:

      if __name__ == '__main__':
          unittest.main()
    zooba commented 8 years ago

    Thanks! I've made the changes you suggested.

    vadmium commented 8 years ago

    +++ b/Lib/test/testwinconsoleio.py +to real people with real keyborads. Should be keyboards There are still assert() calls in this file (1602_6.patch). Did you miss them?

    +++ b/Lib/io.py +from _io import WindowsConsoleIO +all.append('WindowsConsoleIO') I think you should either document this class, or remove it from __all__ to clarify it is just an implementation detail.

    +++ b/Modules/_io/winconsoleio.c +_io_WindowsConsoleIOinitimpl + PyObject *decodedname = Py_None; + Py_INCREF(decodedname); + int d = PyUnicode_FSDecoder(nameobj, (void*)&decodedname); Won’t this leak a reference to Py_None? (Also, I think needless casting like in the last line can mask mistakes that the compiler would otherwise pick up. Imagine if you got the parameters around the wrong way.)

    +read_console_w(HANDLE handle, DWORD maxlen, DWORD *readlen) { + / If we didn't read a full buffer that time, don't try + again or we will block a second time. \/ I’m not familiar with the Windows APIs involved, but this doesn’t seem robust. What if there were exactly one full buffer waiting, would the next call block without returning anything?

    vadmium commented 8 years ago

    Ah sorry I see Berker’s assert() comment was _after you posted 1602_6.patch, so ignore that bit :)

    vadmium commented 8 years ago

    Also as I understand it, the open() function can return this new class, so the documentation at \https://docs.python.org/3.6/library/functions.html#open\ needs updating.

    1762cc99-3127-4a62-9baf-30c3d0f51ef7 commented 8 years ago

    New changeset 6142d2d3c471 by Steve Dower in branch 'default': Issue bpo-1602: Windows console doesn't input or print Unicode (PEP-528) https://hg.python.org/cpython/rev/6142d2d3c471

    eryksun commented 8 years ago

    Martin, the console should be in line-input mode, in which case ReadConsole will block if there isn't at least one line in the input buffer. It reads up to the lesser of a complete line or the number of UTF-16 codes requested. If the previous call read the entire request size but didn't stop on '\n', then we know the next call shouldn't block because the input buffer has at least one '\n' in it.

    I can validate that we can open the console IO object from 0, 1, 2, "CON", "CONIN$" and "CONOUT$", get fileno(), check readable()/writable() and close (multiple times without crashing).

    I like the idea to have fileno() lazily get a file descriptor on demand, but _open_osfhandle is a low I/O function that uses _open flags -- not 'rb' (int 0x7262) or 'wb' (int 0x7762). ;-)

    You can use _O_RDONLY | _O_BINARY or _O_WRONLY | _O_BINARY. But really those values would be ignored anyway. It's not actually opening the file, so it only cares about a few flags. Specifically, in lowio\osfinfo.cpp I see that it looks for _O_APPEND, _O_TEXT, and _O_NOINHERIT.

    On line 329, the following assignment

        if (self->writable)
            access |= GENERIC_WRITE;

    should be access = GENERIC_WRITE. Requesting both read and write access is an invalid parameter when opening "CON", as can be seen here:

        >>> f = open('CON', 'wb', buffering=0)
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        OSError: [WinError 87] The parameter is incorrect: 'CON'

    CONOUT$ works, of course:

        >>> f = open('CONOUT$', 'wb', buffering=0)
        >>> f
        <_io._WindowsConsoleIO mode='wb' closefd=True>

    Lastly, for a readall that starts with ^Z, you're still breaking out of the loop before incrementing len, which is thus 0 when subsequently checked. It ends up calling WideCharToMultiByte with len == 0, which fails.

        >>> sys.stdin.buffer.raw.read()
        ^Z
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        OSError: [WinError 87] The parameter is incorrect

    I can't actually come up with many useful tests for this...

    ctypes can be used to write to the input buffer and read from a screen buffer. For the latter it helps to first create and activate a scratch screen buffer, initialized to NULs to make it easy to read back everything that was written up to the current cursor position. I have existing ctypes code for this, written to solve the problem of a subprocess that stubbornly writes directly to the console instead of writing to stdout/stderr pipes.

    vadmium commented 8 years ago

    Okay so regarding blocking reads with a full buffer, what you are saying is the second check to break the read loop should be sufficient:

    +/ If the buffer ended with a newline, break out */ +if (buf[readlen - 1] == '\n') + break;

    19c24d5a-3f3d-4fd4-9bae-14ed5acc00ce commented 8 years ago

    Steve Dower (steve.dower)

    [...] Anything else requires a real console with a real person with a real keyboard.

    FYI, not really, it is possible to fully automatically test console's output/input using WinAPI functions like WriteConsoleInput, GetConsoleScreenBufferInfo, ReadConsoleOutputCharacter

    very recently I wrote such test, you can look at it as example http://review.source.kitware.com/gitweb?p=KWSys.git;a=blob;f=testConsoleBuf.cxx;hb=HEAD

    it tests all 3 cases when output is actual console, redirected pipe and file.

    zooba commented 8 years ago

    Oh nice, I like that. We should definitely add some tests using that (though it seems like quite a big task... maybe I'll open a new issue for it).

    zooba commented 8 years ago

    Created bpo-28217 for adding these tests.