Invalid non-ASCII unicode string handling

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. import and initialize colorama with defaults
2. print u"Some non-ASCII text ТЕСТ Русского"

What is the expected output? What do you see instead?
===
Some non-ASCII text ТЕСТ Русского
===
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)
===

What version of the product are you using? On what operating system?
Python 2.7 (x32)
Windows 7 x64 Ultimate (with Eng/Rus locales)

Please provide any additional information below.
It looks like the wrapped write method does not inherit or use the original stdout encoding. 
There are two possible fixes:

A. Use sys.setdefaultencoding(Your-Console-OEM-Encoding). This is the wrong way, IMHO: I 
don't know a simple method to determine the right console mode (ANSI/OEM) and the OEM 
encoding other than reading the 'stdout.encoding' property.

B. Patch ansitowin32 to force-encode unicode output before .write:

--- D:\lg\py\colorama-0.1.18\colorama\ansitowin32.py    Tue May 18 14:43:54 2010
+++ ansitowin32.py  Wed Feb 23 19:10:40 2011
@@ -144,7 +144,10 @@

     def write_plain_text(self, text, start, end):
         if start < end:
-            self.wrapped.write(text[start:end])
+            if isinstance(text, unicode):
+                self.wrapped.write(text[start:end].encode(self.wrapped.encoding))
+            else:
+                self.wrapped.write(text[start:end])
             self.wrapped.flush()
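
The idea behind patch B can be sketched independently of colorama. This is only an illustration, not the library's code: the helper name `encode_for_stream` and its `fallback` and `errors` parameters are hypothetical. The fallback covers streams whose `encoding` attribute is missing or None, which can happen when output is redirected, and which the patch above does not handle.

```python
def encode_for_stream(text, stream, fallback="utf-8", errors="replace"):
    """Return `text` as bytes in the stream's encoding (hypothetical helper).

    Byte strings are passed through untouched; unicode text is encoded
    with the stream's own encoding, or `fallback` if the stream has none.
    """
    if isinstance(text, bytes):
        return text
    encoding = getattr(stream, "encoding", None) or fallback
    return text.encode(encoding, errors)


class FakeConsole(object):
    """Stand-in for a console stream that reports an OEM code page."""
    encoding = "cp866"


class FakePipe(object):
    """Stand-in for a redirected stream with no known encoding."""
    encoding = None


russian = u"\u0422\u0435\u0441\u0442"
as_console_bytes = encode_for_stream(russian, FakeConsole())
as_pipe_bytes = encode_for_stream(russian, FakePipe())
```

Using `errors="replace"` trades an exception for a question mark when a character has no representation in the console code page; whether that trade is acceptable is a design decision for the caller.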

Original issue reported on code.google.com by av1024@gmail.com on 23 Feb 2011 at 4:30

GoogleCodeExporter commented 9 years ago

Thanks for the bug report.

Original comment by tart...@gmail.com on 23 Feb 2011 at 5:57

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

There are vague reports that fixing the colorama stdout wrapping to happen 
during init, rather than at import time, may have implications for unicode 
handling.

Original comment by tart...@gmail.com on 14 Oct 2013 at 7:54

GoogleCodeExporter commented 9 years ago

Hi. I'm finally looking at this, but my unicode knowledge is fairly hazy.

I'm trying to reproduce at the interactive interpreter on Win7 in a cmd.exe 
running Python2.7, but not having much luck thus far.

If I try your example above:

>>> import colorama
>>> colorama.init()
>>> print 

and then right after the print (before a newline) I paste your example unicode 
string above, u"Some non-ASCII text ТЕСТ Русского", then my console 
doesn't seem to recognise the non-ascii characters in the last two words - they 
appear as question marks. Then when I press enter, the print statement executes 
fine, without any exception, but the output also just ends in question marks.

If I try some other non-ASCII characters, then this works fine: e.g.

>>> print u"Some non-ASCII \u00e9coöperate"
Some non-ASCII écoöperate

Can anyone help me put together a test case to reproduce the problem? Maybe 
using u'\uXXXX' characters?

Original comment by tart...@gmail.com on 20 Apr 2014 at 9:17

GoogleCodeExporter commented 9 years ago

Hmm... Since migrating to Linux I have no ready-to-run example code :(

1. You should use u'\uXXXX' escapes in the interactive console, or write a simple .py file in 
UTF-8 encoding.
2. The test string should contain unicode symbols with code points > 255. For example 
(the same word "Test" in English and in Russian): print 
u'Test:\u0422\u0435\u0441\u0442'
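
To see why the choice of test characters matters, here is a small sketch that checks which codecs can represent a given string. The helper name `can_encode` and the particular codecs are illustrative choices, not anything colorama uses.

```python
def can_encode(text, encoding):
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


accented = u"\u00e9"                          # e with acute accent, code point < 256
cyrillic = u"Test:\u0422\u0435\u0441\u0442"   # code points > 255

for codec in ("ascii", "latin-1", "cp866"):
    print("%s: accented=%s cyrillic=%s" % (
        codec, can_encode(accented, codec), can_encode(cyrillic, codec)))
```

A string that fails under "ascii" is enough to trigger the reported error; Cyrillic text additionally rules out the Western European code pages.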

Original comment by av1024@gmail.com on 21 Apr 2014 at 5:56

GoogleCodeExporter commented 9 years ago

For now I have the code below in my colored logger init module. But it has not been 
tested for more than 2 years, and I don't remember what issues there were.

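import codecs
import sys

ansi = False  # assumed module-level default; set to True once colorama is initialised


def setup_console(sys_enc='utf-8', use_colorama=True):
    """
    Set sys.defaultencoding to `sys_enc` and update stdout/stderr writers to corresponding encoding

    .. note:: For Win32 the OEM console encoding will be used instead of `sys_enc`
    """
    global ansi
    # Python 2 only: reload(sys) restores sys.setdefaultencoding, which site.py removes
    reload(sys)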
    try:
        if sys.platform.startswith("win"):
            import ctypes
            enc = "cp%d" % ctypes.windll.kernel32.GetOEMCP()
        else:
            enc = (sys.stdout.encoding if sys.stdout.isatty() else
                        sys.stderr.encoding if sys.stderr.isatty() else
                            sys.getfilesystemencoding() or sys_enc)

        if sys.getdefaultencoding().lower() != sys_enc.lower():
            sys.setdefaultencoding(sys_enc)

        if sys.stdout.isatty() and sys.stdout.encoding != enc:
            sys.stdout = codecs.getwriter(enc)(sys.stdout, 'replace')

        if sys.stderr.isatty() and sys.stderr.encoding != enc:
            sys.stderr = codecs.getwriter(enc)(sys.stderr, 'replace')

        if use_colorama and sys.platform.startswith("win"):
            try:
                from colorama import init
                init()
                ansi = True
            except:
                pass

    except:
        pass
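
The core of the function above is the `codecs.getwriter(enc)(stream, 'replace')` wrapping. A minimal sketch of just that step, using an in-memory byte buffer in place of the real console so it can be run anywhere; the helper name `wrap_with_replace` and the choice of cp866 are assumptions for illustration only.

```python
import codecs
import io


def wrap_with_replace(byte_stream, encoding):
    """Wrap a byte stream so unicode text is encoded on write.

    Characters missing from `encoding` are written as '?' instead of
    raising UnicodeEncodeError.
    """
    return codecs.getwriter(encoding)(byte_stream, "replace")


buf = io.BytesIO()                  # stands in for the console's byte stream
writer = wrap_with_replace(buf, "cp866")
writer.write(u"Test:\u0422\u0435\u0441\u0442 \u00ed")
written = buf.getvalue()
```

Here the Cyrillic letters survive because cp866 contains them, while the accented i does not exist in that code page and is replaced.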

Original comment by av1024@gmail.com on 21 Apr 2014 at 6:01

GoogleCodeExporter commented 9 years ago

@tartley, regarding comment 3

tartley> and then right after the print (before a newline) I paste your example 
unicode string above, u"Some non-ASCII text ТЕСТ Русского", then my 
console doesn't seem to recognise the non-ascii characters in the last two 
words - they appear as question marks.

This is a failure of pasting into the Python console, as demonstrated by:
  - copy the string "Some non-ASCII text ТЕСТ Русского", including the quotes, to the clipboard
  - in the Python console, type s = u
  - paste the clipboard content after the u
  - press Enter to go to the next line
  - print repr(s)

I got

Python 2.6.6 (r266:84297, Aug 24 2010, 18:46:32) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> s = u"Some non-ASCII text ???? ????????"
>>> print repr(s)
u'Some non-ASCII text ???? ????????'
>>>

Clearly the paste was unsuccessful, and the pasted characters were replaced with question marks.
(By the way, if you paste the string into a Python script and run it in a Windows cmd console, 
the OP's case is reproducible; see the attached test_decode.py.)
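
A quick way to tell whether a pasted string still contains the intended characters is to list its non-ASCII code points. This is a sketch; the helper name `describe_chars` is hypothetical.

```python
def describe_chars(text):
    """Return (character, code point) pairs for each non-ASCII character."""
    return [(ch, "U+%04X" % ord(ch)) for ch in text if ord(ch) > 127]


mangled = u"Some non-ASCII text ???? ????????"           # what the console kept
intact = u"Some non-ASCII text \u0422\u0415\u0421\u0422"  # escaped form of the first word

mangled_report = describe_chars(mangled)   # empty: nothing non-ASCII survived
intact_report = describe_chars(intact)     # one entry per Cyrillic letter
```

An empty report for a string that was supposed to contain Cyrillic text means the test never exercised the encoding path at all.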

A better string is a single i with acute accent like 'í'

>>> import colorama
>>> colorama.init()
>>> s = u'í'
>>> print repr(s)
u'\xed'
>>> print s
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python26\lib\site-packages\colorama-0.3.1-py2.6.egg\colorama\ansitowin32.py", line 35, in write
    self.__convertor.write(text)
  File "C:\Python26\lib\site-packages\colorama-0.3.1-py2.6.egg\colorama\ansitowin32.py", line 116, in write
    self.write_and_convert(text)
  File "C:\Python26\lib\site-packages\colorama-0.3.1-py2.6.egg\colorama\ansitowin32.py", line 143, in write_and_convert
    self.write_plain_text(text, cursor, len(text))
  File "C:\Python26\lib\site-packages\colorama-0.3.1-py2.6.egg\colorama\ansitowin32.py", line 148, in write_plain_text
    self.wrapped.write(text[start:end])
UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 0: ordinal not in range(128)
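
The same kind of failure can be imitated without colorama or a Windows console. The sketch below is an approximation under stated assumptions: it uses Python 3's `io.TextIOWrapper` over an in-memory buffer rather than the Python 2 file object from the traceback, and the helper name `try_write` is hypothetical.

```python
import io


def try_write(text, encoding):
    """Write `text` to an in-memory text stream that uses `encoding`.

    Returns "ok" on success, or a short message if the stream's codec
    cannot represent the text.
    """
    stream = io.TextIOWrapper(io.BytesIO(), encoding=encoding)
    try:
        stream.write(text)
        stream.flush()
    except UnicodeEncodeError as exc:
        return "failed: %s" % exc.reason
    return "ok"


ascii_result = try_write(u"\xed", "ascii")     # codec has no accented i
cp1252_result = try_write(u"\xed", "cp1252")   # Western European code page has it
```

The point is that the outcome depends only on which codec sits between the text and the byte stream, which is why the wrapped stream's encoding matters.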

We found a similar colorama traceback in Nikola, the static blog generator 
(https://github.com/getnikola/nikola/issues/1288).

May I ask what text type colorama expects? unicode, bytes, or should both be 
acceptable? If bytes, does it assume some specific encoding?

Original comment by ccanepacc@gmail.com on 18 May 2014 at 1:45

Attachments:

test_decode.py

GoogleCodeExporter commented 9 years ago

Migrated to https://github.com/tartley/colorama/issues/36;
closing as duplicate.

Original comment by tart...@gmail.com on 18 Feb 2015 at 1:51

Changed state: Duplicate
