python / cpython

The Python programming language
https://www.python.org

unicode format does not really work in Python 2.x #59481

Closed 96e161cb-bd02-43a5-a229-77707c1e46e8 closed 4 years ago

96e161cb-bd02-43a5-a229-77707c1e46e8 commented 12 years ago
BPO 15276
Nosy @loewis, @vstinner, @ericvsmith, @ezio-melotti, @cjerdonek, @serhiy-storchaka
Dependencies
  • bpo-15952: format(value) and value.format() behave differently with unicode format
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:
    ```python
    assignee = None
    closed_at =
    created_at =
    labels = ['interpreter-core', 'type-bug', 'expert-unicode']
    title = 'unicode format does not really work in Python 2.x'
    updated_at =
    user = 'https://bugs.python.org/ArielBen-Yehuda'
    ```

    bugs.python.org fields:
    ```python
    activity =
    actor = 'serhiy.storchaka'
    assignee = 'none'
    closed = True
    closed_date =
    closer = 'serhiy.storchaka'
    components = ['Interpreter Core', 'Unicode']
    creation =
    creator = 'Ariel.Ben-Yehuda'
    dependencies = ['15952']
    files = []
    hgrepos = []
    issue_num = 15276
    keywords = []
    message_count = 19.0
    messages = ['164844', '164847', '164892', '164902', '164986', '165006', '170570', '170572', '170573', '170581', '170586', '170719', '170778', '170801', '170802', '171011', '174846', '216689', '370432']
    nosy_count = 9.0
    nosy_names = ['loewis', 'vstinner', 'eric.smith', 'ezio.melotti', 'Arfrever', 'chris.jerdonek', 'serhiy.storchaka', 'Ariel.Ben-Yehuda', 'petr.dlouhy@email.cz']
    pr_nums = []
    priority = 'normal'
    resolution = 'out of date'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue15276'
    versions = ['Python 2.7']
    ```

    96e161cb-bd02-43a5-a229-77707c1e46e8 commented 12 years ago

    Unicode formats (u'{:n}'.format) in Python 2.x assume that the thousands separator is ASCII, so this fails:

    >>> import locale
    >>> locale.setlocale(locale.LC_NUMERIC, 'fra') # or fr_FR on UNIX
    >>> u'{:n}'.format(10000)
    Traceback (most recent call last):
      File "<pyshell#3>", line 1, in <module>
        u'{:n}'.format(10000)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2: ordinal not in range(128)

    However, it works correctly in Python 3, properly returning '10\xa0000' (the \xa0 is a non-breaking space).

    cjerdonek commented 12 years ago

    Cf. the related bpo-7300: "Unicode arguments in str.format()".

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 12 years ago

    Ariel: would you like to provide a patch?

    96e161cb-bd02-43a5-a229-77707c1e46e8 commented 12 years ago

    I don't work on CPython

    berkerpeksag commented 12 years ago

    I can't reproduce this with Python 2.7.3.

    berker@wakefield ~[master*]$ python
    Python 2.7.3 (default, Apr 20 2012, 22:39:59) 
    [GCC 4.6.3] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import locale
    >>> locale.setlocale(locale.LC_NUMERIC, 'fr_FR')
    'fr_FR'
    >>> u'{:n}'.format(10000)
    u'10 000'

    serhiy-storchaka commented 12 years ago

    I confirm the bug on 2.7.

    $ ./python 
    Python 2.7.3+ (2.7:ab9d6c4907e7+, Apr 25 2012, 20:02:36) 
    [GCC 4.4.3] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import locale
    >>> locale.setlocale(locale.LC_NUMERIC, 'uk_UA.UTF-8')
    'uk_UA.UTF-8'
    >>> u'{:n}'.format(10000)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2: ordinal not in range(128)
    >>> '{:n}'.format(10000)
    '10\xc2\xa0000'

    cjerdonek commented 12 years ago

    I can't yet reproduce on my system, but after looking at the code, I believe the following are closer to the cause:

    >>> format(10000, u'n')
    >>> int.__format__(10000, u'n')

    Incidentally, on my system, the following note in the docs is wrong:

    "Note: format(value, format_spec) merely calls value.__format__(format_spec)."

    (from http://docs.python.org/library/functions.html?#format )

    >>> format(10000, u'n')
    u'10000'
    >>> 10000.__format__(u'n')
      File "<stdin>", line 1
        10000.__format__(u'n')
                       ^
    SyntaxError: invalid syntax
    >>> int.__format__(10000, u'n')
    '10000'

    Observe also that format() and int.__format__() return different types.

    ericvsmith commented 12 years ago

    The case with 10000.__format__ is confusing the parser. It sees: <floating point number 10000.> __format__, which is indeed a syntax error.

    Try:
    >>> 10000 .__format__(u'n')
    '10000'
    
    or:
    >>> (10000).__format__(u'n')
    '10000'

    cjerdonek commented 12 years ago

    "The case with 10000.__format__ is confusing the parser."

    Interesting, good catch! That error did seem unusual. The two modified forms do give the same result as int.__format__() (though the type still differs).

    cjerdonek commented 12 years ago

    I did some analysis of this issue.

    For starters, I could not reproduce this on Mac OS X 10.7.4. I iterated through all available locales, and the separator was ASCII in all cases.

    Instead, I was able to fake the issue by changing "," to "\xa0" in the following line--

    http://hg.python.org/cpython/file/820032281f49/Objects/stringlib/formatter.h#l651

    and then reproduce with:

    >>> u'{:,}'.format(10000)
      ..
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2: ordinal not in range(128)
    >>> format(10000, u',')
      ..
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2: ordinal not in range(128)

    However, note this difference (see also bpo-15952)--

    >>> (10000).__format__(u',')
    '10\xa0000'

    The issue seems to be that PyObject_Format() in Objects/abstract.c (which, unlike intformat() in Objects/intobject.c, does respect whether the format string is unicode or not) calls intformat() to get the formatted string as a byte string. It then passes this to PyObject_Unicode() to convert to unicode. This in turn calls PyUnicode_FromEncodedObject() with a NULL encoding, which causes that code to use PyUnicode_GetDefaultEncoding() for the encoding (i.e. sys.getdefaultencoding()).

    The right way to fix this seems to be to make intformat() return unicode as appropriate, which may mean modifying formatter.h's format_int_or_long_internal() to return unicode -- as well as taking into account the locale encoding when accessing the locale's thousands separator.
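The decoding step described above can be imitated without any particular locale installed. This is only a sketch: the byte string is hard-coded to the value from the traceback, the helper name `decode_like_default` is made up for illustration, and 'ascii' stands in for the usual Python 2 default encoding.

```python
# Formatted number as a byte string, with a Latin-1 non-breaking space (0xa0)
# as the thousands separator. Hard-coded here; normally it comes from the locale.
formatted_bytes = b'10\xa0000'


def decode_like_default(raw, encoding='ascii'):
    """Decode raw bytes; 'ascii' mimics Python 2's usual default encoding."""
    return raw.decode(encoding)


try:
    decode_like_default(formatted_bytes)
    decode_error = None
except UnicodeDecodeError as exc:
    # Same failure mode as u'{:n}'.format(10000) in the report.
    decode_error = exc

# Decoding with an encoding that matches the bytes succeeds.
decoded_latin1 = decode_like_default(formatted_bytes, 'latin-1')
```

The conversion only succeeds when the encoding used for decoding matches the encoding the locale used for the separator, which is why relying on the default encoding breaks.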

    cjerdonek commented 12 years ago

    Eric, it looks like you wrote this comment:

    /* don't define FORMAT_LONG, FORMAT_FLOAT, and FORMAT_COMPLEX, since we can live with only the string versions of those. The builtin format() will convert them to unicode. */

    in http://hg.python.org/cpython/file/19601d451d4c/Python/formatter_unicode.c

    It seems like the current issue may be a valid reason for introducing a unicode FORMAT_INT (i.e. not just for type-purity and PEP-3101 compliance, but to avoid an exception). What do you think?

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 12 years ago

    "What do you think?"

    [Even though I wasn't asked]

    I think we may need to close the issue as "won't fix". Depending on the exact change proposed, it may be that the return type for existing operations might change, which shouldn't be done in a bug fix release.

    People running into this issue should port to Python 3 (IMO).

    cjerdonek commented 12 years ago

    If we don't fix this (I'm leaning that way myself), I think we should somehow document the limitation. There are ways to acknowledge the limitation without getting into the specifics of this particular issue.

    vstinner commented 12 years ago

    I fixed a similar bug in Python 3.3: issue bpo-13706.

    changeset: 75231:f89e2f4cda88
    user: Victor Stinner <victor.stinner@haypocalc.com>
    date: Fri Feb 24 00:37:51 2012 +0100
    files: Include/unicodeobject.h Lib/test/test_format.py Objects/stringlib/asciilib.h Objects/stringlib/localeutil.h Objects/stringlib/stringdefs.h Objects/stringlib/ucs1lib.h
    description: Issue bpo-13706: Fix format(int, "n") for locale with non-ASCII thousands separator

    vstinner commented 12 years ago

    "I can't reproduce this with Python 2.7.3."

    >>> locale.setlocale(locale.LC_NUMERIC, 'fr_FR')
    'fr_FR'
    >>> u'{:n}'.format(10000)
    u'10 000'

    I don't understand why, but not all French locales are the same. Some French locales use the standard ASCII space (U+0020) as the thousands separator, while others use the non-breaking space (U+00A0). I suppose that some systems prefer to avoid non-ASCII characters to avoid "Unicode issues".

    On Ubuntu 12.04, locale.localeconv()['thousands_sep'] is chr(32) for the locale fr_FR.utf8.

    You may need to install other locales to test this issue. For example, the ps_AF locale uses U+066b as the decimal point and the thousands separator.

    I chose not to fix the issue in Python 3.2 because it would require changing too much code (I don't want to introduce a regression, and the 3.2 code is very different from 3.3). You should upgrade to Python 3.3, or reimplement the Unicode format() function for numbers using locale.localeconv() ('thousands_sep', 'decimal_point' and 'grouping') :-/

    Or find a more motivated developer. Or I can do the job if you pay me ;-)

    (Read also the issue bpo-13706 for more information.)
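The suggestion above to reimplement number formatting on top of locale.localeconv() could look roughly like the following. This is a sketch, not the fix from bpo-13706: the names `group_digits` and `format_int_locale` are invented here, only integers are handled (so 'decimal_point' is not used), and on Python 2 the separator is assumed to be encoded in the locale's preferred encoding.

```python
import locale


def _intervals(grouping):
    """Yield group sizes from a localeconv()-style 'grouping' list."""
    last = None
    for interval in grouping:
        if interval == locale.CHAR_MAX:
            return  # CHAR_MAX: no further grouping
        if interval == 0:
            if last is None:
                return
            while True:  # 0: repeat the last group size
                yield last
        yield interval
        last = interval


def group_digits(digits, thousands_sep, grouping):
    """Insert thousands_sep into a string of digits, working from the right."""
    if not thousands_sep:
        return digits
    groups = []
    rest = digits
    for size in _intervals(grouping):
        if len(rest) <= size:
            break
        groups.append(rest[-size:])
        rest = rest[:-size]
    groups.append(rest)
    return thousands_sep.join(reversed(groups))


def format_int_locale(value, encoding=None):
    """Format an integer as text using the current LC_NUMERIC settings."""
    conv = locale.localeconv()
    sep = conv['thousands_sep']
    if isinstance(sep, bytes):
        # Python 2 returns a byte string; the encoding is an assumption.
        sep = sep.decode(encoding or locale.getpreferredencoding(True))
    sign = u'-' if value < 0 else u''
    return sign + group_digits(u'%d' % abs(value), sep, conv['grouping'])
```

Keeping `group_digits` free of any locale lookup makes the grouping logic testable with an explicit separator such as u'\xa0', independent of which locales are installed.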

    cjerdonek commented 12 years ago

    I have a brief documentation patch in mind for this, but it relies on documentation issue bpo-15952 being addressed first (e.g. to say that format(value) returns Unicode when format_spec is Unicode and that value.__format__() can return a string of type str). So I'm marking bpo-15952 as a dependency.

    vstinner commented 12 years ago

    "If we don't fix this (I'm leaning that way myself), I think we should somehow document the limitation. There are ways to acknowledge the limitation without getting into the specifics of this particular issue."

    I agree with documenting the limitation and closing this issue as "wontfix".

    A workaround is to format as a bytes string, and then decode the result from the locale encoding. It looks like locale.getpreferredencoding(True) should be used, not locale.getpreferredencoding(False).
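A minimal sketch of that workaround follows. The helper names `to_text` and `format_number` are hypothetical, and the code assumes the thousands separator is encoded in the encoding reported by locale.getpreferredencoding(True), which may not hold if LC_NUMERIC and LC_CTYPE are set to locales with different encodings.

```python
import locale


def to_text(formatted, encoding=None):
    """Return formatted as text, decoding it first if it is a byte string."""
    if isinstance(formatted, bytes):  # bytes is str on Python 2
        if encoding is None:
            # Assumption: the separator uses the locale's preferred encoding.
            encoding = locale.getpreferredencoding(True)
        return formatted.decode(encoding)
    return formatted  # already text (Python 3)


def format_number(value):
    """Format with a byte-string format spec, then decode the result."""
    # '{:n}' is a byte string on Python 2, so no implicit ASCII decode happens.
    return to_text('{:n}'.format(value))
```

On Python 3 the decode branch is never taken, so the same call works unchanged there.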

    29d0b6a5-2876-4ed0-a186-c2c487c65be3 commented 10 years ago

    For anyone stuck on Python 2.x, here is a workaround (maybe it could find its way into the documentation also):

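      def fix_grouping(bytestring):
          # Python 2 only: `unicode` is the text type there.
          try:
              # Works when the default encoding (normally ASCII) can decode the result.
              return unicode(bytestring)
          except UnicodeDecodeError:
              # Otherwise assume the locale encodes the separator as UTF-8.
              return bytestring.decode("utf-8")

    serhiy-storchaka commented 4 years ago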

    Python 2.7 is no longer supported.