python / cpython

The Python programming language
https://www.python.org

Encoding issue in the name of the local DST timezone #77440

Open 53cd72ed-7768-4f34-9566-9995ab0e9dae opened 6 years ago

53cd72ed-7768-4f34-9566-9995ab0e9dae commented 6 years ago
BPO 33259
Nosy @pganssle, @maggyero, @tirkarthi

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

```python
assignee = None
closed_at = None
created_at =
labels = ['type-bug', 'library']
title = 'Encoding issue in the name of the local DST timezone'
updated_at =
user = 'https://github.com/maggyero'
```

bugs.python.org fields:

```python
activity =
actor = 'p-ganssle'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation =
creator = 'maggyero'
dependencies = []
files = []
hgrepos = []
issue_num = 33259
keywords = []
message_count = 2.0
messages = ['315181', '325334']
nosy_count = 3.0
nosy_names = ['p-ganssle', 'maggyero', 'xtreak']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue33259'
versions = ['Python 3.6']
```

53cd72ed-7768-4f34-9566-9995ab0e9dae commented 6 years ago

There seems to be an encoding bug in Python 3.6.5 on Windows with the timezone constant time.tzname:

    >>> import time
    >>> time.tzname
    ('Paris, Madrid', 'Paris, Madrid (heure d\x92été)')

In the second string (the name of the local DST timezone), the escape sequence \x92 denotes the Unicode code point U+0092 PRIVATE USE TWO (PU2), because it appears in a character string rather than a byte string. The expected code point is U+2019 RIGHT SINGLE QUOTATION MARK, which would have been displayed as ’ or \u2019, giving 'Paris, Madrid (heure d’été)'.

This \x92 evidently comes from the byte 0x92, which is the CP-1252 encoding of the ’ character, but the byte has somehow been mishandled in time.tzname.
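The symptom can be reproduced without Windows. The sketch below assumes (this is a guess at the mechanism, not something confirmed from the CPython source) that the CP-1252 bytes of the timezone name were decoded as if they were Latin-1, which maps every byte to the code point of the same value. The byte string `raw` is a hypothetical reconstruction of what the OS would return.

```python
# Hypothetical CP-1252 bytes for the DST name (an assumption for illustration).
raw = b"Paris, Madrid (heure d\x92\xe9t\xe9)"

# Decoding with the matching codec yields the expected apostrophe U+2019.
as_cp1252 = raw.decode("cp1252")

# Decoding as Latin-1 maps byte 0x92 to U+0092, matching the reported value.
as_latin1 = raw.decode("latin-1")

# If the Latin-1 assumption holds, the damage is reversible:
# re-encode to the original bytes, then decode with the right codec.
repaired = as_latin1.encode("latin-1").decode("cp1252")

print(ascii(as_cp1252))
print(ascii(as_latin1))
print(repaired == as_cp1252)
```

If that assumption is right, `time.tzname[1].encode('latin-1').decode('cp1252')` might serve as a temporary workaround on an affected system, though it would only be valid where the ANSI code page really is CP-1252.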

Indeed, quoting the ‘Lexical analysis’ chapter from the *Language Reference*:

> In a bytes literal, hexadecimal and octal escapes denote the byte with the given value. In a string literal, these escapes denote a Unicode character with the given value.
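A short check of that distinction, using only the standard library. It is meant as an illustration of the quoted rule and does not touch time.tzname itself:

```python
import unicodedata

s = "\x92"   # string literal: one character, code point U+0092
b = b"\x92"  # bytes literal: one byte with value 0x92

print(hex(ord(s)))              # code point of the character
print(unicodedata.category(s))  # general category of U+0092 (a C1 control)
print(s.encode("utf-8"))        # U+0092 takes two bytes in UTF-8
print(b[0] == 0x92)             # the bytes literal holds the raw byte

expected = "\u2019"
print(unicodedata.name(expected))     # the character that should appear
print(expected.encode("cp1252") == b)  # its CP-1252 encoding is that byte
```

So the same escape \x92 names a control character in a string but the CP-1252 apostrophe byte in a bytes object, which is consistent with the value seen in time.tzname.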

tirkarthi commented 6 years ago

It seems that formatting timezone names on Windows has a lot of issues. I don't know if this is related to the ones reported before, but I'd like to add a reference to a comment listing more of them: https://bugs.python.org/msg302937

Thanks