python / cpython

The Python programming language
https://www.python.org

[doc] email.header.Header.__unicode__ does not decode header #61707

Open 61a746e0-56f6-4f74-bbde-a98c5612db23 opened 11 years ago

61a746e0-56f6-4f74-bbde-a98c5612db23 commented 11 years ago
BPO 17505
Nosy @warsaw, @hniksic, @bitdancer, @JelleZijlstra, @Vidhyavinu
PRs
  • python/cpython#31765
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:
    ```python
    assignee = None
    closed_at = None
    created_at =
    labels = ['easy', 'type-bug', 'expert-email', '3.10', '3.11', '3.9', 'docs']
    title = '[doc] email.header.Header.__unicode__ does not decode header'
    updated_at =
    user = 'https://github.com/hniksic'
    ```

    bugs.python.org fields:
    ```python
    activity =
    actor = 'vidhya'
    assignee = 'docs@python'
    closed = False
    closed_date = None
    closer = None
    components = ['Documentation', 'email']
    creation =
    creator = 'hniksic'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 17505
    keywords = ['patch', 'easy']
    message_count = 12.0
    messages = ['184856', '184894', '184896', '184897', '185028', '414228', '414234', '414273', '414508', '414530', '414542', '414758']
    nosy_count = 6.0
    nosy_names = ['barry', 'hniksic', 'r.david.murray', 'docs@python', 'JelleZijlstra', 'vidhya']
    pr_nums = ['31765']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue17505'
    versions = ['Python 3.9', 'Python 3.10', 'Python 3.11']
    ```

    61a746e0-56f6-4f74-bbde-a98c5612db23 commented 11 years ago

    The __unicode__ method is documented to "return the header as a Unicode string". For this to be useful, I would expect it to decode a string such as "=?gb2312?b?1eLKx9bQzsSy4srUo6E=?=" into a Unicode string that can be displayed to the user, in this case u'\u8fd9\u662f\u4e2d\u6587\u6d4b\u8bd5\uff01'.

    However, unicode(header) returns the not so useful u"=?gb2312?b?1eLKx9bQzsSy4srUo6E=?=". Looking at the code of __unicode__, it appears that the code does attempt to decode the header into Unicode, but this fails for Headers initialized from a single MIME-quoted string, as is done by the message parser. In other words, __unicode__ is failing to call decode_header.

    Here is a minimal example demonstrating the problem:

    >>> msg = email.message_from_string('Subject: =?gb2312?b?1eLKx9bQzsSy4srUo6E=?=\n\nfoo\n')
    >>> unicode(msg['subject'])
    u'=?gb2312?b?1eLKx9bQzsSy4srUo6E=?='

    Expected output of the last line: u'\u8fd9\u662f\u4e2d\u6587\u6d4b\u8bd5\uff01'

    To get the fully decoded Unicode string, one must use something like:
    >>> u''.join(unicode(s, c) for s, c in email.header.decode_header(msg['subject']))

    which is unintuitive and hard to teach to new users of the email package. (And looking at the source of __unicode__ it's not even obvious that it's correct — it appears that a space must be added before us-ascii-coded chunks. The joining is non-trivial.)

    The same problem occurs in Python 3.3 with str(msg['subject']).
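
For readers on Python 3, the manual chunk-by-chunk decoding described above could be sketched roughly as follows. The helper name `decode_chunks` is hypothetical and not part of the email package; the sketch assumes that `decode_header` yields `bytes` chunks for encoded words and a plain `str` when nothing is encoded, and it makes no attempt to handle the spacing between chunks mentioned above.

```python
from email.header import decode_header

raw = "=?gb2312?b?1eLKx9bQzsSy4srUo6E=?="


def decode_chunks(value):
    """Hypothetical helper: decode each (chunk, charset) pair and join them."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # Encoded words come back as bytes plus the declared charset;
            # fall back to ASCII when no charset is given.
            parts.append(chunk.decode(charset or "ascii"))
        else:
            # A header without encoded words is returned as str unchanged.
            parts.append(chunk)
    return "".join(parts)


decoded = decode_chunks(raw)
print(decoded)
```

This is only meant to illustrate why the joining step is easy to get wrong by hand, not to recommend it as the preferred approach.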

    61a746e0-56f6-4f74-bbde-a98c5612db23 commented 11 years ago

    An example of the confusion that lack of a clear "convert to unicode" method creates is illustrated by this StackOverflow question: http://stackoverflow.com/q/15516958/1600898

    bitdancer commented 11 years ago

    I agree that this is not the world's best API. However, it is the API that we have in 2.7/3.2, and we can't change how Header.__unicode__ behaves without breaking backward compatibility.

    What we could do is add an example of how to use this API to get unicode strings to the top of the docs:

       >>> unicode(make_header(decode_header('=?gb2312?b?1eLKx9bQzsSy4srUo6E=?=')))
       u'\u8fd9\u662f\u4e2d\u6587\u6d4b\u8bd5\uff01'

    But you already know about that.

    In Python 3.3 you get this:

       >>> msg = message_from_string("subject: =?gb2312?b?1eLKx9bQzsSy4srUo6E=?=\n\n", policy=default)
       >>> msg['subject']
       '这是中文测试！'

    So, I'll make this a doc bug.

    bitdancer commented 11 years ago

    Erg, somehow I failed to read the second half of your message before writing mine...clearly you *didn't* know about that idiom, so the doc patch is obviously an important thing to do.

    To clarify about the 3.3 example: the policy=default is key, it tells the email package to use the new (currently provisional) policy code to provide improved handling of header decoding and encoding.

    61a746e0-56f6-4f74-bbde-a98c5612db23 commented 11 years ago

    Thanks for pointing out the make_header(decode_header(...)) idiom, which I was indeed not aware of. It solves the problem perfectly.

    I agree that it is a doc bug. While make_header is documented in the same place as decode_header and Header itself, it is not explained *why* I should call it if I already have in hand a perfectly valid Header instance. Specifically, it is not at all clear that unicode(h) and unicode(make_header(decode_header(h))) will return different things -- I would have expected make_header(decode_header(h)) to return an object indistinguishable from h.
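
A small Python 3 sketch of the difference described above, using `str()` in place of Python 2's `unicode()`. The variable names are illustrative only, and the header is built from a raw string here rather than taken from a parsed message.

```python
from email.header import Header, decode_header, make_header

raw = "=?gb2312?b?1eLKx9bQzsSy4srUo6E=?="

# A Header built directly from the MIME-encoded text keeps that text as-is.
original = Header(raw)
as_is = str(original)

# Round-tripping through decode_header/make_header yields decoded chunks.
round_tripped = make_header(decode_header(raw))
decoded = str(round_tripped)

print(as_is)
print(decoded)
```

The two Header objects wrap the same wire-level value, yet only the round-tripped one stringifies to readable text, which is the surprise the paragraph above points out.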

    Also, the policy=default parameter in Python 3 sounds great, it's exactly what one would expect.

    9de9bbd3-5ab9-413d-b13c-609591df0d97 commented 2 years ago

    [Entry level contributor seeking guidance] If this is still open, I would like to work on it.

    Also, I am planning to add the following (if no PR has been created yet) to the make_header API section at https://docs.python.org/3/library/email.header.html :

    To get unicode strings use the API as shown below: unicode(make_header(decode_header('=?gb2312?b?1eLKx9bQzsSy4srUo6E=?=')))

    If the email policy parameter is set to 'policy.default', then the default policy for that Python version is used for header encoding and decoding.

    Please correct me if anything is wrong.

    JelleZijlstra commented 2 years ago

    The messages above are very old and seem to be discussing Python 2. There is no __unicode__ method any more, for example, though there is a __str__ method which presumably does what __unicode__ used to do. It is documented now at https://docs.python.org/3.10/library/email.header.html#email.header.Header.__str__ . You'll have to do some more digging to figure out whether the OP's complaint still applies.

    9de9bbd3-5ab9-413d-b13c-609591df0d97 commented 2 years ago

    The latest versions 3.9, 3.10 and 3.11 are listed in the issue, so I thought it still applies.

    @irit: Any suggestions on what needs to be done for current revisions?

    61a746e0-56f6-4f74-bbde-a98c5612db23 commented 2 years ago

    > Any suggestions on what needs to be done for current revisions?

    Hi! I'm the person who submitted this issue back in 2013. Let's take a look at how things are in Python 3.10:

    Python 3.10.2 (main, Jan 13 2022, 19:06:22) [GCC 10.3.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import email
    >>> msg = email.message_from_string('Subject: =?gb2312?b?1eLKx9bQzsSy4srUo6E=?=\n\nfoo\n')
    >>> msg['Subject']
    '=?gb2312?b?1eLKx9bQzsSy4srUo6E=?='

    So the headers are still not decoded by default. The unicode() invocation in the original description was just an attempt to get a Unicode string out of a byte string (assuming it was correctly decoded from MIME, which it wasn't). Since Python 3 strings are Unicode already, I'd expect to just get the decoded subject - but that still doesn't happen.

    The correct way to make it happen is to specify policy=email.policy.default:

    >>> msg = email.message_from_string('Subject: =?gb2312?b?1eLKx9bQzsSy4srUo6E=?=\n\nfoo\n', policy=email.policy.default)
    >>> msg['Subject']
    '这是中文测试！'

    The docs should point out that you really want to specify the "default" policy (strangely named, since it's not in fact the default). The current docs only say that message_from_string() is "equivalent to Parser().parsestr(s)." and that policy is interpreted "as with the Parser class constructor". The docs of the Parser constructor don't document policy at all, except for the version when it was added.
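
As a rough illustration of the equivalence the docs mention, the same policy argument can be passed straight to the Parser constructor. This is a minimal sketch with illustrative variable names, not a proposal for the doc wording.

```python
import email.policy
from email.parser import Parser

text = "Subject: =?gb2312?b?1eLKx9bQzsSy4srUo6E=?=\n\nfoo\n"

# Without an explicit policy the parser uses compat32 and leaves the
# header value in its MIME-encoded form.
legacy = Parser().parsestr(text)
legacy_subject = legacy["Subject"]

# With policy=email.policy.default the header is decoded on access.
modern = Parser(policy=email.policy.default).parsestr(text)
modern_subject = str(modern["Subject"])

print(legacy_subject)
print(modern_subject)
```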

    So, if you want to work on this, my suggestion would be to improve the docs in the following ways:

    bitdancer commented 2 years ago

    The policy is named 'default' because it was intended to become the default two feature releases after the new email code became non-provisional (first: deprecate not specifying an explicit policy, next release make default the default policy and make the deprecation only cover compat32). However, for various reasons that switchover did not happen (one big factor being my reduced time spent doing python development). It can happen any time someone steps forward to guide it through the release process.
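
To see which policy is actually in effect when none is specified, one can inspect the parsed message. The sketch below assumes the message object exposes the policy it was created with through its `policy` attribute and that `email.policy.default` produces `EmailMessage` instances; the variable names are illustrative.

```python
import email
import email.message
import email.policy

text = "Subject: hello\n\nbody\n"

# No policy argument: the legacy compat32 policy is used.
implicit = email.message_from_string(text)
uses_compat32 = implicit.policy is email.policy.compat32

# Explicit policy=email.policy.default: the newer API is used.
explicit = email.message_from_string(text, policy=email.policy.default)
is_email_message = isinstance(explicit, email.message.EmailMessage)

print(uses_compat32, is_email_message)
```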

    9de9bbd3-5ab9-413d-b13c-609591df0d97 commented 2 years ago

    @hniksic: Thanks for your suggestions. I will look into the BytesFeedParser documentation. @david.murray: I can help with the switchover that makes default the default policy, and with updating the deprecation as well. It would be good if someone could guide me on this. Being a beginner, I am not sure if we are allowed to change the Python code.

    9de9bbd3-5ab9-413d-b13c-609591df0d97 commented 2 years ago

    The PR for the email parser doc update is: https://github.com/python/cpython/pull/31765

    Can someone review it, please?