[doc] email.header.Header.__unicode__ does not decode header

61a746e0-56f6-4f74-bbde-a98c5612db23 commented 11 years ago

BPO	17505
Nosy	@warsaw, @hniksic, @bitdancer, @JelleZijlstra, @Vidhyavinu
PRs	python/cpython#31765

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

```python
assignee = None
closed_at = None
created_at =
labels = ['easy', 'type-bug', 'expert-email', '3.10', '3.11', '3.9', 'docs']
title = '[doc] email.header.Header.__unicode__ does not decode header'
updated_at =
user = 'https://github.com/hniksic'
```

bugs.python.org fields:

```python
activity =
actor = 'vidhya'
assignee = 'docs@python'
closed = False
closed_date = None
closer = None
components = ['Documentation', 'email']
creation =
creator = 'hniksic'
dependencies = []
files = []
hgrepos = []
issue_num = 17505
keywords = ['patch', 'easy']
message_count = 12.0
messages = ['184856', '184894', '184896', '184897', '185028', '414228', '414234', '414273', '414508', '414530', '414542', '414758']
nosy_count = 6.0
nosy_names = ['barry', 'hniksic', 'r.david.murray', 'docs@python', 'JelleZijlstra', 'vidhya']
pr_nums = ['31765']
priority = 'normal'
resolution = None
stage = 'patch review'
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue17505'
versions = ['Python 3.9', 'Python 3.10', 'Python 3.11']
```

61a746e0-56f6-4f74-bbde-a98c5612db23 commented 11 years ago

The __unicode__ method is documented to "return the header as a Unicode string". For this to be useful, I would expect it to decode a string such as "=?gb2312?b?1eLKx9bQzsSy4srUo6E=?=" into a Unicode string that can be displayed to the user, in this case u'\u8fd9\u662f\u4e2d\u6587\u6d4b\u8bd5\uff01'.

However, unicode(header) returns the not-so-useful u"=?gb2312?b?1eLKx9bQzsSy4srUo6E=?=". Looking at the code of __unicode__, it appears that the code does attempt to decode the header into Unicode, but this fails for Headers initialized from a single MIME-quoted string, as is done by the message parser. In other words, __unicode__ is failing to call decode_header.

Here is a minimal example demonstrating the problem:

>>> msg = email.message_from_string('Subject: =?gb2312?b?1eLKx9bQzsSy4srUo6E=?=\n\nfoo\n')
>>> unicode(msg['subject'])
u'=?gb2312?b?1eLKx9bQzsSy4srUo6E=?='

Expected output of the last line: u'\u8fd9\u662f\u4e2d\u6587\u6d4b\u8bd5\uff01'

To get the fully decoded Unicode string, one must use something like:
>>> u''.join(unicode(s, c) for s, c in email.header.decode_header(msg['subject']))

which is unintuitive and hard to teach to new users of the email package. (And looking at the source of __unicode__, it's not even obvious that it's correct: it appears that a space must be added before us-ascii-coded chunks. The joining is non-trivial.)

The same problem occurs in Python 3.3 with str(msg['subject']).
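For readers on Python 3, here is a minimal sketch of the same situation, assuming the parser's default (compat32) policy. The helper name `join_decoded` is hypothetical and exists only for illustration; its plain concatenation does not try to reproduce the whitespace rules that Header applies between chunks.

```python
from email import message_from_string
from email.header import decode_header

raw = 'Subject: =?gb2312?b?1eLKx9bQzsSy4srUo6E=?=\n\nfoo\n'
msg = message_from_string(raw)  # no policy given, so compat32 is used


def join_decoded(value):
    """Hypothetical helper: decode each chunk and concatenate the results."""
    parts = []
    for chunk, charset in decode_header(value):
        # decode_header yields bytes for encoded words and may yield str
        # for a header that contains no encoded words at all.
        if isinstance(chunk, bytes):
            chunk = chunk.decode(charset or 'ascii', errors='replace')
        parts.append(chunk)
    return ''.join(parts)


undecoded = str(msg['subject'])          # still the raw encoded word
decoded = join_decoded(msg['subject'])   # the readable text
print(undecoded)
print(decoded)
```

The charset can be None for unencoded chunks, which is why the sketch falls back to ASCII rather than passing the charset straight through.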

61a746e0-56f6-4f74-bbde-a98c5612db23 commented 11 years ago

An example of the confusion that lack of a clear "convert to unicode" method creates is illustrated by this StackOverflow question: http://stackoverflow.com/q/15516958/1600898

bitdancer commented 11 years ago

I agree that this is not the world's best API. However, it is the API that we have in 2.7/3.2, and we can't change how Header.__unicode__ behaves without breaking backward compatibility.

What we could do is add an example of how to use this API to get unicode strings to the top of the docs:

   >>> unicode(make_header(decode_header('=?gb2312?b?1eLKx9bQzsSy4srUo6E=?=')))
   u'\u8fd9\u662f\u4e2d\u6587\u6d4b\u8bd5\uff01'

But you already know about that.
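For reference, a sketch of how the same idiom reads in Python 3, where str() takes the place of unicode(). It passes the encoded header value as a plain string, which is also what the compat32 parser hands back for a header like this one.

```python
from email.header import decode_header, make_header

encoded = '=?gb2312?b?1eLKx9bQzsSy4srUo6E=?='

# decode_header splits the value into (bytes, charset) chunks;
# make_header rebuilds a Header from them, and str() renders it as text.
decoded = str(make_header(decode_header(encoded)))
print(decoded)
```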

In Python 3.3 you get this:

   >>> msg = message_from_string("subject: =?gb2312?b?1eLKx9bQzsSy4srUo6E=?=\n\n", policy=default)
   >>> msg['subject']
   '这是中文测试！'

So, I'll make this a doc bug.

bitdancer commented 11 years ago

Erg, somehow I failed to read the second half of your message before writing mine...clearly you *didn't* know about that idiom, so the doc patch is obviously an important thing to do.

To clarify about the 3.3 example: the policy=default is key, it tells the email package to use the new (currently provisional) policy code to provide improved handling of header decoding and encoding.

61a746e0-56f6-4f74-bbde-a98c5612db23 commented 11 years ago

Thanks for pointing out the make_header(decode_header(...)) idiom, which I was indeed not aware of. It solves the problem perfectly.

I agree that it is a doc bug. While make_header is documented in the same place as decode_header and Header itself, it is not explained *why* I should call it if I already have in hand a perfectly valid Header instance. Specifically, it is not at all clear that unicode(h) and unicode(make_header(decode_header(h))) will return different things -- I would have expected make_header(decode_header(h)) to return an object indistinguishable from h.

Also, the policy=default parameter in Python 3 sounds great, it's exactly what one would expect.

9de9bbd3-5ab9-413d-b13c-609591df0d97 commented 2 years ago

[Entry level contributor seeking guidance] If this is still open, I would like to work on it.

Also, if no PR has been created yet, I am planning to add the following to the make_header entry at https://docs.python.org/3/library/email.header.html :

To get unicode strings use the API as shown below: unicode(make_header(decode_header('=?gb2312?b?1eLKx9bQzsSy4srUo6E=?=')))

If the email policy parameter is set to 'policy.default', then the default policy for that Python version is used for header encoding and decoding.

Please correct me if anything is wrong.

JelleZijlstra commented 2 years ago

The messages above are very old and seem to be discussing Python 2. There is no __unicode__ method any more, for example, though there is a __str__ method which presumably does what __unicode__ used to do. It is documented now at https://docs.python.org/3.10/library/email.header.html#email.header.Header.__str__ . You'll have to do some more digging to figure out whether the OP's complaint still applies.

9de9bbd3-5ab9-413d-b13c-609591df0d97 commented 2 years ago

The latest versions (3.9, 3.10 and 3.11) are listed on the issue, so I thought it still applies.

@irit: Any suggestions on what needs to be done for current revisions?

61a746e0-56f6-4f74-bbde-a98c5612db23 commented 2 years ago

> Any suggestions on what needs to be done for current revisions?

Hi! I'm the person who submitted this issue back in 2013. Let's take a look at how things are in Python 3.10:

Python 3.10.2 (main, Jan 13 2022, 19:06:22) [GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import email
>>> msg = email.message_from_string('Subject: =?gb2312?b?1eLKx9bQzsSy4srUo6E=?=\n\nfoo\n')
>>> msg['Subject']
'=?gb2312?b?1eLKx9bQzsSy4srUo6E=?='

So the headers are still not decoded by default. The unicode() invocation in the original description was just an attempt to get a Unicode string out of a byte string (assuming it was correctly decoded from MIME, which it wasn't). Since Python 3 strings are Unicode already, I'd expect to just get the decoded subject - but that still doesn't happen.

The correct way to make it happen is to specify policy=email.policy.default:

>>> msg = email.message_from_string('Subject: =?gb2312?b?1eLKx9bQzsSy4srUo6E=?=\n\nfoo\n', policy=email.policy.default)
>>> msg['Subject']
'这是中文测试！'

The docs should point out that you really want to specify the "default" policy (strangely named, since it's not in fact the default). The current docs only say that message_from_string() is "equivalent to Parser().parsestr(s)." and that policy is interpreted "as with the Parser class constructor". The docs of the Parser constructor don't document policy at all, except for the version when it was added.

So, if you want to work on this, my suggestion would be to improve the docs in the following ways:

- in message_from_string() docs, explain that policy=email.policy.default is what you want to pass to get the headers decoded.
- in Parser docs, explain what the _class and policy arguments do in the constructor, which policies are possible, etc. (These things seem to be explained in the BytesFeedParser docs, so you might want to just link to that, or include a shortened version.)
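As a rough illustration of what such doc examples might show, here is a sketch that parses the same message three ways. It is not text from the proposed patch, and the variable names are chosen only for this example.

```python
from email import message_from_string, policy
from email.parser import Parser

raw = 'Subject: =?gb2312?b?1eLKx9bQzsSy4srUo6E=?=\n\nfoo\n'

# No policy argument: the legacy compat32 behaviour, header left encoded.
legacy = message_from_string(raw)

# Explicit policy=policy.default: header values are decoded on access.
modern = message_from_string(raw, policy=policy.default)

# The Parser constructor accepts the same keyword argument.
via_parser = Parser(policy=policy.default).parsestr(raw)

print(legacy['Subject'])
print(modern['Subject'])
print(via_parser['Subject'])
```

Because message_from_string() forwards its extra arguments to Parser, the last two calls should behave identically.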

bitdancer commented 2 years ago

The policy is named 'default' because it was intended to become the default two feature releases after the new email code became non-provisional (first: deprecate not specifying an explicit policy, next release make default the default policy and make the deprecation only cover compat32). However, for various reasons that switchover did not happen (one big factor being my reduced time spent doing python development). It can happen any time someone steps forward to guide it through the release process.
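To see which policy is actually in effect today, one can inspect the parsed message. This is a small sketch, assuming one of the Python versions listed on this issue (3.9 to 3.11); the sample message is made up for the example.

```python
from email import message_from_string, policy

raw = 'Subject: test\n\nbody\n'

implicit = message_from_string(raw)                         # no policy given
explicit = message_from_string(raw, policy=policy.default)  # opt in

# Without an explicit policy, the parser still falls back to compat32.
uses_compat32 = implicit.policy is policy.compat32
print(uses_compat32)

# The policy also selects the message class that the parser creates.
print(type(implicit).__name__, type(explicit).__name__)
```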

9de9bbd3-5ab9-413d-b13c-609591df0d97 commented 2 years ago

@hniksic: Thanks for your suggestions. I will look into the BytesFeedParser documentation. @david.murray: I can help with switching the default over to the default policy and updating the deprecation as well. It would be good if someone could guide me on this. Being a beginner, I am not sure whether we are allowed to change the Python code.

9de9bbd3-5ab9-413d-b13c-609591df0d97 commented 2 years ago

The PR for the email parser doc update is: https://github.com/python/cpython/pull/31765

Could someone please review it?

python / cpython

[doc] email.header.Header.__unicode__ does not decode header #61707