Open 61a746e0-56f6-4f74-bbde-a98c5612db23 opened 11 years ago
The __unicode__ method is documented to "return the header as a Unicode string". For this to be useful, I would expect it to decode a string such as "=?gb2312?b?1eLKx9bQzsSy4srUo6E=?=" into a Unicode string that can be displayed to the user, in this case u'\u8fd9\u662f\u4e2d\u6587\u6d4b\u8bd5\uff01'.
However, unicode(header) returns the not so useful u"=?gb2312?b?1eLKx9bQzsSy4srUo6E=?=". Looking at the code of __unicode__, it appears that the code does attempt to decode the header into Unicode, but this fails for Headers initialized from a single MIME-quoted string, as is done by the message parser. In other words, __unicode__ is failing to call decode_header.
Here is a minimal example demonstrating the problem:
>>> import email
>>> msg = email.message_from_string('Subject: =?gb2312?b?1eLKx9bQzsSy4srUo6E=?=\n\nfoo\n')
>>> unicode(msg['subject'])
u'=?gb2312?b?1eLKx9bQzsSy4srUo6E=?='
Expected output of the last line: u'\u8fd9\u662f\u4e2d\u6587\u6d4b\u8bd5\uff01'
To get the fully decoded Unicode string, one must use something like:
>>> u''.join(unicode(s, c) for s, c in email.header.decode_header(msg['subject']))
which is unintuitive and hard to teach to new users of the email package. (And looking at the source of __unicode__ it's not even obvious that it's correct — it appears that a space must be added before us-ascii-coded chunks. The joining is non-trivial.)
The same problem occurs in Python 3.3 with str(msg['subject']).
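For readers on Python 3, the manual join above might look roughly like the sketch below. The helper name join_decoded is made up for illustration. The sketch assumes decode_header() may hand back either bytes or str chunks, and it deliberately ignores the whitespace rules between chunks, which is exactly the non-trivial part.

```python
from email.header import decode_header


def join_decoded(raw):
    """Naively join the chunks returned by decode_header() into one str.

    Hypothetical helper, not part of the email package. It does not
    insert the separating spaces that Header.__str__ adds between
    encoded and unencoded chunks.
    """
    parts = []
    for chunk, charset in decode_header(raw):
        if isinstance(chunk, bytes):
            # Chunks without a declared charset are treated as ASCII here
            # (an assumption of this sketch).
            chunk = chunk.decode(charset or "ascii", errors="replace")
        parts.append(chunk)
    return "".join(parts)


subject = "=?gb2312?b?1eLKx9bQzsSy4srUo6E=?="
print(join_decoded(subject))
```

A header with no encoded words comes back from decode_header() as a single str chunk, so the isinstance check is what keeps the helper from calling .decode() on a str.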
An example of the confusion that lack of a clear "convert to unicode" method creates is illustrated by this StackOverflow question: http://stackoverflow.com/q/15516958/1600898
I agree that this is not the world's best API. However, it is the API that we have in 2.7/3.2, and we can't change how Header.__unicode__ behaves without breaking backward compatibility.
What we could do is add an example of how to use this API to get unicode strings to the top of the docs:
>>> from email.header import decode_header, make_header
>>> unicode(make_header(decode_header('=?gb2312?b?1eLKx9bQzsSy4srUo6E=?=')))
u'\u8fd9\u662f\u4e2d\u6587\u6d4b\u8bd5\uff01'
But you already know about that.
In Python 3.3 you get this:
>>> from email import message_from_string
>>> from email.policy import default
>>> msg = message_from_string("subject: =?gb2312?b?1eLKx9bQzsSy4srUo6E=?=\n\n", policy=default)
>>> msg['subject']
'这是中文测试！'
So, I'll make this a doc bug.
Erg, somehow I failed to read the second half of your message before writing mine...clearly you *didn't* know about that idiom, so the doc patch is obviously an important thing to do.
To clarify about the 3.3 example: the policy=default is key. It tells the email package to use the new (currently provisional) policy code to provide improved handling of header decoding and encoding.
Thanks for pointing out the make_header(decode_header(...)) idiom, which I was indeed not aware of. It solves the problem perfectly.
I agree that it is a doc bug. While make_header is documented in the same place as decode_header and Header itself, it is not explained *why* I should call it if I already have in hand a perfectly valid Header instance. Specifically, it is not at all clear that unicode(h) and unicode(make_header(decode_header(h))) will return different things -- I would have expected make_header(decode_header(h)) to return an object indistinguishable from h.
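A small Python 3 sketch of that difference, with str() standing in for unicode(). It only uses Header, decode_header and make_header from email.header. The variable names are illustrative.

```python
from email.header import Header, decode_header, make_header

raw = "=?gb2312?b?1eLKx9bQzsSy4srUo6E=?="

# A Header built directly from the encoded word stores it as a plain
# us-ascii chunk, so str() hands the same text back.
as_is = str(Header(raw))

# decode_header() splits the value into (bytes, charset) chunks, and
# make_header() rebuilds a Header from those chunks, so str() now
# yields the decoded text.
decoded = str(make_header(decode_header(raw)))

print(as_is)
print(decoded)
```

So the two Header objects are not interchangeable: the first wraps the encoded word as literal ASCII text, while the second carries the decoded characters together with their charset.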
Also, the policy=default parameter in Python 3 sounds great, it's exactly what one would expect.
[Entry level contributor seeking guidance] If this is still open, I would like to work on this.
Also, if no PR has been created yet, I am planning to add the following to the make_header entry at https://docs.python.org/3/library/email.header.html :
To get Unicode strings, use the API as shown below: unicode(make_header(decode_header('=?gb2312?b?1eLKx9bQzsSy4srUo6E=?=')))
If the email policy parameter is set to 'policy.default', then the default policy for that Python version is used for header encoding and decoding.
Please correct me if anything is wrong.
The messages above are very old and seem to be discussing Python 2. There is no __unicode__ method any more, for example, though there is a __str__ method which presumably does what __unicode__ used to do. It is documented now at https://docs.python.org/3.10/library/email.header.html#email.header.Header.__str__ . You'll have to do some more digging to figure out whether the OP's complaint still applies.
The issue's version field has been updated to list 3.9, 3.10 and 3.11, so I assumed it still applies.
@irit: Any suggestions on what needs to be done for current revisions?
> Any suggestions on what needs to be done for current revisions?
Hi! I'm the person who submitted this issue back in 2013. Let's take a look at how things are in Python 3.10:
Python 3.10.2 (main, Jan 13 2022, 19:06:22) [GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import email
>>> msg = email.message_from_string('Subject: =?gb2312?b?1eLKx9bQzsSy4srUo6E=?=\n\nfoo\n')
>>> msg['Subject']
'=?gb2312?b?1eLKx9bQzsSy4srUo6E=?='
So the headers are still not decoded by default. The unicode() invocation in the original description was just an attempt to get a Unicode string out of a byte string (assuming it was correctly decoded from MIME, which it wasn't). Since Python 3 strings are Unicode already, I'd expect to just get the decoded subject - but that still doesn't happen.
The correct way to make it happen is to specify policy=email.policy.default:
>>> msg = email.message_from_string('Subject: =?gb2312?b?1eLKx9bQzsSy4srUo6E=?=\n\nfoo\n', policy=email.policy.default)
>>> msg['Subject']
'这是中文测试！'
The docs should point out that you really want to specify the "default" policy (strangely named, since it's not in fact the default). The current docs only say that message_from_string() is "equivalent to Parser().parsestr(s)." and that policy is interpreted "as with the Parser class constructor". The docs of the Parser constructor don't document policy at all, except for the version when it was added.
So, if you want to work on this, my suggestion would be to improve the docs in the following ways:
- In the message_from_string() docs, explain that policy=email.policy.default is what you want to pass to get the headers decoded.
- In the Parser docs, explain what the _class and policy arguments do in the constructor, which policies are possible, etc. (These things seem to be explained in the BytesFeedParser docs, so you might want to just link to that, or include a shortened version.)
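As a possible starting point for such a docs example, here is a rough sketch comparing a Parser built without arguments to one built with policy=email.policy.default. The variable names are illustrative, and the sketch assumes the no-argument Parser keeps using the compat32 policy.

```python
from email import policy
from email.parser import Parser

source = "Subject: =?gb2312?b?1eLKx9bQzsSy4srUo6E=?=\n\nfoo\n"

# No policy argument: the compat32 policy applies and the header value
# is returned still MIME-encoded.
legacy_msg = Parser().parsestr(source)

# Explicit policy.default: header values are decoded when accessed.
modern_msg = Parser(policy=policy.default).parsestr(source)

print(repr(legacy_msg["Subject"]))
print(modern_msg["Subject"])
```

The same policy argument can be passed to email.message_from_string(), as in the session above.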
The policy is named 'default' because it was intended to become the default two feature releases after the new email code became non-provisional (first: deprecate not specifying an explicit policy, next release make default the default policy and make the deprecation only cover compat32). However, for various reasons that switchover did not happen (one big factor being my reduced time spent doing python development). It can happen any time someone steps forward to guide it through the release process.
@hniksic: Thanks for your suggestions. I will look into the BytesFeedParser documentation. @david.murray: I can help with switching the default over to the default policy and updating the deprecation as well. It would be good if someone could guide me on this. Being a beginner, I am not sure if we are allowed to change the Python code.
The PR for the email parser doc update is: https://github.com/python/cpython/pull/31765
Can someone please review it?
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = None
closed_at = None
created_at =
labels = ['easy', 'type-bug', 'expert-email', '3.10', '3.11', '3.9', 'docs']
title = '[doc] email.header.Header.__unicode__ does not decode header'
updated_at =
user = 'https://github.com/hniksic'
```
bugs.python.org fields:
```python
activity =
actor = 'vidhya'
assignee = 'docs@python'
closed = False
closed_date = None
closer = None
components = ['Documentation', 'email']
creation =
creator = 'hniksic'
dependencies = []
files = []
hgrepos = []
issue_num = 17505
keywords = ['patch', 'easy']
message_count = 12.0
messages = ['184856', '184894', '184896', '184897', '185028', '414228', '414234', '414273', '414508', '414530', '414542', '414758']
nosy_count = 6.0
nosy_names = ['barry', 'hniksic', 'r.david.murray', 'docs@python', 'JelleZijlstra', 'vidhya']
pr_nums = ['31765']
priority = 'normal'
resolution = None
stage = 'patch review'
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue17505'
versions = ['Python 3.9', 'Python 3.10', 'Python 3.11']
```