Header.decode_header eats up spaces

328bcddf-bb5b-4172-b1e5-0bb62e5cdf15 commented 18 years ago

BPO	1467619
Nosy	@warsaw, @birkenfeld, @bitdancer
Superseder	bpo-1079: decode_header does not follow RFC 2047
Files	emailheader.diff

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

```python
assignee = None
closed_at =
created_at =
labels = ['type-bug', 'expert-email']
title = 'Header.decode_header eats up spaces'
updated_at =
user = 'https://bugs.python.org/mgoutell'
```

bugs.python.org fields:

```python
activity =
actor = 'r.david.murray'
assignee = 'none'
closed = True
closed_date =
closer = 'r.david.murray'
components = ['email']
creation =
creator = 'mgoutell'
dependencies = []
files = ['1963']
hgrepos = []
issue_num = 1467619
keywords = []
message_count = 10.0
messages = ['28181', '28182', '28183', '28184', '28185', '114651', '114722', '150459', '150470', '162219']
nosy_count = 7.0
nosy_names = ['barry', 'georg.brandl', 'alexanderweb', 'mgoutell', 'r.david.murray', 'BreamoreBoy', 'runtux']
pr_nums = []
priority = 'normal'
resolution = 'duplicate'
stage = 'resolved'
status = 'closed'
superseder = '1079'
type = 'behavior'
url = 'https://bugs.python.org/issue1467619'
versions = ['Python 3.3']
```

328bcddf-bb5b-4172-b1e5-0bb62e5cdf15 commented 18 years ago

The Header.decode_header function eats up spaces in the non-encoded part of a header.

See the following source:

    # -*- coding: iso-8859-1 -*-
    from email.Header import Header, decode_header
    h = Header('Essai ', None)
    h.append('éè', 'iso-8859-1')
    print h
    print decode_header(h)

This prints:

    Essai =?iso-8859-1?q?=E9=E8?=
    [('Essai', None), ('\xe9\xe8', 'iso-8859-1')]

This should print:

    Essai =?iso-8859-1?q?=E9=E8?=
    [('Essai ', None), ('\xe9\xe8', 'iso-8859-1')]
            ^ This space disappears

This appears in Python 2.3, but the source code of the function didn't change in 2.4, so the same problem should still exist there. Bug "[ 1372770 ] email.Header should preserve original FWS" may be related to this one, although I'm not sure it is exactly the same.

This patch (not extensively tested though) seems to solve this problem:

--- /usr/lib/python2.3/email/Header.py  2005-09-05 00:20:03.000000000 +0200
+++ Header.py   2006-04-10 12:27:27.000000000 +0200
@@ -90,7 +90,7 @@
             continue
         parts = ecre.split(line)
         while parts:
-            unenc = parts.pop(0).strip()
+            unenc = parts.pop(0).rstrip()
             if unenc:
                 # Should we continue a long line?
                 if decoded and decoded[-1][1] is None:
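
For readers on a current interpreter, the report above can be re-checked with a minimal Python 3 sketch. It assumes Python 3.3 or later (where, per the discussion further down, the behavior changed) and passes the already-encoded header text as a plain string instead of a `Header` object; the variable names are illustrative only.

```python
from email.header import decode_header

# Encoded form of the header from the report: plain text, one space,
# then an ISO-8859-1 encoded-word (assumed input, taken from the output
# shown in the report).
raw = "Essai =?iso-8859-1?q?=E9=E8?="

parts = decode_header(raw)
for data, charset in parts:
    # In Python 3 each chunk is bytes when the header contains encoded-words.
    print(repr(data), charset)

# The question from the report: does the unencoded chunk keep its
# trailing space?
first_chunk = parts[0][0]
space_preserved = first_chunk.endswith(b" ")
print("trailing space preserved:", space_preserved)
```

On the Python 2 versions discussed in this thread the equivalent check would be expected to report the space as lost.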

b1109610-9b2b-436d-85e2-063ddd2d663f commented 18 years ago

I can confirm this bug and have been bitten by it as well.

328bcddf-bb5b-4172-b1e5-0bb62e5cdf15 commented 17 years ago

Hello, any news about this bug? It still seems to be there in 2.5, a year after it was reported. Regards,

birkenfeld commented 17 years ago

I propose the attached patch. RFC 2047 says to ignore whitespace between adjacent encoded-words, but IMHO not between ordinary text and encoded-words. File Added: emailheader.diff

warsaw commented 17 years ago

IIRC, I tried the OP's patch and it broke too many tests in the email package's test suite. I made an attempt at fixing the problem in a much more RFC-compliant way, but couldn't get the test suite to pass completely. This points to a much deeper problem with the email package's header management. I don't think the problem is a bug; I think it's a design flaw.

83d2e70e-e599-4a04-b820-3814bbdb9bef commented 14 years ago

Would someone like to comment on Georg's patch?

bitdancer commented 14 years ago

Georg's patch no longer applies to py3k. I ported it, but the result is not functional. It causes extra spaces during header generation, because it is there that email4/5 "deals" with "ignoring" spaces between encoded words by *adding* spaces when they are adjacent to non-encoded words. (In general email4/5 compresses runs of whitespace into single spaces.) I tried fixing that, but then ran into the fact that header parsing/generation currently depends on the whitespace compression in order to handle the header folding cases. So, the logic used for header parsing and generation in email5 does not allow for an easy patch to fix this bug. I'm deferring it to email6, where I am rewriting the header parser/generator.

045f8f0d-9bfc-4a14-9319-6b326ddadb21 commented 12 years ago

I've been bitten by this too (in Python versions up to 2.7, in Roundup, the bug tracker). We're currently using a workaround that re-inserts spaces; see the method _decode_header_to_utf8 in the file mailgw.py in the git repository on roundup.sourceforge.net.

RFC 2047 even has a test case at the end. It specifies:

    encoded form              displayed as
    (=?ISO-8859-1?Q?a?= b)    (a b)

Note the space between 'a' and 'b' above. Spaces between non-encoded and encoded parts should be preserved. And it's probably a good idea to put the examples from the RFC into the regression tests.
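
The suggestion above could be sketched roughly as follows. This is not the test that was added to CPython: `decode_to_text` and `RFC_EXAMPLES` are hypothetical names used here for illustration, and the pairs are a subset of the examples in section 8 of RFC 2047. The sketch assumes a Python 3 version that already contains the fix.

```python
from email.header import decode_header


def decode_to_text(raw):
    """Join the chunks returned by decode_header into one str.

    Hypothetical helper: unencoded chunks are decoded as ASCII.
    """
    pieces = []
    for data, charset in decode_header(raw):
        if isinstance(data, bytes):
            pieces.append(data.decode(charset or "ascii"))
        else:
            # Headers without any encoded-word come back as str.
            pieces.append(data)
    return "".join(pieces)


# (encoded form, displayed as) pairs from RFC 2047, section 8.
RFC_EXAMPLES = [
    ("(=?ISO-8859-1?Q?a?=)", "(a)"),
    ("(=?ISO-8859-1?Q?a?= b)", "(a b)"),
    ("(=?ISO-8859-1?Q?a?= =?ISO-8859-1?Q?b?=)", "(ab)"),
    ("(=?ISO-8859-1?Q?a_b?=)", "(a b)"),
]

for encoded, displayed in RFC_EXAMPLES:
    result = decode_to_text(encoded)
    status = "ok" if result == displayed else "MISMATCH"
    print(f"{encoded!r} -> {result!r} ({status})")
```

The second and third pairs are the ones relevant to this issue: whitespace next to ordinary text is kept, while whitespace between two adjacent encoded-words is dropped.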

bitdancer commented 12 years ago

Antoine, I marked this for Python 3.3 only because there is no good way to fix it in 2.7/3.2. (If someone comes up with a way I'll be happy to review it, though!)

bitdancer commented 12 years ago

This is fixed by the fix in bpo-1079. Ralf found a *relatively* backward compatible way to fix it, but since the point is preserving whitespace that wasn't preserved before, there is an unavoidable behavior change, so it can't be backported.

python / cpython

Header.decode_header eats up spaces #43184