Closed msabramo closed 8 years ago
Here is what gets received, just before parsing:
> /Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/http/client.py(272)parse_headers()-><http.client....t 0x105e0f6a0>
-> return email.parser.Parser(_class=_class).parsestr(hstring)
(Pdb) hstring
'Host: 127.0.0.1:63531\r\nUser-Agent: HTTPie/0.9.0-dev\r\nAccept: */*\r\nTest: [one line of
UTF8-encoded unicode text] Ï\x87Ï\x81Ï\x85Ï\x83αÏ\x86ὶ 太é\x99½ à¹\x80ลิศ â\x99\x9câ
\x99\x9eâ\x99\x9dâ\x99\x9bâ\x99\x9aâ\x99\x9dâ\x99\x9eâ\x99\x9c оживлÑ\x91ннÑ
\x8bм तानà¥\x8dयहानि æ\x9c\x89æ\x9c\x8b ஸà¯\x8dà®±à¯
\x80னிவாஸ Ù±Ù\x84رÙ\x8eÙ\x91Ø\xadÙ\x92Ù\x85Ù\x80Ù\x8eبÙ\x86Ù
\x90\r\nAccept-Encoding: gzip, deflate\r\nConnection: keep-alive\r\nAuthorization: Basic
dGVzdDpbb25lIGxpbmUgb2YgVVRGOC1lbmNvZGVkIHVuaWNvZGUgdGV4dF0gz4fPgc+Fz4POsc+
G4b22IOWkqumZvSDguYDguKXguLTguKgg4pmc4pme4pmd4pmb4pma4pmd4pme4pmcINC+0LbQu
NCy0LvRkdC90L3Ri9C8IOCkpOCkvuCkqOCljeCkr+CkueCkvuCkqOCkvyDmnInmnIsg4K644K+N4K6x
4K+A4K6p4K6/4K614K6+4K64INmx2YTYsdmO2ZHYrdmS2YXZgNmO2KjZhtmQ\r\n\r\n'
(Pdb) hstring.encode('unicode_escape')
b'Host: 127.0.0.1:63531\\r\\nUser-Agent: HTTPie/0.9.0-dev\\r\\nAccept: */*\\r\\nTest: [one line of
UTF8-encoded unicode text] \\xcf\\x87\\xcf\\x81\\xcf\\x85\\xcf\\x83\\xce\\xb1\\xcf\\x86\\xe1\\xbd\\xb6
\\xe5\\xa4\\xaa\\xe9\\x99\\xbd \\xe0\\xb9\\x80\\xe0\\xb8\\xa5\\xe0\\xb8\\xb4\\xe0\\xb8\\xa8
\\xe2\\x99\\x9c\\xe2\\x99\\x9e\\xe2\\x99\\x9d\\xe2\\x99\\x9b\\xe2\\x99\\x9a\\xe2\\x99\\x9d\\xe2\\x99
\\x9e\\xe2\\x99\\x9c \\xd0\\xbe\\xd0\\xb6\\xd0\\xb8\\xd0\\xb2\\xd0\\xbb\\xd1\\x91\\xd0\\xbd\\xd0\\xbd\\xd1\\x8b\\xd0\\xbc \\xe0\\xa4\\xa4\\xe0\\xa4\\xbe\\xe0\\xa4\\xa8\\xe0\\xa5\\x8d\\xe0\\xa4\\xaf\\xe0\\xa4\\xb9\\xe0\\xa4\\xbe\\xe0\\xa4\\xa8\\xe0\\xa4\\xbf \\xe6\\x9c\\x89\\xe6\\x9c\\x8b \\xe0\\xae\\xb8\\xe0\\xaf\\x8d\\xe0\\xae\\xb1\\xe0\\xaf\\x80\\xe0\\xae\\xa9\\xe0\\xae\\xbf\\xe0\\xae\\xb5\\xe0\\xae\\xbe\\xe0\\xae\\xb8 \\xd9\\xb1\\xd9\\x84\\xd8\\xb1\\xd9\\x8e\\xd9\\x91\\xd8\\xad\\xd9\\x92\\xd9\\x85\\xd9\\x80\\xd9\\x8e\\xd8\\xa8\\xd9\\x86\\xd9\\x90\\r\\nAccept-Encoding: gzip, deflate\\r\\nConnection: keep-alive\\r\\nAuthorization: Basic dGVzdDpbb25lIGxpbmUgb2YgVVRGOC1lbmNvZGVkIHVuaWNvZGUgdGV4dF0gz4fPgc+Fz4POsc+G
4b22IOWkqumZvSDguYDguKXguLTguKgg4pmc4pme4pmd4pmb4pma4pmd4pme4pmcINC+0LbQuN
Cy0LvRkdC90L3Ri9C8IOCkpOCkvuCkqOCljeCkr+CkueCkvuCkqOCkvyDmnInmnIsg4K644K+N4K6x4
K+A4K6p4K6/4K614K6+4K64INmx2YTYsdmO2ZHYrdmS2YXZgNmO2KjZhtmQ\\r\\n\\r\\n'
At a glance it doesn't look like it's RFC 2047. It looks like it's straight UTF-8:
In [25]: b'Test: [one line of UTF8-encoded unicode text] \xcf\x87\xcf\x81\xcf\x85\xcf\x83\xce\xb1\xcf\x86\xe1\xbd\xb6 \xe5\xa4\xaa\xe9\x99\xbd \xe0\xb9\x80\xe0\xb8\xa5\xe0\xb8\xb4\xe0\xb8\xa8 \xe2\x99\x9c\xe2\x99\x9e\xe2\x99\x9d\xe2\x99\x9b\xe2\x99\x9a\xe2\x99\x9d\xe2\x99\x9e\xe2\x99\x9c \xd0\xbe\xd0\xb6\xd0\xb8\xd0\xb2\xd0\xbb\xd1\x91\xd0\xbd\xd0\xbd\xd1\x8b\xd0\xbc \xe0\xa4\xa4\xe0\xa4\xbe\xe0\xa4\xa8\xe0\xa5\x8d\xe0\xa4\xaf\xe0\xa4\xb9\xe0\xa4\xbe\xe0\xa4\xa8\xe0\xa4\xbf \xe6\x9c\x89\xe6\x9c\x8b \xe0\xae\xb8\xe0\xaf\x8d\xe0\xae\xb1\xe0\xaf\x80\xe0\xae\xa9\xe0\xae\xbf\xe0\xae\xb5\xe0\xae\xbe\xe0\xae\xb8 \xd9\xb1\xd9\x84\xd8\xb1\xd9\x8e\xd9\x91\xd8\xad\xd9\x92\xd9\x85\xd9\x80\xd9\x8e\xd8\xa8\xd9\x86\xd9\x90'.decode('utf-8')
Out[25]: 'Test: [one line of UTF8-encoded unicode text] χρυσαφὶ 太陽 เลิศ ♜♞♝♛♚♝♞♜ оживлённым तान्यहानि 有朋 ஸ்றீனிவாஸ ٱلرَّحْمـَبنِ'
That seems incorrect.
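For what it's worth, the mojibake in hstring is consistent with UTF-8 bytes having been decoded as ISO-8859-1, which is what I believe http.client does with raw header lines before handing them to the email parser. A minimal sketch of that round trip, using a short sample string of my own rather than the full header from the session above:

```python
# Sample text is an assumption for illustration; any non-ASCII string works.
text = "χρυσ 太陽"

# What goes over the wire when the client encodes the header value as UTF-8.
wire_bytes = text.encode("utf-8")

# What a parser sees if it decodes those bytes as ISO-8859-1 (latin-1).
seen_by_parser = wire_bytes.decode("iso-8859-1")
print(repr(seen_by_parser))

# The damage is reversible: re-encode as latin-1, then decode as UTF-8.
recovered = seen_by_parser.encode("iso-8859-1").decode("utf-8")
print(recovered == text)
```

So the data itself is not lost at this stage; it is only being viewed through the wrong codec.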
Reproducing the core problem very simply in an IPython session:
In [44]: import email.parser, http.client
In [45]: hstring = 'Host: 127.0.0.1:63531\r\nUser-Agent: HTTPie/0.9.0-dev\r\nAccept: */*\r\nTest: [one line of UTF8-encoded unicode text] Ï\x87Ï\x81Ï\x85Ï\x83αÏ\x86ὶ 太é\x99½ à¹\x80ลิศ â\x99\x9câ \x99\x9eâ\x99\x9dâ\x99\x9bâ\x99\x9aâ\x99\x9dâ\x99\x9eâ\x99\x9c оживлÑ\x91ннÑ\x8bм तानà¥\x8dयहानि æ\x9c\x89æ\x9c\x8b ஸà¯\x8dà®±à¯\x80னிவாஸ Ù±Ù\x84رÙ\x8eÙ\x91Ø\xadÙ\x92Ù\x85Ù\x80Ù\x8eبÙ\x86Ù\x90\r\nAccept-Encoding: gzip, deflate\r\nConnection: keep-alive\r\nAuthorization: Basic dGVzdDpbb25lIGxpbmUgb2YgVVRGOC1lbmNvZGVkIHVuaWNvZGUgdGV4dF0gz4fPgc+Fz4POsc+G4b22IOWkqumZvSDguYDguKXguLTguKgg4pmc4pme4pmd4pmb4pma4pmd4pme4pmcINC+0LbQuNCy0LvRkdC90L3Ri9C8IOCkpOCkvuCkqOCljeCkr+CkueCkvuCkqOCkvyDmnInmnIsg4K644K+N4K6x4K+A4K6p4K6/4K614K6+4K64INmx2YTYsdmO2ZHYrdmS2YXZgNmO2KjZhtmQ\r\n\r\n'
In [46]: hm = email.parser.Parser(_class=http.client.HTTPMessage).parsestr(hstring)
In [47]: str(hm)
Out[47]: 'Host: 127.0.0.1:63531\nUser-Agent: HTTPie/0.9.0-dev\nAccept: */*\nTest: =?utf-8?b?W29uZSBsaW5lIG9mIFVURjgtZW5jb2RlZCB1bmljb2RlIHRleHRdIMOPwofDj8KBw48=?=\n\nÏ\x83αÏ\x86ὶ 太é\x99½ à¹\x80ลิศ â\x99\x9câ \x99\x9eâ\x99\x9dâ\x99\x9bâ\x99\x9aâ\x99\x9dâ\x99\x9eâ\x99\x9c оживлÑ\x91ннÑ\x8bм तानà¥\x8dयहानि æ\x9c\x89æ\x9c\x8b ஸà¯\x8dà®±à¯\x80னிவாஸ Ù±Ù\x84رÙ\x8eÙ\x91Ø\xadÙ\x92Ù\x85\nÙ\x80Ù\x8eبÙ\x86Ù\x90\nAccept-Encoding: gzip, deflate\nConnection: keep-alive\nAuthorization: Basic dGVzdDpbb25lIGxpbmUgb2YgVVRGOC1lbmNvZGVkIHVuaWNvZGUgdGV4dF0gz4fPgc+Fz4POsc+G4b22IOWkqumZvSDguYDguKXguLTguKgg4pmc4pme4pmd4pmb4pma4pmd4pme4pmcINC+0LbQuNCy0LvRkdC90L3Ri9C8IOCkpOCkvuCkqOCljeCkr+CkueCkvuCkqOCkvyDmnInmnIsg4K644K+N4K6x4K+A4K6p4K6/4K614K6+4K64INmx2YTYsdmO2ZHYrdmS2YXZgNmO2KjZhtmQ\n\n'
In [48]: hm.items()
Out[48]:
[('Host', '127.0.0.1:63531'),
('User-Agent', 'HTTPie/0.9.0-dev'),
('Accept', '*/*'),
('Test', '[one line of UTF8-encoded unicode text] Ï\x87Ï\x81Ï\x85')]
In [49]: hm.defects
Out[49]: [email.errors.MissingHeaderBodySeparatorDefect()]
Perhaps most interesting is that midway through the value of str(hm), in the middle of the value for the Test header, there is a double newline -- \n\n. I could imagine this could cause the parser to choke.
In [82]: str(hm)[146:151]
Out[82]: '=?=\n\n'
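One possible explanation for where that break comes from, which nobody in this thread has confirmed: the truncated Test value ends exactly at \x85, and str.splitlines() treats \x85 (NEL, "next line") as a line boundary, whereas bytes.splitlines() does not. If the email feed parser splits its input with str.splitlines(), the header would be cut at that character. The sketch below only demonstrates the splitlines behavior; the link to feedparser is my assumption.

```python
# Fragment of the latin-1-decoded Test header value from the session above.
fragment = "text] Ï\x87Ï\x81Ï\x85Ï\x83"

# str.splitlines() breaks on \x85 as well as on \r and \n.
str_parts = fragment.splitlines(keepends=True)
print(str_parts)

# bytes.splitlines() only breaks on \r and \n, so the same data stays intact.
bytes_parts = fragment.encode("iso-8859-1").splitlines(keepends=True)
print(bytes_parts)
```

Note that the first piece ends with \x85, just like the truncated value reported by hm.items().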
Strangely, if I manually construct the header, things seem to work better:
In [63]: hm2 = http.client.HTTPMessage()
In [64]: hm2.add_header('Test', '[one line of UTF8-encoded unicode text] Ï\x87Ï\x81Ï\x85Ï\x83αÏ\x86ὶ 太é\x99½ à¹\x80ลิศ â\x99\x9câ \x99\x9eâ\x99\x9dâ\x99\x9bâ\x99\x9aâ\x99\x9dâ\x99\x9eâ\x99\x9c оживлÑ\x91ннÑ\x8bм तानà¥\x8dयहानि æ\x9c\x89æ\x9c\x8b ஸà¯\x8dà®±à¯\x80னிவாஸ Ù±Ù\x84رÙ\x8eÙ\x91Ø\xadÙ\x92Ù\x85Ù\x80Ù\x8eبÙ\x86Ù\x90')
In [65]: str(hm2)
Out[65]: 'Test: =?utf-8?b?W29uZSBsaW5lIG9mIFVURjgtZW5jb2RlZCB1bmljb2RlIHRleHRdIMOPwofDj8KBw48=?=\n =?utf-8?b?IMOPwoPDjsKxw4/ChsOhwr3CtiDDpcKkwqrDqcKZwr0gw6DCucKAw6DCuMKlw6DCuMK0w6DCuMKoIMOiwpnCnMOiIMKZwp7DosKZwp3DosKZwpvDosKZwprDosKZwp3DosKZwp7DosKZwpwgw5DCvsOQwrbDkMK4w5DCssOQwrvDkcKRw5DCvcOQwr3DkcKLw5DCvCDDoMKkwqTDoMKkwr7DoMKkwqjDoMKlwo3DoMKkwq/DoMKkwrnDoMKkwr7DoMKkwqjDoMKkwr8gw6bCnMKJw6bCnMKLIMOgwq7CuMOgwq/CjcOgwq7CscOgwq/CgMOgwq7CqcOgwq7Cv8Ogwq7CtcOgwq7CvsOgwq7CuCDDmcKxw5nChMOYwrHDmcKOw5nCkcOYwq3DmcKSw5k=?=\n =?utf-8?b?IMOZwoDDmcKOw5jCqMOZwobDmcKQ?=\n\n'
In [66]: hm2.items()
Out[66]:
[('Test',
'[one line of UTF8-encoded unicode text] Ï\x87Ï\x81Ï\x85Ï\x83αÏ\x86ὶ 太é\x99½ à¹\x80ลิศ â\x99\x9câ \x99\x9eâ\x99\x9dâ\x99\x9bâ\x99\x9aâ\x99\x9dâ\x99\x9eâ\x99\x9c оживлÑ\x91ннÑ\x8bм तानà¥\x8dयहानि æ\x9c\x89æ\x9c\x8b ஸà¯\x8dà®±à¯\x80னிவாஸ Ù±Ù\x84رÙ\x8eÙ\x91Ø\xadÙ\x92Ù\x85Ù\x80Ù\x8eبÙ\x86Ù\x90')]
In [67]: hm2.defects
Out[67]: []
Note how in this case, str(hm2) ends up having three chunks of RFC 2047 text, denoted by =?utf-8?, whereas the previous example had only one (the previous example seems to have \n\n in that place, which seems like it could totally confuse the parser...). The end result is that hm2.items() returns a much longer value for the Test header.
It is curious that I was able to call add_header and have things work, but somehow this is not working in the original code path.
The httpie.client.encode_headers function is currently encoding to utf-8. From my understanding of the RFC, this doesn't seem right? Perhaps we should be using the RFC 2047 style encoding that the email.header module implements?
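For reference, this is roughly what RFC 2047 encoding with email.header looks like. It is only a sketch of the idea floated above, with a short sample value of my own; whether servers such as httpbin would decode encoded-words in arbitrary HTTP headers is a separate question that this does not answer.

```python
from email.header import Header, decode_header, make_header

# Sample value is an assumption for illustration.
value = "χρυσαφὶ 太陽"

# Encode as one or more RFC 2047 encoded-words; the result is pure ASCII.
encoded = Header(value, charset="utf-8").encode()
print(encoded)

# Decoding the encoded-words gives back the original text.
decoded = str(make_header(decode_header(encoded)))
print(decoded == value)
```

Because the encoded form contains only ASCII, it cannot contain characters like \x85 that a text-based line splitter might trip over.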
See: https://github.com/jakubroztocil/httpie/pull/281 -- tests are failing though.
I cc'd flufl @warsaw, because he has his name on a lot of the stdlib code for email and HTTP header parsing.
I think I'm going to take a break from this issue for a while, so anyone else who wants to dive in, feel free.
Anyone have any ideas on how to tackle this?
@msabramo it looks like the right way to go about this would be to switch to the approach you tried in #281. (Btw, #212 provides some more context.)
py34 test failure: KeyError: 'Authorization' error in TestSession.test_session_unicode
I can reproduce the test_session_unicode failure consistently by explicitly passing a --hashseed to tox.
From investigation in https://github.com/jakubroztocil/httpie/issues/278, I've determined that this happens because Python 3.4's HTTP header parsing chokes on the Test header. I think that this is because the Test header contains UTF-8 data, which is not properly encoded.
Note that you can see the Authorization header in the output of str(self.headers), but it's not showing up in self.headers.items(). And the Test header is severely truncated.
I am suspicious of the Test header. That Test header is the last one that shows up in self.headers.items(); no header that occurs after it appears -- e.g.: Accept-Encoding, Connection, Authorization.
Also, the value is very short, so I suspect that parsing is failing midway through and messing up the processing of all subsequent headers.
There's even a "defect" recorded. The email parser mentions in its comments that it doesn't throw exceptions; it records defects instead.
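To illustrate the "records defects instead of raising" behavior in isolation, here is a small sketch with a deliberately malformed header block of my own (the host name and credentials are made up). A line that does not look like a header ends header parsing, so everything after it is treated as body: it still shows up in str(msg) but not in msg.items(), which mirrors the symptom described above.

```python
import email.errors
import email.parser
import http.client

# Made-up header block: the second line has no "name:" prefix.
raw = (
    "Host: example.invalid\r\n"
    "this line is not a header\r\n"
    "Authorization: Basic abc\r\n"
    "\r\n"
)

msg = email.parser.Parser(_class=http.client.HTTPMessage).parsestr(raw)

print(msg.items())     # only the headers seen before the bad line
print(msg.defects)     # the parser's record of what went wrong
print("Authorization" in msg, "Authorization" in str(msg))
```

No exception is raised at any point, so a caller that never looks at msg.defects has no indication that headers were dropped.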
The root cause seems to be that the code in email/feedparser.py chokes on the unicode headers. It happens only sometimes because a Python dict, where the request headers are stored, is unordered. So, if the Authorization header comes after Test when it's being serialized (such as when you pass --hashseed=1811760512), it doesn't get parsed correctly at the server side and is therefore missing from httpbin's response.