py34 test failure: KeyError: 'Authorization' error in TestSession.test_session_unicode

msabramo commented 9 years ago

I can reproduce the test_session_unicode failure consistently by explicitly passing a --hashseed to tox:

❯ tox -e py34 --hashseed=1811760512 -- tests/test_sessions.py -k test_session_unicode
GLOB sdist-make: /Users/marca/dev/git-repos/httpie/setup.py
py34 inst-nodeps: /Users/marca/dev/git-repos/httpie/.tox/dist/httpie-0.9.0-dev.zip
py34 runtests: PYTHONHASHSEED='1811760512'
py34 runtests: commands[0] | py.test --verbose --doctest-modules --basetemp=/Users/marca/dev/git-repos/httpie/.tox/py34/tmp tests/test_sessions.py -k test_session_unicode
============================================================================= test session starts ==============================================================================
platform darwin -- Python 3.4.0 -- py-1.4.26 -- pytest-2.6.4 -- /Users/marca/dev/git-repos/httpie/.tox/py34/bin/python3.4
plugins: httpbin
collected 6 items

tests/test_sessions.py::TestSession::test_session_unicode FAILED

=================================================================================== FAILURES ===================================================================================
_______________________________________________________________________ TestSession.test_session_unicode _______________________________________________________________________
Traceback (most recent call last):
  File "/Users/marca/dev/git-repos/httpie/tests/test_sessions.py", line 151, in test_session_unicode
    assert (r2.json['headers']['Authorization']
KeyError: 'Authorization'
----------------------------------------------------------------------------- Captured stderr call -----------------------------------------------------------------------------
127.0.0.1 - - [29/Nov/2014 11:40:28] "GET /get HTTP/1.1" 200 301
127.0.0.1 - - [29/Nov/2014 11:40:28] "GET /get HTTP/1.1" 200 301
================================================================ 5 tests deselected by '-ktest_session_unicode' ================================================================
============================================================== 1 failed, 5 deselected, 1 warnings in 0.67 seconds ==============================================================
ERROR: InvocationError: '/Users/marca/dev/git-repos/httpie/.tox/py34/bin/py.test --verbose --doctest-modules --basetemp=/Users/marca/dev/git-repos/httpie/.tox/py34/tmp tests/test_sessions.py -k test_session_unicode'
___________________________________________________________________________________ summary ____________________________________________________________________________________
ERROR:   py34: commands failed

From investigation in https://github.com/jakubroztocil/httpie/issues/278, I've determined that this happens because Python 3.4's HTTP header parsing chokes on the Test header. I think that this is because the Test header contains UTF-8 data, which is not properly encoded.

> /Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/wsgiref/simple_server.py(104)get_environ()
-> for k, v in self.headers.items():
(Pdb) str(self.headers)
'Host: 127.0.0.1:61463\nUser-Agent: HTTPie/0.9.0-dev\nAccept: */*\nTest: =?utf-8?b?W29uZSBsaW5lIG9mIFVURjgtZW5jb2RlZCB1bmljb2RlIHRleHRdIMOPwofDj8KBw48=?=\n\nÏ\x83Î±Ï\x86á½¶ å¤ªé\x99½ à¹\x80à¸¥à¸´à¸¨ â\x99\x9câ\x99\x9eâ\x99\x9dâ\x99\x9bâ\x99\x9aâ\x99\x9dâ\x99\x9eâ\x99\x9c Ð¾Ð¶Ð¸Ð²Ð»Ñ\x91Ð½Ð½Ñ\x8bÐ¼ à¤¤à¤¾à¤¨à¥\x8dà¤¯à¤¹à¤¾à¤¨à¤¿ æ\x9c\x89æ\x9c\x8b à®¸à¯\x8dà®±à¯\x80à®©à®¿à®µà®¾à®¸ Ù±Ù\x84Ø±Ù\x8eÙ\x91Ø\xadÙ\x92Ù\x85\nÙ\x80Ù\x8eØ¨Ù\x86Ù\x90\nAccept-Encoding: gzip, deflate\nConnection: keep-alive\nAuthorization: Basic dGVzdDpbb25lIGxpbmUgb2YgVVRGOC1lbmNvZGVkIHVuaWNvZGUgdGV4dF0gz4fPgc+Fz4POsc+G4b22IOWkqumZvSDguYDguKXguLTguKgg4pmc4pme4pmd4pmb4pma4pmd4pme4pmcINC+0LbQuNCy0LvRkdC90L3Ri9C8IOCkpOCkvuCkqOCljeCkr+CkueCkvuCkqOCkvyDmnInmnIsg4K644K+N4K6x4K+A4K6p4K6/4K614K6+4K64INmx2YTYsdmO2ZHYrdmS2YXZgNmO2KjZhtmQ\n\n'
(Pdb) self.headers.items()
[('Host', '127.0.0.1:61463'),
 ('User-Agent', 'HTTPie/0.9.0-dev'),
 ('Accept', '*/*'),
 ('Test', '[one line of UTF8-encoded unicode text] Ï\x87Ï\x81Ï\x85')]

Note that you can see the Authorization header in the output of str(self.headers), but it's not showing up in self.headers.items(). And the Test header is severely truncated.

I am suspicious of the Test header:

('Test', '[one line of UTF8-encoded unicode text] Ï\x87Ï\x81Ï\x85')]

That Test header is the last one that shows up in self.headers.items(); no header that occurs after it appears -- e.g.: Accept-Encoding, Connection, Authorization

Also, the value is very short, so I suspect that parsing is failing midway through the Test header and messing up the processing of all subsequent headers.

There's even a "defect" recorded. The email parser mentions in its comments that it doesn't throw exceptions, it records defects instead.

> /Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/wsgiref/simple_server.py(104)get_environ()
-> for k, v in self.headers.items():
(Pdb) self.headers
<http.client.HTTPMessage object at 0x106612668>
(Pdb) self.headers.defects
[MissingHeaderBodySeparatorDefect()]
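
To make the "defect instead of exception" behaviour concrete, here is a minimal sketch using the same `email.parser.Parser(_class=http.client.HTTPMessage)` call that appears in `http/client.py`. The header block is hypothetical (it is not the failing request); it only assumes that one line in the middle does not look like a header.

```python
import email.parser
import http.client

# Hypothetical header block: the second line has no "Name:" prefix,
# so the parser cannot treat it as a header.
raw = (
    "Host: example.invalid\r\n"
    "this line has no colon\r\n"
    "Authorization: Basic abc\r\n"
    "\r\n"
)

msg = email.parser.Parser(_class=http.client.HTTPMessage).parsestr(raw)

# Names of the headers the parser kept, and the defects it recorded.
header_names = [name for name, _ in msg.items()]
defect_names = [type(d).__name__ for d in msg.defects]
print(header_names)
print(defect_names)
```

No exception is raised; the parser stops collecting headers at the first line it cannot interpret, so anything after that line (here, Authorization) never shows up in `items()`.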

The root cause seems to be that the code in email/feedparser.py chokes on the unicode headers. It happens only sometimes because a Python dict, where the request headers are stored, is unordered. So, if the Authorization header comes after Test when it's being serialized (such as when you pass --hashseed=1811760512), it doesn't get parsed correctly at the server side and is therefore missing from httpbin's response.
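
A rough illustration of why pinning the hash seed makes the failure reproducible. This is only a sketch: it uses a `set` of header names rather than a dict, because since CPython 3.7 dicts preserve insertion order, while set iteration order for strings still depends on `PYTHONHASHSEED`. The helper name `order_for_seed` is made up for this example.

```python
import os
import subprocess
import sys

# Code run in a child interpreter: print the iteration order of a set of names.
SNIPPET = "print(list({'Test', 'Authorization', 'Accept-Encoding', 'Connection'}))"


def order_for_seed(seed):
    """Return the set iteration order a child interpreter sees for this seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.strip()


for seed in (1, 2, 1811760512):
    print(seed, order_for_seed(seed))
```

The same seed always gives the same order, while different seeds may or may not differ. That matches the observation above: the test only fails for seeds that happen to place Authorization after Test.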

msabramo commented 9 years ago

Here is what gets received, just before parsing:

> /Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/http/client.py(272)parse_headers()-><http.client....t 0x105e0f6a0>
-> return email.parser.Parser(_class=_class).parsestr(hstring)
(Pdb) hstring
'Host: 127.0.0.1:63531\r\nUser-Agent: HTTPie/0.9.0-dev\r\nAccept: */*\r\nTest: [one line of 
UTF8-encoded unicode text] Ï\x87Ï\x81Ï\x85Ï\x83Î±Ï\x86á½¶ å¤ªé\x99½ à¹\x80à¸¥à¸´à¸¨ â\x99\x9câ
\x99\x9eâ\x99\x9dâ\x99\x9bâ\x99\x9aâ\x99\x9dâ\x99\x9eâ\x99\x9c Ð¾Ð¶Ð¸Ð²Ð»Ñ\x91Ð½Ð½Ñ
\x8bÐ¼ à¤¤à¤¾à¤¨à¥\x8dà¤¯à¤¹à¤¾à¤¨à¤¿ æ\x9c\x89æ\x9c\x8b à®¸à¯\x8dà®±à¯
\x80à®©à®¿à®µà®¾à®¸ Ù±Ù\x84Ø±Ù\x8eÙ\x91Ø\xadÙ\x92Ù\x85Ù\x80Ù\x8eØ¨Ù\x86Ù
\x90\r\nAccept-Encoding: gzip, deflate\r\nConnection: keep-alive\r\nAuthorization: Basic
 dGVzdDpbb25lIGxpbmUgb2YgVVRGOC1lbmNvZGVkIHVuaWNvZGUgdGV4dF0gz4fPgc+Fz4POsc+
G4b22IOWkqumZvSDguYDguKXguLTguKgg4pmc4pme4pmd4pmb4pma4pmd4pme4pmcINC+0LbQu
NCy0LvRkdC90L3Ri9C8IOCkpOCkvuCkqOCljeCkr+CkueCkvuCkqOCkvyDmnInmnIsg4K644K+N4K6x
4K+A4K6p4K6/4K614K6+4K64INmx2YTYsdmO2ZHYrdmS2YXZgNmO2KjZhtmQ\r\n\r\n'
(Pdb) hstring.encode('unicode_escape')
b'Host: 127.0.0.1:63531\\r\\nUser-Agent: HTTPie/0.9.0-dev\\r\\nAccept: */*\\r\\nTest: [one line of 
UTF8-encoded unicode text] \\xcf\\x87\\xcf\\x81\\xcf\\x85\\xcf\\x83\\xce\\xb1\\xcf\\x86\\xe1\\xbd\\xb6
 \\xe5\\xa4\\xaa\\xe9\\x99\\xbd \\xe0\\xb9\\x80\\xe0\\xb8\\xa5\\xe0\\xb8\\xb4\\xe0\\xb8\\xa8 
\\xe2\\x99\\x9c\\xe2\\x99\\x9e\\xe2\\x99\\x9d\\xe2\\x99\\x9b\\xe2\\x99\\x9a\\xe2\\x99\\x9d\\xe2\\x99
\\x9e\\xe2\\x99\\x9c \\xd0\\xbe\\xd0\\xb6\\xd0\\xb8\\xd0\\xb2\\xd0\\xbb\\xd1\\x91\\xd0\\xbd\\xd0\\xbd\\xd1\\x8b\\xd0\\xbc \\xe0\\xa4\\xa4\\xe0\\xa4\\xbe\\xe0\\xa4\\xa8\\xe0\\xa5\\x8d\\xe0\\xa4\\xaf\\xe0\\xa4\\xb9\\xe0\\xa4\\xbe\\xe0\\xa4\\xa8\\xe0\\xa4\\xbf \\xe6\\x9c\\x89\\xe6\\x9c\\x8b \\xe0\\xae\\xb8\\xe0\\xaf\\x8d\\xe0\\xae\\xb1\\xe0\\xaf\\x80\\xe0\\xae\\xa9\\xe0\\xae\\xbf\\xe0\\xae\\xb5\\xe0\\xae\\xbe\\xe0\\xae\\xb8 \\xd9\\xb1\\xd9\\x84\\xd8\\xb1\\xd9\\x8e\\xd9\\x91\\xd8\\xad\\xd9\\x92\\xd9\\x85\\xd9\\x80\\xd9\\x8e\\xd8\\xa8\\xd9\\x86\\xd9\\x90\\r\\nAccept-Encoding: gzip, deflate\\r\\nConnection: keep-alive\\r\\nAuthorization: Basic dGVzdDpbb25lIGxpbmUgb2YgVVRGOC1lbmNvZGVkIHVuaWNvZGUgdGV4dF0gz4fPgc+Fz4POsc+G
4b22IOWkqumZvSDguYDguKXguLTguKgg4pmc4pme4pmd4pmb4pma4pmd4pme4pmcINC+0LbQuN
Cy0LvRkdC90L3Ri9C8IOCkpOCkvuCkqOCljeCkr+CkueCkvuCkqOCkvyDmnInmnIsg4K644K+N4K6x4
K+A4K6p4K6/4K614K6+4K64INmx2YTYsdmO2ZHYrdmS2YXZgNmO2KjZhtmQ\\r\\n\\r\\n'

At a glance, it doesn't look like it's RFC 2047. It looks like it's straight UTF-8:

In [25]: b'Test: [one line of UTF8-encoded unicode text] \xcf\x87\xcf\x81\xcf\x85\xcf\x83\xce\xb1\xcf\x86\xe1\xbd\xb6 \xe5\xa4\xaa\xe9\x99\xbd \xe0\xb9\x80\xe0\xb8\xa5\xe0\xb8\xb4\xe0\xb8\xa8 \xe2\x99\x9c\xe2\x99\x9e\xe2\x99\x9d\xe2\x99\x9b\xe2\x99\x9a\xe2\x99\x9d\xe2\x99\x9e\xe2\x99\x9c \xd0\xbe\xd0\xb6\xd0\xb8\xd0\xb2\xd0\xbb\xd1\x91\xd0\xbd\xd0\xbd\xd1\x8b\xd0\xbc \xe0\xa4\xa4\xe0\xa4\xbe\xe0\xa4\xa8\xe0\xa5\x8d\xe0\xa4\xaf\xe0\xa4\xb9\xe0\xa4\xbe\xe0\xa4\xa8\xe0\xa4\xbf \xe6\x9c\x89\xe6\x9c\x8b \xe0\xae\xb8\xe0\xaf\x8d\xe0\xae\xb1\xe0\xaf\x80\xe0\xae\xa9\xe0\xae\xbf\xe0\xae\xb5\xe0\xae\xbe\xe0\xae\xb8 \xd9\xb1\xd9\x84\xd8\xb1\xd9\x8e\xd9\x91\xd8\xad\xd9\x92\xd9\x85\xd9\x80\xd9\x8e\xd8\xa8\xd9\x86\xd9\x90'.decode('utf-8')
Out[25]: 'Test: [one line of UTF8-encoded unicode text] χρυσαφὶ 太陽 เลิศ ♜♞♝♛♚♝♞♜ оживлённым तान्यहानि 有朋 ஸ்றீனிவாஸ ٱلرَّحْمـَبنِ'

That seems incorrect: the header value is being sent as raw UTF-8 bytes rather than in an encoded form such as RFC 2047.

msabramo commented 9 years ago

Reproducing the core problem very simply in an IPython session:

In [44]: import email.parser, http.client

In [45]: hstring = 'Host: 127.0.0.1:63531\r\nUser-Agent: HTTPie/0.9.0-dev\r\nAccept: */*\r\nTest: [one line of UTF8-encoded unicode text] Ï\x87Ï\x81Ï\x85Ï\x83Î±Ï\x86á½¶ å¤ªé\x99½ à¹\x80à¸¥à¸´à¸¨ â\x99\x9câ \x99\x9eâ\x99\x9dâ\x99\x9bâ\x99\x9aâ\x99\x9dâ\x99\x9eâ\x99\x9c Ð¾Ð¶Ð¸Ð²Ð»Ñ\x91Ð½Ð½Ñ\x8bÐ¼ à¤¤à¤¾à¤¨à¥\x8dà¤¯à¤¹à¤¾à¤¨à¤¿ æ\x9c\x89æ\x9c\x8b à®¸à¯\x8dà®±à¯\x80à®©à®¿à®µà®¾à®¸ Ù±Ù\x84Ø±Ù\x8eÙ\x91Ø\xadÙ\x92Ù\x85Ù\x80Ù\x8eØ¨Ù\x86Ù\x90\r\nAccept-Encoding: gzip, deflate\r\nConnection: keep-alive\r\nAuthorization: Basic dGVzdDpbb25lIGxpbmUgb2YgVVRGOC1lbmNvZGVkIHVuaWNvZGUgdGV4dF0gz4fPgc+Fz4POsc+G4b22IOWkqumZvSDguYDguKXguLTguKgg4pmc4pme4pmd4pmb4pma4pmd4pme4pmcINC+0LbQuNCy0LvRkdC90L3Ri9C8IOCkpOCkvuCkqOCljeCkr+CkueCkvuCkqOCkvyDmnInmnIsg4K644K+N4K6x4K+A4K6p4K6/4K614K6+4K64INmx2YTYsdmO2ZHYrdmS2YXZgNmO2KjZhtmQ\r\n\r\n'

In [46]: hm = email.parser.Parser(_class=http.client.HTTPMessage).parsestr(hstring)

In [47]: str(hm)
Out[47]: 'Host: 127.0.0.1:63531\nUser-Agent: HTTPie/0.9.0-dev\nAccept: */*\nTest: =?utf-8?b?W29uZSBsaW5lIG9mIFVURjgtZW5jb2RlZCB1bmljb2RlIHRleHRdIMOPwofDj8KBw48=?=\n\nÏ\x83Î±Ï\x86á½¶ å¤ªé\x99½ à¹\x80à¸¥à¸´à¸¨ â\x99\x9câ \x99\x9eâ\x99\x9dâ\x99\x9bâ\x99\x9aâ\x99\x9dâ\x99\x9eâ\x99\x9c Ð¾Ð¶Ð¸Ð²Ð»Ñ\x91Ð½Ð½Ñ\x8bÐ¼ à¤¤à¤¾à¤¨à¥\x8dà¤¯à¤¹à¤¾à¤¨à¤¿ æ\x9c\x89æ\x9c\x8b à®¸à¯\x8dà®±à¯\x80à®©à®¿à®µà®¾à®¸ Ù±Ù\x84Ø±Ù\x8eÙ\x91Ø\xadÙ\x92Ù\x85\nÙ\x80Ù\x8eØ¨Ù\x86Ù\x90\nAccept-Encoding: gzip, deflate\nConnection: keep-alive\nAuthorization: Basic dGVzdDpbb25lIGxpbmUgb2YgVVRGOC1lbmNvZGVkIHVuaWNvZGUgdGV4dF0gz4fPgc+Fz4POsc+G4b22IOWkqumZvSDguYDguKXguLTguKgg4pmc4pme4pmd4pmb4pma4pmd4pme4pmcINC+0LbQuNCy0LvRkdC90L3Ri9C8IOCkpOCkvuCkqOCljeCkr+CkueCkvuCkqOCkvyDmnInmnIsg4K644K+N4K6x4K+A4K6p4K6/4K614K6+4K64INmx2YTYsdmO2ZHYrdmS2YXZgNmO2KjZhtmQ\n\n'

In [48]: hm.items()
Out[48]:
[('Host', '127.0.0.1:63531'),
 ('User-Agent', 'HTTPie/0.9.0-dev'),
 ('Accept', '*/*'),
 ('Test', '[one line of UTF8-encoded unicode text] Ï\x87Ï\x81Ï\x85')]

In [49]: hm.defects
Out[49]: [email.errors.MissingHeaderBodySeparatorDefect()]

Perhaps most interesting is that midway through the value of str(hm), in the middle of the value for the Test header, there is a double newline -- \n\n. I could imagine this could cause the parser to choke.

In [82]: str(hm)[146:151]
Out[82]: '=?=\n\n'
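
One further observation, offered only as a hypothesis and not confirmed anywhere in this thread: the truncated Test value ends exactly at `\x85`, and `str.splitlines()` treats `\x85` (NEL) as a line boundary, whereas `bytes.splitlines()` only splits on `\r` and `\n`. If the parser splits the header text with `str` line splitting, that would explain why the Test header is cut at that point. The sketch below only demonstrates the splitting behaviour; it assumes the UTF-8 bytes are viewed as Latin-1 text, which matches how they are displayed in hstring.

```python
# UTF-8 bytes for the first four Greek letters of the test value.
utf8_bytes = "\u03c7\u03c1\u03c5\u03c3".encode("utf-8")

# The same bytes viewed as Latin-1 text, as they are displayed in hstring.
as_latin1 = utf8_bytes.decode("latin-1")

# str.splitlines() treats "\x85" as a line boundary; bytes.splitlines() does not.
str_pieces = as_latin1.splitlines(keepends=True)
bytes_pieces = utf8_bytes.splitlines(keepends=True)
print(str_pieces)
print(bytes_pieces)
```

The first `str` piece ends in `Ï\x85`, the same place where the Test value stops in `hm.items()`, and the remainder starts with `Ï\x83`, a line with no colon in it.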

msabramo commented 9 years ago

Strangely, if I manually construct the header, things seem to work better:

In [63]: hm2 = http.client.HTTPMessage()

In [64]: hm2.add_header('Test', '[one line of UTF8-encoded unicode text] Ï\x87Ï\x81Ï\x85Ï\x83Î±Ï\x86á½¶ å¤ªé\x99½ à¹\x80à¸¥à¸´à¸¨ â\x99\x9câ \x99\x9eâ\x99\x9dâ\x99\x9bâ\x99\x9aâ\x99\x9dâ\x99\x9eâ\x99\x9c Ð¾Ð¶Ð¸Ð²Ð»Ñ\x91Ð½Ð½Ñ\x8bÐ¼ à¤¤à¤¾à¤¨à¥\x8dà¤¯à¤¹à¤¾à¤¨à¤¿ æ\x9c\x89æ\x9c\x8b à®¸à¯\x8dà®±à¯\x80à®©à®¿à®µà®¾à®¸ Ù±Ù\x84Ø±Ù\x8eÙ\x91Ø\xadÙ\x92Ù\x85Ù\x80Ù\x8eØ¨Ù\x86Ù\x90')

In [65]: str(hm2)
Out[65]: 'Test: =?utf-8?b?W29uZSBsaW5lIG9mIFVURjgtZW5jb2RlZCB1bmljb2RlIHRleHRdIMOPwofDj8KBw48=?=\n =?utf-8?b?IMOPwoPDjsKxw4/ChsOhwr3CtiDDpcKkwqrDqcKZwr0gw6DCucKAw6DCuMKlw6DCuMK0w6DCuMKoIMOiwpnCnMOiIMKZwp7DosKZwp3DosKZwpvDosKZwprDosKZwp3DosKZwp7DosKZwpwgw5DCvsOQwrbDkMK4w5DCssOQwrvDkcKRw5DCvcOQwr3DkcKLw5DCvCDDoMKkwqTDoMKkwr7DoMKkwqjDoMKlwo3DoMKkwq/DoMKkwrnDoMKkwr7DoMKkwqjDoMKkwr8gw6bCnMKJw6bCnMKLIMOgwq7CuMOgwq/CjcOgwq7CscOgwq/CgMOgwq7CqcOgwq7Cv8Ogwq7CtcOgwq7CvsOgwq7CuCDDmcKxw5nChMOYwrHDmcKOw5nCkcOYwq3DmcKSw5k=?=\n =?utf-8?b?IMOZwoDDmcKOw5jCqMOZwobDmcKQ?=\n\n'

In [66]: hm2.items()
Out[66]:
[('Test',
  '[one line of UTF8-encoded unicode text] Ï\x87Ï\x81Ï\x85Ï\x83Î±Ï\x86á½¶ å¤ªé\x99½ à¹\x80à¸¥à¸´à¸¨ â\x99\x9câ \x99\x9eâ\x99\x9dâ\x99\x9bâ\x99\x9aâ\x99\x9dâ\x99\x9eâ\x99\x9c Ð¾Ð¶Ð¸Ð²Ð»Ñ\x91Ð½Ð½Ñ\x8bÐ¼ à¤¤à¤¾à¤¨à¥\x8dà¤¯à¤¹à¤¾à¤¨à¤¿ æ\x9c\x89æ\x9c\x8b à®¸à¯\x8dà®±à¯\x80à®©à®¿à®µà®¾à®¸ Ù±Ù\x84Ø±Ù\x8eÙ\x91Ø\xadÙ\x92Ù\x85Ù\x80Ù\x8eØ¨Ù\x86Ù\x90')]

In [67]: hm2.defects
Out[67]: []

Note how in this case, str(hm2) ends up having three chunks of RFC 2047 text, each starting with =?utf-8?, whereas the previous example had only one (the previous example has \n\n in that place, which seems like it could totally confuse the parser...). The end result is that hm2.items() returns a much longer value for the Test header.

It is curious that I was able to call add_header and have things work, but somehow this is not working in the original code path.

msabramo commented 9 years ago

The httpie.client.encode_headers function is currently encoding to utf-8. From my understanding of the RFC, this doesn't seem right? Perhaps we should be using the RFC 2047 style encoding that the email.header module implements?
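
For reference, a minimal sketch of what the stdlib `email.header` module produces for a non-ASCII value. It does not touch `httpie.client.encode_headers`, and whether an HTTP server would decode this form is a separate question that this sketch does not answer. The sample value is a short prefix of the text used in the failing test.

```python
from email.header import Header, decode_header, make_header

# Short sample: Greek text followed by two CJK characters.
value = "\u03c7\u03c1\u03c5\u03c3\u03b1\u03c6\u1f76 \u592a\u967d"

# Encode as an RFC 2047 "encoded word"; the result is pure ASCII.
encoded = Header(value, charset="utf-8").encode()

# Decode it again to check that the original text round-trips.
decoded = str(make_header(decode_header(encoded)))
print(encoded)
print(decoded == value)
```

The encoded form starts with `=?utf-8?`, like the chunks in the str(hm2) output above, and contains no bytes outside ASCII.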

See: https://github.com/jakubroztocil/httpie/pull/281 -- tests are failing though.

I cc'd flufl @warsaw, because he has his name on a lot of the stdlib code for email and HTTP header parsing.

msabramo commented 9 years ago

I think I'm going to take a break from this issue for a while, so anyone else who wants to dive in, feel free.

msabramo commented 9 years ago

Anyone have any ideas on how to tackle this?

jkbrzt commented 9 years ago

@msabramo it looks like the right way to go about this would be to switch to the approach you tried in #281. (Btw, #212 provides some more context.)

py34 test failure: KeyError: 'Authorization' error in TestSession.test_session_unicode #282