internetarchive / warcprox

WARC writing MITM HTTP/S proxy

Increase the MAXHEADERS limit of http client #119

Closed vbanos closed 5 years ago

vbanos commented 5 years ago

http.client has an arbitrary hard-coded limit, _MAXHEADERS = 100. If the response from a target URL has more than 100 headers, http.client raises an HTTPException and the request fails. (The target pages are perfectly fine apart from having more than 100 headers.) https://github.com/python/cpython/blob/3.7/Lib/http/client.py#L113

This change increases the limit to 7000. We already use this value in production in WBM, where we hit the same issue when replaying pages with too many HTTP headers. There we raised the limit progressively from 100 to 500, 1000, and so on, and found that 7000 is a good place to stop.
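To illustrate the idea, here is a minimal sketch (not necessarily the exact patch) of how the limit can be raised by overriding the module-level constant in http.client. The helper names build_header_block and can_parse are hypothetical and exist only for this demonstration; _MAXHEADERS is a private CPython name, so this relies on an implementation detail that could change between Python versions.

```python
import http.client
import io


def build_header_block(count):
    """Return raw bytes for `count` header lines plus the terminating blank line.

    Hypothetical helper, used only to simulate a response with many headers.
    """
    lines = [("X-Test-%d: value\r\n" % i).encode("latin-1") for i in range(count)]
    return b"".join(lines) + b"\r\n"


def can_parse(count):
    """Return True if http.client accepts a header block with `count` headers."""
    try:
        http.client.parse_headers(io.BytesIO(build_header_block(count)))
        return True
    except http.client.HTTPException:
        return False


# With the stock limit of 100, a 150-header block is rejected.
before = can_parse(150)

# Assumption: overriding the private module-level constant is enough, because
# the header parser looks it up at call time. 7000 is the value from this issue.
http.client._MAXHEADERS = 7000

# The same block is now accepted.
after = can_parse(150)

print("parsed before override:", before)
print("parsed after override:", after)
```

The override affects every user of http.client in the process, so it makes sense to apply it once at import time in the module that wraps http.client.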

vbanos commented 5 years ago

Example page with 150 HTTP headers, which crashes the current warcprox: http://vbanos.gr/manyheaders.php

nlevitt commented 5 years ago

For the record, this is what happens currently. After the error, warcprox continues working normally. ("Crash" is a strong and somewhat misleading word for this 🤓) Thanks for the test URL, @vbanos.

2019-04-08 11:27:36,226 60832 ERROR MitmProxyHandler(tid=n/a,started=2019-04-08T18:27:35.895898,client=127.0.0.1:51256) warcprox.warcprox.WarcProxyHandler.do_COMMAND(mitmproxy.py:396) error from remote server(?) 'GET http://vbanos.gr/manyheaders.php HTTP/1.1': HTTPException('got more than 100 headers')
Traceback (most recent call last):
  File "/Users/nlevitt/workspace/warcprox/warcprox/mitmproxy.py", line 386, in do_COMMAND
    return self._proxy_request()
  File "/Users/nlevitt/workspace/warcprox/warcprox/warcproxy.py", line 211, in _proxy_request
    self, extra_response_headers=extra_response_headers)
  File "/Users/nlevitt/workspace/warcprox/warcprox/mitmproxy.py", line 422, in _proxy_request
    return self._inner_proxy_request(extra_response_headers)
  File "/Users/nlevitt/workspace/warcprox/warcprox/mitmproxy.py", line 479, in _inner_proxy_request
    prox_rec_res.begin(extra_response_headers=extra_response_headers)
  File "/Users/nlevitt/workspace/warcprox/warcprox/mitmproxy.py", line 170, in begin
    http_client.HTTPResponse.begin(self)  # reads status line, headers
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 320, in begin
    self.headers = self.msg = parse_headers(self.fp)
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 210, in parse_headers
    raise HTTPException("got more than %d headers" % _MAXHEADERS)
http.client.HTTPException: got more than 100 headers
2019-04-08 11:27:36,230 60832 WARNING MitmProxyHandler(tid=n/a,started=2019-04-08T18:27:35.895898,client=127.0.0.1:51256) warcprox.warcprox.WarcProxyHandler.log_error(mitmproxy.py:524) code 502, message got more than 100 headers

nlevitt commented 5 years ago

Thanks!