internetarchive / warcprox

WARC writing MITM HTTP/S proxy

Continue request when http.client.IncompleteRead is raised #123

Closed vbanos closed 5 years ago

vbanos commented 5 years ago

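Sometimes, because of a problem on the HTTP server, http.client raises an IncompleteRead exception and warcprox fails while reading data from the target URL. In the example below, the server closes the connection in the middle of a chunked response. Example: http://www.alumni.weber.edu/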

curl and browsers work correctly with the same URL.

$ curl -X GET http://www.alumni.weber.edu/
<html><head><title>Object moved</title></head><body>
<h2>Object moved to <a
href="https://www.alumni.weber.edu/">here</a>.</h2>
</body></html>
curl: (18) transfer closed with outstanding read data remaining

Note that curl reports error 18, but the page body is still downloaded.

The warcprox exception when trying to download http://www.alumni.weber.edu/ is:

2019-04-13 16:01:05,271 19278 ERROR MitmProxyHandler(tid=5761,started=2019-04-13T16:01:05.024898,client=127.0.0.1:46234) warcprox.warcprox.WarcProxyHandler.do_COMMAND(mitmproxy.py:407) error from remote server(?) 'GET http://www.alumni.weber.edu/ HTTP/1.1': IncompleteRead(146 bytes read)
Traceback (most recent call last):
  File "/home/vbanos/.pyenv/versions/3.5.2/lib/python3.5/http/client.py", line 541, in _get_chunk_left
    chunk_left = self._read_next_chunk_size()
  File "/home/vbanos/.pyenv/versions/3.5.2/lib/python3.5/http/client.py", line 508, in _read_next_chunk_size
    return int(line, 16)
ValueError: invalid literal for int() with base 16: b''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/vbanos/.pyenv/versions/3.5.2/lib/python3.5/http/client.py", line 573, in _readinto_chunked
    chunk_left = self._get_chunk_left()
  File "/home/vbanos/.pyenv/versions/3.5.2/lib/python3.5/http/client.py", line 543, in _get_chunk_left
    raise IncompleteRead(b'')
http.client.IncompleteRead: IncompleteRead(0 bytes read)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/spn/lib/python3.5/site-packages/warcprox-2.4.3-py3.5.egg/warcprox/mitmproxy.py", line 397, in do_COMMAND
    return self._proxy_request()
  File "/opt/spn/lib/python3.5/site-packages/warcprox-2.4.3-py3.5.egg/warcprox/warcproxy.py", line 211, in _proxy_request
    self, extra_response_headers=extra_response_headers)
  File "/opt/spn/lib/python3.5/site-packages/warcprox-2.4.3-py3.5.egg/warcprox/mitmproxy.py", line 437, in _proxy_request
    return self._inner_proxy_request(extra_response_headers)
  File "/opt/spn/lib/python3.5/site-packages/warcprox-2.4.3-py3.5.egg/warcprox/mitmproxy.py", line 496, in _inner_proxy_request
    buf = prox_rec_res.read(65536)
  File "/opt/spn/lib/python3.5/site-packages/warcprox-2.4.3-py3.5.egg/warcprox/mitmproxy.py", line 198, in read
    buf = http_client.HTTPResponse.read(self, amt)
  File "/home/vbanos/.pyenv/versions/3.5.2/lib/python3.5/http/client.py", line 448, in read
    n = self.readinto(b)
  File "/home/vbanos/.pyenv/versions/3.5.2/lib/python3.5/http/client.py", line 478, in readinto
    return self._readinto_chunked(b)
  File "/home/vbanos/.pyenv/versions/3.5.2/lib/python3.5/http/client.py", line 589, in _readinto_chunked
    raise IncompleteRead(bytes(b[0:total_bytes]))
http.client.IncompleteRead: IncompleteRead(146 bytes read)

This PR adds exception handling for http.client.IncompleteRead so that the request continues instead of failing when the exception is raised.
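
To illustrate the general idea (this is not the actual patch in this PR), here is a minimal standalone sketch of a read loop that tolerates a truncated chunked response. It relies on the partial attribute of http.client.IncompleteRead, which holds the bytes read before the failure. The names FakeSocket and read_tolerating_truncation are hypothetical and exist only for this example; the canned response stands in for a server that closes the connection before finishing a chunked body.

```python
import http.client
import io


class FakeSocket:
    """Hypothetical stand-in for a real socket; serves canned bytes."""

    def __init__(self, data):
        self._data = data

    def makefile(self, mode, *args, **kwargs):
        # HTTPResponse only needs a binary file-like object to read from.
        return io.BytesIO(self._data)


def read_tolerating_truncation(response, bufsize=65536):
    """Read the whole body, keeping partial data if the server cuts it short.

    Returns (body, truncated). This helper is an illustration, not warcprox code.
    """
    chunks = []
    truncated = False
    while True:
        try:
            buf = response.read(bufsize)
        except http.client.IncompleteRead as e:
            # e.partial holds whatever was read before the stream ended.
            chunks.append(e.partial)
            truncated = True
            break
        if not buf:
            break
        chunks.append(buf)
    return b"".join(chunks), truncated


# A chunked response whose terminating zero-length chunk never arrives.
raw = (b"HTTP/1.1 200 OK\r\n"
       b"Transfer-Encoding: chunked\r\n"
       b"\r\n"
       b"5\r\nhello\r\n")
response = http.client.HTTPResponse(FakeSocket(raw))
response.begin()
body, truncated = read_tolerating_truncation(response)
print(body, truncated)
```

A proxy using this approach would forward the partial body to the client and could record that the response was truncated, rather than aborting the request.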

curl now behaves exactly the same with or without using warcprox.

export http_proxy=http://localhost:8888/; curl -X GET http://www.alumni.weber.edu/

nlevitt commented 5 years ago

Closing in favor of #124