ukwa / ukwa-pywb

Occasional URLs throwing RemoteDisconnected errors because CDX response is huge #119

Open anjackson opened 1 year ago

anjackson commented 1 year ago

Some URLs, e.g.

https://www.facebook.com/plugins/like.php?href=http%3A%2F%2Fwww.snp.org%2Fblog%2Fpost%2F2012%2Ffeb%2Fscottish-independence-good-england&layout=button_count&show_faces=false&action=like&colorscheme=light&width=100&height=21&font&locale

hang and then fail, like this:

access_website_pywb.1.mq0w6tuyp9j4@prod2    | DAMN ! worker 2 (pid: 10865) died, killed by signal 9 :( trying respawn ...
access_website_pywb.1.mq0w6tuyp9j4@prod2    | Traceback (most recent call last):
access_website_pywb.1.mq0w6tuyp9j4@prod2    |   File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
access_website_pywb.1.mq0w6tuyp9j4@prod2    |     httplib_response = self._make_request(
access_website_pywb.1.mq0w6tuyp9j4@prod2    |   File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 449, in _make_request
access_website_pywb.1.mq0w6tuyp9j4@prod2    |     six.raise_from(e, None)
access_website_pywb.1.mq0w6tuyp9j4@prod2    |   File "<string>", line 3, in raise_from
access_website_pywb.1.mq0w6tuyp9j4@prod2    |   File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 444, in _make_request
access_website_pywb.1.mq0w6tuyp9j4@prod2    |     httplib_response = conn.getresponse()
access_website_pywb.1.mq0w6tuyp9j4@prod2    |   File "/usr/local/lib/python3.8/http/client.py", line 1348, in getresponse
access_website_pywb.1.mq0w6tuyp9j4@prod2    |     response.begin()
access_website_pywb.1.mq0w6tuyp9j4@prod2    |   File "/usr/local/lib/python3.8/http/client.py", line 316, in begin
access_website_pywb.1.mq0w6tuyp9j4@prod2    |     version, status, reason = self._read_status()
access_website_pywb.1.mq0w6tuyp9j4@prod2    |   File "/usr/local/lib/python3.8/http/client.py", line 285, in _read_status
access_website_pywb.1.mq0w6tuyp9j4@prod2    |     raise RemoteDisconnected("Remote end closed connection without"
access_website_pywb.1.mq0w6tuyp9j4@prod2    | http.client.RemoteDisconnected: Remote end closed connection without response
access_website_pywb.1.mq0w6tuyp9j4@prod2    |
access_website_pywb.1.mq0w6tuyp9j4@prod2    | During handling of the above exception, another exception occurred:
access_website_pywb.1.mq0w6tuyp9j4@prod2    |
access_website_pywb.1.mq0w6tuyp9j4@prod2    | Traceback (most recent call last):
access_website_pywb.1.mq0w6tuyp9j4@prod2    |   File "/usr/local/lib/python3.8/site-packages/requests/adapters.py", line 489, in send
access_website_pywb.1.mq0w6tuyp9j4@prod2    |     resp = conn.urlopen(
access_website_pywb.1.mq0w6tuyp9j4@prod2    |   File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 787, in urlopen
access_website_pywb.1.mq0w6tuyp9j4@prod2    |     retries = retries.increment(
access_website_pywb.1.mq0w6tuyp9j4@prod2    |   File "/usr/local/lib/python3.8/site-packages/urllib3/util/retry.py", line 550, in increment
access_website_pywb.1.mq0w6tuyp9j4@prod2    |     raise six.reraise(type(error), error, _stacktrace)
access_website_pywb.1.mq0w6tuyp9j4@prod2    |   File "/usr/local/lib/python3.8/site-packages/urllib3/packages/six.py", line 769, in reraise
access_website_pywb.1.mq0w6tuyp9j4@prod2    |     raise value.with_traceback(tb)
access_website_pywb.1.mq0w6tuyp9j4@prod2    |   File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
access_website_pywb.1.mq0w6tuyp9j4@prod2    |     httplib_response = self._make_request(
access_website_pywb.1.mq0w6tuyp9j4@prod2    |   File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 449, in _make_request
access_website_pywb.1.mq0w6tuyp9j4@prod2    |     six.raise_from(e, None)
access_website_pywb.1.mq0w6tuyp9j4@prod2    |   File "<string>", line 3, in raise_from
access_website_pywb.1.mq0w6tuyp9j4@prod2    |   File "/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 444, in _make_request
access_website_pywb.1.mq0w6tuyp9j4@prod2    |     httplib_response = conn.getresponse()
access_website_pywb.1.mq0w6tuyp9j4@prod2    |   File "/usr/local/lib/python3.8/http/client.py", line 1348, in getresponse
access_website_pywb.1.mq0w6tuyp9j4@prod2    |     response.begin()
access_website_pywb.1.mq0w6tuyp9j4@prod2    |   File "/usr/local/lib/python3.8/http/client.py", line 316, in begin
access_website_pywb.1.mq0w6tuyp9j4@prod2    |     version, status, reason = self._read_status()
access_website_pywb.1.mq0w6tuyp9j4@prod2    |   File "/usr/local/lib/python3.8/http/client.py", line 285, in _read_status
access_website_pywb.1.mq0w6tuyp9j4@prod2    |     raise RemoteDisconnected("Remote end closed connection without"
access_website_pywb.1.mq0w6tuyp9j4@prod2    | urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
access_website_pywb.1.mq0w6tuyp9j4@prod2    |
access_website_pywb.1.mq0w6tuyp9j4@prod2    | During handling of the above exception, another exception occurred:
access_website_pywb.1.mq0w6tuyp9j4@prod2    |
access_website_pywb.1.mq0w6tuyp9j4@prod2    | Traceback (most recent call last):
access_website_pywb.1.mq0w6tuyp9j4@prod2    |   File "/usr/local/lib/python3.8/site-packages/pywb-2.6.9-py3.8.egg/pywb/apps/frontendapp.py", line 655, in handle_request
access_website_pywb.1.mq0w6tuyp9j4@prod2    |     response = endpoint(environ, **args)
access_website_pywb.1.mq0w6tuyp9j4@prod2    |   File "/usr/local/lib/python3.8/site-packages/pywb-2.6.9-py3.8.egg/pywb/apps/frontendapp.py", line 486, in serve_content
access_website_pywb.1.mq0w6tuyp9j4@prod2    |     return self.rewriterapp.render_content(wb_url_str, coll_config, environ)
access_website_pywb.1.mq0w6tuyp9j4@prod2    |   File "/ukwa_pywb/./ukwa_pywb/ukwa_app.py", line 184, in render_content
access_website_pywb.1.mq0w6tuyp9j4@prod2    |     default_response = super(UKWARewriter, self).render_content(wb_url_str, coll_config, environ)
access_website_pywb.1.mq0w6tuyp9j4@prod2    |   File "/usr/local/lib/python3.8/site-packages/pywb-2.6.9-py3.8.egg/pywb/apps/rewriterapp.py", line 431, in render_content
access_website_pywb.1.mq0w6tuyp9j4@prod2    | urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
access_website_pywb.1.mq0w6tuyp9j4@prod2    |
access_website_pywb.1.mq0w6tuyp9j4@prod2    | During handling of the above exception, another exception occurred:
access_website_pywb.1.mq0w6tuyp9j4@prod2    |
access_website_pywb.1.mq0w6tuyp9j4@prod2    | Traceback (most recent call last):
access_website_pywb.1.mq0w6tuyp9j4@prod2    |   File "/usr/local/lib/python3.8/site-packages/pywb-2.6.9-py3.8.egg/pywb/apps/frontendapp.py", line 655, in handle_request
access_website_pywb.1.mq0w6tuyp9j4@prod2    |     response = endpoint(environ, **args)
access_website_pywb.1.mq0w6tuyp9j4@prod2    |   File "/usr/local/lib/python3.8/site-packages/pywb-2.6.9-py3.8.egg/pywb/apps/frontendapp.py", line 486, in serve_content
access_website_pywb.1.mq0w6tuyp9j4@prod2    |     return self.rewriterapp.render_content(wb_url_str, coll_config, environ)
access_website_pywb.1.mq0w6tuyp9j4@prod2    |   File "/ukwa_pywb/./ukwa_pywb/ukwa_app.py", line 184, in render_content
access_website_pywb.1.mq0w6tuyp9j4@prod2    |     default_response = super(UKWARewriter, self).render_content(wb_url_str, coll_config, environ)
access_website_pywb.1.mq0w6tuyp9j4@prod2    |   File "/usr/local/lib/python3.8/site-packages/pywb-2.6.9-py3.8.egg/pywb/apps/rewriterapp.py", line 431, in render_content
access_website_pywb.1.mq0w6tuyp9j4@prod2    |     r = self._do_req(inputreq, wb_url, kwargs, skip_record)
access_website_pywb.1.mq0w6tuyp9j4@prod2    |   File "/usr/local/lib/python3.8/site-packages/pywb-2.6.9-py3.8.egg/pywb/apps/rewriterapp.py", line 737, in _do_req
access_website_pywb.1.mq0w6tuyp9j4@prod2    |     r = requests.post(upstream_url,
access_website_pywb.1.mq0w6tuyp9j4@prod2    |   File "/usr/local/lib/python3.8/site-packages/requests/api.py", line 115, in post
access_website_pywb.1.mq0w6tuyp9j4@prod2    |     return request("post", url, data=data, json=json, **kwargs)
access_website_pywb.1.mq0w6tuyp9j4@prod2    |   File "/usr/local/lib/python3.8/site-packages/requests/api.py", line 59, in request
access_website_pywb.1.mq0w6tuyp9j4@prod2    |     return session.request(method=method, url=url, **kwargs)
access_website_pywb.1.mq0w6tuyp9j4@prod2    |   File "/usr/local/lib/python3.8/site-packages/requests/sessions.py", line 587, in request
access_website_pywb.1.mq0w6tuyp9j4@prod2    |     resp = self.send(prep, **send_kwargs)
access_website_pywb.1.mq0w6tuyp9j4@prod2    |   File "/usr/local/lib/python3.8/site-packages/requests/sessions.py", line 701, in send
access_website_pywb.1.mq0w6tuyp9j4@prod2    |     r = adapter.send(request, **kwargs)
access_website_pywb.1.mq0w6tuyp9j4@prod2    |   File "/usr/local/lib/python3.8/site-packages/requests/adapters.py", line 547, in send
access_website_pywb.1.mq0w6tuyp9j4@prod2    |     raise ConnectionError(err, request=request)
access_website_pywb.1.mq0w6tuyp9j4@prod2    | requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
access_website_pywb.1.mq0w6tuyp9j4@prod2    | Respawned uWSGI worker 2 (new pid: 10872)
access_website_pywb.1.mq0w6tuyp9j4@prod2    | [pid: 10861|app: 0|req: 4766542/47623998] 10.0.0.2 () {66 vars in 1750 bytes} [Tue Sep 19 21:07:16 2023] GET /wayback/archive/20150412062026if_/https://www.facebook.com/plugins/like.php?href=http%3A%2F%2Fwww.snp.org%2Fblog%2Fpost%2F2012%2Ffeb%2Fscottish-independence-good-england&layout=button_count&show_faces=false&action=like&colorscheme=light&width=100&height=21&font&locale => generated 3596 bytes in 40299 msecs (HTTP/1.1 500) 2 headers in 85 bytes (3 switches on core 198)

Trace leads to...

https://github.com/webrecorder/pywb/blob/83b2113be2c2574ec120ba292006d706e3cc3d53/pywb/apps/rewriterapp.py#L739

...which indicates it's the CDX call/lookup
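To make the root exception concrete: RemoteDisconnected is what Python's http.client raises when the peer closes the connection before sending a status line, which would fit an upstream worker being killed mid-request. The following is only a minimal standard-library sketch of that condition; the throwaway local server and the function names are made up for illustration and are not part of pywb.

```python
import http.client
import socket
import threading


def serve_once_and_hang_up(listener):
    """Hypothetical stand-in for an upstream that dies mid-request."""
    conn, _ = listener.accept()
    conn.recv(65536)  # read the request...
    conn.close()      # ...then close without sending any response


def reproduce_remote_disconnected():
    """Return the RemoteDisconnected raised by the client, or None."""
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    server = threading.Thread(
        target=serve_once_and_hang_up, args=(listener,), daemon=True
    )
    server.start()

    client = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        client.request("GET", "/")
        client.getresponse()  # fails while reading the status line
    except http.client.RemoteDisconnected as exc:
        return exc
    finally:
        client.close()
        server.join(timeout=5)
        listener.close()
    return None
```

In the log above, urllib3 and requests then wrap this same exception as ProtocolError and ConnectionError respectively.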

There are a LOT of instances of URLs like that. Perhaps we need to add a limit?

anjackson commented 1 year ago

Yeah, there are some unbounded CDX calls. We can add a hard upper limit, e.g.

export UKWA_INDEX="${CDX_SERVER}?url={url}&closest={closest}&sort=closest&filter=!statuscode:429&filter=!mimetype:warc/revisit&limit=100000"

...which works okay. But it'd be good if it was a bit cleverer.
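As a rough sketch of the same idea in Python, the limit could be appended to (or replaced in) the index URL template programmatically. The helper name with_limit and the example host cdx.example.org are assumptions for illustration only; the {url} and {closest} placeholders are kept unescaped so the template can still be filled in later.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_limit(index_template: str, limit: int) -> str:
    """Return the template with exactly one limit=<limit> query parameter.

    Hypothetical helper: any existing limit is dropped and the new one is
    appended, leaving the other parameters in their original order.
    """
    parts = urlsplit(index_template)
    pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "limit"
    ]
    pairs.append(("limit", str(limit)))
    # Keep template braces and the filter syntax (!, :, /) unescaped.
    query = urlencode(pairs, safe="{}!:/")
    return urlunsplit(parts._replace(query=query))


# Example template mirroring the UKWA_INDEX value above (host is made up).
EXAMPLE_TEMPLATE = (
    "http://cdx.example.org/data?url={url}&closest={closest}&sort=closest"
    "&filter=!statuscode:429&filter=!mimetype:warc/revisit"
)
```

Calling with_limit(EXAMPLE_TEMPLATE, 100000) yields the template with &limit=100000 at the end. A fixed cap like this is blunt, which is why something cleverer would be preferable.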

anjackson commented 1 year ago

Weird, that had some odd consequences. It seems that pages like this:

https://www.webarchive.org.uk/wayback/archive/20230919145754/https://www.gov.uk/government/collections/horticultural-statistics

were redirected to the first (2013) version?! So more testing on BETA is needed!
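One possible explanation, not confirmed here, is that the limit is applied to the timestamp-ordered index before the closest-match sort, so later captures are cut off. The sketch below only models that assumed behaviour; pick_closest is a hypothetical function, and the capture timestamps other than the requested one are invented for the example.

```python
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    """Parse a 14-digit wayback-style timestamp."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def pick_closest(timestamps, target, limit=None):
    """Pick the capture closest to target.

    Assumption being modelled: captures are held in ascending timestamp
    order and the limit truncates that list BEFORE the closest match is
    chosen.
    """
    candidates = sorted(timestamps)
    if limit is not None:
        candidates = candidates[:limit]  # drops the most recent captures
    wanted = parse_ts(target)
    return min(candidates, key=lambda ts: abs(parse_ts(ts) - wanted))


# Illustrative captures only; the last one matches the requested timestamp.
captures = ["20130101000000", "20180601000000", "20230919145754"]
```

Under that assumption, pick_closest(captures, "20230919145754") finds the 2023 capture, while the same call with limit=1 can only return the 2013 one. Whether the real CDX server behaves this way would need checking on BETA.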

anjackson commented 1 year ago

Reminded that this doesn't work:

https://github.com/ukwa/ukwa-services/blob/aca25a9f6ecf0724da1a0379d9cf68bfc477a110/access/website/config/pywb/config.yaml#L18

anjackson commented 1 year ago

I think this is a case where we could use PyWB/Webrecorder's advice to work out what to do.