Derek-Jones opened this issue 5 months ago
Would love to leverage tools like `watch`, but waybackpack doesn't exit with a proper status code, so it's difficult to script with `sleep`/`watch` or other utilities.
When waybackpack encounters an error, it still exits with a successful status code (`0`) instead of an error status, which is a bug, IMO.
I was thinking more along the lines of, say, calling `time.sleep(NICE_INTERVAL)` at the end of the for loop in the `download_to` function.
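The suggestion above might be sketched roughly as follows. This is a simplified stand-in, not waybackpack's actual code: `assets`, `fetch`, and `NICE_INTERVAL` are hypothetical names; only the placement of the sleep at the end of each loop iteration mirrors the idea.

```python
import time

NICE_INTERVAL = 1.5  # hypothetical: seconds to pause between fetches


def download_to(assets, fetch, delay=NICE_INTERVAL):
    """Fetch each asset, sleeping between requests to be polite.

    `assets` and `fetch` stand in for waybackpack's snapshot list and
    HTTP logic; only the sleep placement reflects the suggestion.
    """
    results = []
    for i, asset in enumerate(assets):
        results.append(fetch(asset))
        if i < len(assets) - 1:  # no need to sleep after the last fetch
            time.sleep(delay)
    return results
```

Passing `delay=0` disables the pause, which is handy when testing the loop itself.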
Hi @Derek-Jones, and thanks for the suggestion. I've now added `--delay X` (in the CLI, and `delay=X` in `download_to`), available in v0.6.0. This adds a pause of X seconds between fetches. Let me know if it works for you.
And thanks for the note @jasonkarns. To clarify, are you saying that if `waybackpack` itself fails (i.e., throws a Python error), you still get `exit=0`? That'd surprise me, and require one kind of debugging.
Or are you saying that when an asset fails to fetch, `waybackpack` ultimately completes with `exit=0`? If so, that seems, at least from my perspective, to be more of a user-expectations question. With voluminous fetches, the Wayback Machine can be expected to fail occasionally, and I wouldn't necessarily want to call the whole process a failure. But perhaps this could be configurable, so that if you did want any failed fetch to lead to `exit=1`, you could specify that.
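The configurable behavior described above could be sketched like this. The `fail_on_error` flag and `exit_status` helper are hypothetical illustrations, not waybackpack's actual API:

```python
def exit_status(failed_count, fail_on_error=False):
    """Decide the process exit status after a batch of fetches.

    By default, occasional Wayback Machine failures don't fail the run;
    with fail_on_error=True, any failed fetch yields a nonzero status.
    Both names are hypothetical, for illustration only.
    """
    if failed_count and fail_on_error:
        return 1
    return 0


# A CLI would then end with something like:
#     sys.exit(exit_status(failed, fail_on_error=args.fail_on_error))
```

With this shape, scripting tools like `watch` can distinguish a clean run from one with failed fetches, but only when the caller opts in.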
Thanks for adding this suggestion, and doing it so quickly.
If I kick off a `waybackpack` run (see below), the 20th fetch appears to hang, and after some delay a variety of Python tracebacks appear.
Waiting, say, 10 minutes and rerunning produces the same behavior after fewer fetches (the `--no-clobber` option means that only new, later pages are fetched).
```
>waybackpack http://www.bsdstats.org/bt/cpus.html -d tway --max-retries 5 --no-clobber
INFO:waybackpack.pack: Fetching http://www.bsdstats.org/bt/cpus.html @ 20080813080244
INFO:waybackpack.pack: Writing to tway/20080813080244/www.bsdstats.org/bt/cpus.html
# ... lines deleted
INFO:waybackpack.pack: Fetching http://www.bsdstats.org/bt/cpus.html @ 20110725111836
INFO:waybackpack.pack: Writing to tway/20110725111836/www.bsdstats.org/bt/cpus.html
INFO:waybackpack.pack: Fetching http://www.bsdstats.org/bt/cpus.html @ 20110911091237
Traceback (most recent call last):
  File "/home/derek/.local/lib/python3.10/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/home/derek/.local/lib/python3.10/site-packages/urllib3/util/connection.py", line 95, in create_connection
    raise err
  File "/home/derek/.local/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/derek/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/derek/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/home/derek/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/home/derek/.local/lib/python3.10/site-packages/urllib3/connection.py", line 363, in connect
    self.sock = conn = self._new_conn()
  File "/home/derek/.local/lib/python3.10/site-packages/urllib3/connection.py", line 179, in _new_conn
    raise ConnectTimeoutError(
urllib3.exceptions.ConnectTimeoutError: (<urllib3.connection.HTTPSConnection object at 0x7fe81bded780>, 'Connection to web.archive.org timed out. (connect timeout=None)')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/derek/.local/lib/python3.10/site-packages/requests/adapters.py", line 489, in send
    resp = conn.urlopen(
  File "/home/derek/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/home/derek/.local/lib/python3.10/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='web.archive.org', port=443): Max retries exceeded with url: /web/20110911091237/http://www.bsdstats.org/bt/cpus.html (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fe81bded780>, 'Connection to web.archive.org timed out. (connect timeout=None)'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/derek/.local/bin/waybackpack", line 8, in <module>
    sys.exit(main())
  File "/home/derek/.local/lib/python3.10/site-packages/waybackpack/cli.py", line 144, in main
    pack.download_to(
  File "/home/derek/.local/lib/python3.10/site-packages/waybackpack/pack.py", line 99, in download_to
    content = asset.fetch(session=self.session, raw=raw, root=root)
  File "/home/derek/.local/lib/python3.10/site-packages/waybackpack/asset.py", line 53, in fetch
    res = session.get(url)
  File "/home/derek/.local/lib/python3.10/site-packages/waybackpack/session.py", line 29, in get
    res = requests.get(
  File "/home/derek/.local/lib/python3.10/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/home/derek/.local/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/derek/.local/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/derek/.local/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/home/derek/.local/lib/python3.10/site-packages/requests/adapters.py", line 553, in send
    raise ConnectTimeout(e, request=request)
requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='web.archive.org', port=443): Max retries exceeded with url: /web/20110911091237/http://www.bsdstats.org/bt/cpus.html (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fe81bded780>, 'Connection to web.archive.org timed out. (connect timeout=None)'))
/home/web/blog/bsdstats>
```
Ah, thanks for flagging. Looks like we need to handle `ConnectTimeout` (instead of just `ConnectionError`). Attempted fix now pushed in v0.6.1.
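The retry shape implied by the fix can be sketched as a generic wrapper that catches the designated exception types and retries with backoff. This is an illustration, not waybackpack's implementation: in its case the `retryable` tuple would include the timeout errors seen in the traceback (e.g. `requests.exceptions.ConnectTimeout`), but here it is a parameter so the sketch stays dependency-free.

```python
import time


def fetch_with_retries(fetch, retryable, max_retries=5, backoff=0.0):
    """Call `fetch()`, retrying on the exception types in `retryable`.

    Sleeps backoff * 2**attempt seconds between attempts; re-raises the
    last exception once max_retries is exhausted.
    """
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except retryable:
            if attempt == max_retries:
                raise
            time.sleep(backoff * (2 ** attempt))  # exponential backoff
```

With `max_retries=5` this makes up to six attempts in total, matching the spirit of the `--max-retries 5` invocation above.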
I would like to be nice to the Wayback Machine and space out my requests.
An option to insert a delay of X seconds between fetching each page would allow me to reduce the load.
It looks like the Wayback Machine does have a rate limiter, which causes the current non-delayed fetching to grind to a halt.
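Client-side pacing of the kind requested here can be sketched as a small helper that enforces a minimum interval between successive requests. This is a generic sketch, not waybackpack's implementation; the `Throttle` name is hypothetical.

```python
import time


class Throttle:
    """Enforce a minimum interval (in seconds) between successive calls."""

    def __init__(self, interval):
        self.interval = interval
        self._last = None  # monotonic timestamp of the previous call

    def wait(self):
        """Block until at least `interval` seconds since the last wait()."""
        now = time.monotonic()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


# Usage sketch: create one Throttle per run and call wait() before
# each fetch, e.g. throttle = Throttle(2.0); ...; throttle.wait()
```

Unlike a fixed `time.sleep()` after every fetch, this only sleeps for the time remaining after the previous request, so slow fetches don't add unnecessary delay on top of their own duration.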