jsvine / waybackpack

Download the entire Wayback Machine archive for a given URL.
MIT License

Inserting a sleep between each fetch request #71

Open Derek-Jones opened 5 months ago

Derek-Jones commented 5 months ago

I would like to be nice to the Wayback Machine and space out my requests.

An option to insert a delay of x seconds between fetching each page would allow me to reduce the load.

It looks like the Wayback Machine does have a rate limiter, which causes the current non-delayed fetch to grind to a halt.

jasonkarns commented 5 months ago

Would love to leverage tools like watch, but waybackpack doesn't exit with a proper status code, so it's difficult to script with sleep/watch or other utilities.

When waybackpack encounters an error, it still exits with a successful status code (0) instead of an error status, which is a bug, IMO.

Derek-Jones commented 5 months ago

I was thinking more along the lines of, say, calling time.sleep(NICE_INTERVAL) at the end of the for loop in the function download_to.
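Roughly the shape of what I have in mind, as a standalone sketch (NICE_INTERVAL and the URL list below are placeholders I've made up; in waybackpack the actual loop lives in download_to in pack.py):

import time

NICE_INTERVAL = 2  # seconds to pause between fetches; placeholder value

# Placeholder list of snapshot URLs; waybackpack derives its own list from
# the Wayback Machine's snapshot index.
snapshot_urls = [
    "https://web.archive.org/web/20080813080244/http://www.bsdstats.org/bt/cpus.html",
    "https://web.archive.org/web/20110725111836/http://www.bsdstats.org/bt/cpus.html",
]

for url in snapshot_urls:
    # fetch and save the snapshot here (waybackpack does this via asset.fetch)
    print("fetching", url)
    time.sleep(NICE_INTERVAL)  # pause before issuing the next request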

jsvine commented 5 months ago

Hi @Derek-Jones, and thanks for the suggestion. I've now added --delay X (in the CLI, and delay=X in download_to), available in v0.6.0. This adds a pause of X seconds between fetches. Let me know if it works for you.
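For example, adapting a typical invocation:

waybackpack http://www.bsdstats.org/bt/cpus.html -d tway --delay 5

would pause roughly five seconds between each fetch.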

And thanks for the note @jasonkarns. To clarify, are you saying that if waybackpack itself fails (i.e., throws a Python error), you don't get exit=0? That'd surprise me, and require one kind of debugging.

Or are you saying that when an asset fails to fetch, waybackpack ultimately completes with exit=0? If so, that seems, at least from my perspective, to be more of a user-expectations question. With voluminous fetches, the Wayback Machine can be expected to fail occasionally, and I wouldn't necessarily want to call the whole process a failure. But perhaps this could be configurable, so that if you did want any failed fetch to lead to exit=1, you could specify that.
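To sketch what that configurability might look like (purely hypothetical; no such option exists today), the CLI could count failed fetches and choose the exit status accordingly:

import sys

fail_on_error = True  # imagine this set by a hypothetical opt-in flag
failures = 0
results = [True, True, False]  # stand-in for per-asset fetch outcomes

for ok in results:
    if not ok:
        failures += 1

# Exit nonzero only if the user opted in and at least one fetch failed.
sys.exit(1 if (fail_on_error and failures > 0) else 0)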

Derek-Jones commented 5 months ago

Thanks for implementing this suggestion, and for doing it so quickly.

If I kick off waybackpack (see below), the 20th fetch appears to hang, and after some delay a variety of Python tracebacks appear.

Waiting, say, 10 minutes and rerunning produces the same behavior after fewer fetches (the --no-clobber option means that new, later pages are fetched).

>waybackpack http://www.bsdstats.org/bt/cpus.html -d tway --max-retries 5 --no-clobber
INFO:waybackpack.pack: Fetching http://www.bsdstats.org/bt/cpus.html @ 20080813080244
INFO:waybackpack.pack: Writing to tway/20080813080244/www.bsdstats.org/bt/cpus.html
# ... lines deleted
INFO:waybackpack.pack: Fetching http://www.bsdstats.org/bt/cpus.html @ 20110725111836
INFO:waybackpack.pack: Writing to tway/20110725111836/www.bsdstats.org/bt/cpus.html

INFO:waybackpack.pack: Fetching http://www.bsdstats.org/bt/cpus.html @ 20110911091237
Traceback (most recent call last):
  File "/home/derek/.local/lib/python3.10/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/home/derek/.local/lib/python3.10/site-packages/urllib3/util/connection.py", line 95, in create_connection
    raise err
  File "/home/derek/.local/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/derek/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/derek/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/home/derek/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/home/derek/.local/lib/python3.10/site-packages/urllib3/connection.py", line 363, in connect
    self.sock = conn = self._new_conn()
  File "/home/derek/.local/lib/python3.10/site-packages/urllib3/connection.py", line 179, in _new_conn
    raise ConnectTimeoutError(
urllib3.exceptions.ConnectTimeoutError: (<urllib3.connection.HTTPSConnection object at 0x7fe81bded780>, 'Connection to web.archive.org timed out. (connect timeout=None)')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/derek/.local/lib/python3.10/site-packages/requests/adapters.py", line 489, in send
    resp = conn.urlopen(
  File "/home/derek/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/home/derek/.local/lib/python3.10/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='web.archive.org', port=443): Max retries exceeded with url: /web/20110911091237/http://www.bsdstats.org/bt/cpus.html (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fe81bded780>, 'Connection to web.archive.org timed out. (connect timeout=None)'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/derek/.local/bin/waybackpack", line 8, in <module>
    sys.exit(main())
  File "/home/derek/.local/lib/python3.10/site-packages/waybackpack/cli.py", line 144, in main
    pack.download_to(
  File "/home/derek/.local/lib/python3.10/site-packages/waybackpack/pack.py", line 99, in download_to
    content = asset.fetch(session=self.session, raw=raw, root=root)
  File "/home/derek/.local/lib/python3.10/site-packages/waybackpack/asset.py", line 53, in fetch
    res = session.get(url)
  File "/home/derek/.local/lib/python3.10/site-packages/waybackpack/session.py", line 29, in get
    res = requests.get(
  File "/home/derek/.local/lib/python3.10/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/home/derek/.local/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/derek/.local/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/derek/.local/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/home/derek/.local/lib/python3.10/site-packages/requests/adapters.py", line 553, in send
    raise ConnectTimeout(e, request=request)
requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='web.archive.org', port=443): Max retries exceeded with url: /web/20110911091237/http://www.bsdstats.org/bt/cpus.html (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fe81bded780>, 'Connection to web.archive.org timed out. (connect timeout=None)'))
/home/web/blog/bsdstats>

jsvine commented 5 months ago

Ah, thanks for flagging. Looks like we need to handle ConnectTimeout (instead of just ConnectionError). Attempted fix now pushed in v0.6.1.
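The gist of the change, as a rough sketch rather than the actual code (the real retry handling lives in waybackpack's session.py; the function name and parameters here are made up for illustration):

import time
import requests

def get_with_retries(url, max_retries=5, pause=1):
    # Retry on connection errors, explicitly including connect timeouts,
    # instead of letting the exception propagate and kill the whole run.
    for attempt in range(max_retries):
        try:
            return requests.get(url)
        except (requests.exceptions.ConnectionError,
                requests.exceptions.ConnectTimeout):
            time.sleep(pause)
    return None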