jsvine / waybackpack

Download the entire Wayback Machine archive for a given URL.
MIT License

Connection errors aren't caught #19

Closed Hunter-Github closed 7 years ago

Hunter-Github commented 7 years ago

requests.exceptions.ChunkedEncodingError is raised all too often.

Traceback (most recent call last):
  File "~/virt_env/bin/waybackpack", line 9, in <module>
    load_entry_point('waybackpack==0.3.2', 'console_scripts', 'waybackpack')()
  File "~/virt_env/lib/python2.7/site-packages/waybackpack/cli.py", line 88, in main
    root=args.root,
  File "~/virt_env/lib/python2.7/site-packages/waybackpack/pack.py", line 63, in download_to
    root=root
  File "~/virt_env/lib/python2.7/site-packages/waybackpack/asset.py", line 45, in fetch
    res = session.get(url)
  File "~/virt_env/lib/python2.7/site-packages/waybackpack/session.py", line 20, in get
    **kwargs
  File "~/virt_env/lib/python2.7/site-packages/requests/api.py", line 71, in get
    return request('get', url, params=params, **kwargs)
  File "~/virt_env/lib/python2.7/site-packages/requests/api.py", line 57, in request
    return session.request(method=method, url=url, **kwargs)
  File "~/virt_env/lib/python2.7/site-packages/requests/sessions.py", line 475, in request
    resp = self.send(prep, **send_kwargs)
  File "~/virt_env/lib/python2.7/site-packages/requests/sessions.py", line 617, in send
    r.content
  File "~/virt_env/lib/python2.7/site-packages/requests/models.py", line 741, in content
    self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes()
  File "~/virt_env/lib/python2.7/site-packages/requests/models.py", line 667, in generate
    raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))

jsvine commented 7 years ago

Thanks for flagging! Do you have an example URL where that raises this error?

Hunter-Github commented 7 years ago

Sure:

waybackpack -d BugExample --from-date 20130301190431 --to-date 20130301190431 http://www.reuters.com/finance/deals/

(Although I guess it may be somewhat non-deterministic: for instance, when I plugged the same date into the Web Archive manually, I got the archived page.)

jsvine commented 7 years ago

Thanks, and indeed quite strange. I'm having an experience similar to yours: when I visit the archive page for that link, I sometimes get data and other times an empty response. In terms of handling those errors, would you rather:

(a) waybackpack skip those snapshots, or

(b) retry up to x times, or

(c) follow some other behavior?

Also: @wumpus, any thoughts on what might be happening re. these Wayback Machine responses?
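
For illustration, options (a) and (b) could be combined roughly as in the sketch below. This is not waybackpack's code: `fetch_with_retries` and all of its parameters are hypothetical names, and the `fetch` callable stands in for whatever actually performs the request. The default `retry_on=(OSError,)` is an assumption chosen to keep the sketch free of third-party imports; with requests, one would more likely pass `requests.exceptions.ChunkedEncodingError` explicitly.

```python
import time


def fetch_with_retries(fetch, url, max_retries=3, delay=0.0,
                       retry_on=(OSError,), skip_on_failure=True):
    """Call fetch(url), retrying when a transient connection error occurs.

    Hypothetical helper, not part of waybackpack. `fetch` is any callable
    that takes a URL and returns the response body.
    """
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return fetch(url)
        except retry_on as exc:
            # Option (b): remember the error and try again, up to max_retries.
            last_error = exc
            if attempt < max_retries:
                time.sleep(delay)
    if skip_on_failure:
        # Option (a): give up on this snapshot and let the caller move on.
        return None
    raise last_error
```

A caller would treat a `None` result as "snapshot skipped"; passing `skip_on_failure=False` re-raises the last error once the retries are exhausted.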

Hunter-Github commented 7 years ago

The simplest option, IMO, would be to leave the decision to the user:

Rationale:

jsvine commented 7 years ago

First pass at handling this, here: https://github.com/jsvine/waybackpack/pull/20

Adds --ignore-errors flag. Though perhaps it should be --skip-errors?

Does this look/work as expected? Or were you thinking of another approach?
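
As a rough picture of how such a flag might be wired up, here is a minimal sketch. It is not the code from the linked pull request: `build_parser`, `download_all`, the `example` program name, and the `fetch` callable are all hypothetical, and only the `--ignore-errors` flag name comes from this thread.

```python
import argparse
import logging

logger = logging.getLogger("example")


def build_parser():
    # Hypothetical CLI; only the --ignore-errors name is taken from the thread.
    parser = argparse.ArgumentParser(prog="example")
    parser.add_argument("url")
    parser.add_argument(
        "--ignore-errors",
        action="store_true",
        help="Log errors raised while fetching a snapshot and continue.",
    )
    return parser


def download_all(urls, fetch, ignore_errors=False):
    """Fetch each URL; on error, either re-raise or log and skip it."""
    results = []
    for url in urls:
        try:
            results.append(fetch(url))
        except Exception as exc:
            if not ignore_errors:
                raise
            logger.warning("Skipping %s: %s", url, exc)
    return results
```

With the flag off, the first failure still stops the run, so the existing behavior stays the default and the user opts in to skipping.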

Hunter-Github commented 7 years ago

Haven't tried the test yet, but the changes look sound to me - https://github.com/jsvine/waybackpack/pull/20/commits/9603712201307f4410aa4b2440c0d81aee1f9298

jsvine commented 7 years ago

Merged, incorporated into v0.3.3, and pushed to PyPI!