jsvine / waybackpack

Download the entire Wayback Machine archive for a given URL.
MIT License

add --no-clobber/-nc option #55

Closed jeremybmerrill closed 2 years ago

jeremybmerrill commented 2 years ago

Mimicking wget, this adds a no-clobber option, so that files that already exist (and have non-zero size) won't be re-downloaded.
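As a rough illustration of the check described above (not the code in this PR), a minimal sketch might look like the following. The helper names `should_skip` and `write_unless_present` are hypothetical and do not come from waybackpack; the `fetch` callable stands in for whatever retrieves the snapshot body.

```python
import os


def should_skip(path):
    """Return True if `path` is an existing file with non-zero size.

    Hypothetical helper: zero-byte files are treated as missing, so a
    previously failed or empty download is retried.
    """
    return os.path.isfile(path) and os.path.getsize(path) > 0


def write_unless_present(path, fetch):
    """Write the bytes returned by `fetch()` to `path` unless no-clobber applies.

    Returns True if the file was written, False if it was skipped.
    `fetch` is only called when a download is actually needed.
    """
    if should_skip(path):
        return False
    # Create the parent directories if they don't exist yet.
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as f:
        f.write(fetch())
    return True
```

Treating empty files as absent means an interrupted run that left a zero-byte file behind gets another attempt on the next invocation.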

On the command line:

$ waybackpack http://www.whitehouse.gov/ --no-clobber -d ~/Downloads/dol-wayback --to-date 199803 --from-date 199801

INFO:waybackpack.pack: Fetching http://www.whitehouse.gov/ @ 19980215014716
INFO:waybackpack.pack: Writing to /Users/merrillj/Downloads/dol-wayback/19980215014716/www.whitehouse.gov/index.html

$ waybackpack http://www.whitehouse.gov/ --no-clobber -d ~/Downloads/dol-wayback --to-date 199803 --from-date 199801

Or in Python, download_to(..., no_clobber=True)

Tested in tests/test-download.py.

As a note, Jeremy: the dol.gov URL used in the existing test there returns only redirects with zero content (which confused the heck out of me), so it may be worth changing to a URL that returns content from the Wayback Machine.

jsvine commented 2 years ago

Thank you for this! Great suggestion/addition. And thanks for adding the test. Merging, though I will be removing the -nc short flag, since it's slightly inconsistent with the other CLI params.