jsvine / waybackpack

Download the entire Wayback Machine archive for a given URL.
MIT License

Uniques only #7

Closed by ErikBorra 8 years ago

ErikBorra commented 8 years ago

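Feature suggestions:

- only retrieve unique snapshots
- only retrieve snapshots closest to a particular set of dates (e.g. 1 July of each year)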

Keep up the good work.

jsvine commented 8 years ago

Thanks, @ErikBorra! Those are both really interesting ideas. Some thoughts/questions:

> only retrieve unique snapshots

I'm thinking that using this option would retain only the chronologically earliest copy. Yeah? And it'd look something like `waybackpack dol.gov -d ~/Downloads/dol-dot-gov --unique`? Or `--unique-only`?
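A minimal sketch of the "retain only the chronologically earliest copy" behaviour, assuming each snapshot is available as a `(timestamp, digest)` pair where the digest is a hash of the page content. The `Snapshot` shape and the `earliest_unique` name are hypothetical, for illustration only, and not part of waybackpack:

```python
from typing import Iterable, List, Tuple

# Hypothetical shape: (14-digit YYYYMMDDhhmmss timestamp, content digest).
Snapshot = Tuple[str, str]


def earliest_unique(snapshots: Iterable[Snapshot]) -> List[Snapshot]:
    """Keep only the chronologically earliest snapshot for each digest."""
    seen = set()
    kept = []
    # Fixed-width timestamps sort chronologically when compared as strings.
    for timestamp, digest in sorted(snapshots):
        if digest not in seen:
            seen.add(digest)
            kept.append((timestamp, digest))
    return kept
```

Later captures whose content matches an earlier one are dropped, so a page that never changed would yield a single snapshot.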

> only retrieve snapshots closest to a particular set of dates (e.g. 1 July of each year)

This is intriguing, but feels like the additional complexity might outweigh the added functionality. What do you think the logic would look like for this? And how would, e.g., "1 July of each year", be expressed as arguments on the command line?

ErikBorra commented 8 years ago

Hi @jsvine,

Yes to the first question.

As for the second, one could loop over years (from 1996 until the current year) and specify `YEAR0701000000` as the datestamp when calling the Wayback API. This way one can retrieve a single version per year, closest to 1 July (the Wayback Machine does the 'closest' match for you).
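The yearly loop could be sketched roughly as follows. This assumes "the Wayback API" means the availability endpoint at `https://archive.org/wayback/available`, which accepts `url` and `timestamp` query parameters; the function names are hypothetical. The code only builds the request URLs and leaves fetching to the caller:

```python
from datetime import date
from urllib.parse import urlencode

# Assumption: the availability endpoint is the API meant in the comment above.
AVAILABILITY_API = "https://archive.org/wayback/available"


def july_first_timestamps(first_year, last_year):
    """Return one YYYY0701000000 timestamp per year, inclusive of both ends."""
    return [f"{year}0701000000" for year in range(first_year, last_year + 1)]


def availability_urls(url, first_year=1996, last_year=None):
    """Build one availability query per year, each asking for the snapshot closest to 1 July."""
    if last_year is None:
        last_year = date.today().year
    return [
        f"{AVAILABILITY_API}?{urlencode({'url': url, 'timestamp': ts})}"
        for ts in july_first_timestamps(first_year, last_year)
    ]
```

For example, `availability_urls("dol.gov", 1996, 1998)` produces three URLs, one per year. Years with no nearby capture may return the same snapshot more than once, so the results would probably need deduplicating.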

And a third option: get one archived version per month, per 10 days, or per day by using `collapse=timestamp:6`, `collapse=timestamp:7`, or `collapse=timestamp:8` respectively.
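To illustrate what `collapse=timestamp:N` does, here is a hedged sketch. `cdx_query_url` builds a query against the CDX search endpoint, and `collapse_by_prefix` is a purely local imitation of the collapsing, assuming the server keeps the first of each run of adjacent captures whose timestamps share the same first N digits. Both function names are hypothetical:

```python
from itertools import groupby
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(url, digits):
    """Build a CDX query that collapses on the first `digits` timestamp digits."""
    params = {"url": url, "collapse": f"timestamp:{digits}"}
    # safe=":" keeps the colon readable instead of percent-encoding it.
    return f"{CDX_ENDPOINT}?{urlencode(params, safe=':')}"


def collapse_by_prefix(timestamps, digits):
    """Local imitation: keep the first timestamp of each adjacent run sharing a prefix."""
    return [
        next(group)
        for _, group in groupby(timestamps, key=lambda ts: ts[:digits])
    ]
```

With 6 digits the prefix is `YYYYMM` (one capture per month), with 8 it is `YYYYMMDD` (one per day), and with 7 it is `YYYYMMD`, which groups days 01-09, 10-19, 20-29, and 30-31, hence "roughly every 10 days".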

MechMykl commented 8 years ago

Re: Question 1 - Not sure how technically complex this would be, but if the script were to pull down the first complete copy of the site and then in subsequent folders pull down only files that are different, that would again be useful to my case of wanting a complete archive of my old sites.
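One way to picture this incremental idea is the sketch below. It is not something waybackpack is known to do, and it assumes each snapshot can be described by a hypothetical manifest mapping file paths to content digests:

```python
from typing import Dict, List

# Hypothetical shape: file path -> content digest for one snapshot.
Manifest = Dict[str, str]


def incremental_downloads(manifests: List[Manifest]) -> List[List[str]]:
    """For each snapshot, in order, list the paths that differ from the last fetched copy."""
    latest: Manifest = {}
    plan = []
    for manifest in manifests:
        # A path is fetched if it is new or its digest changed.
        changed = sorted(
            path for path, digest in manifest.items()
            if latest.get(path) != digest
        )
        plan.append(changed)
        latest.update(manifest)
    return plan
```

The first snapshot is fetched in full because nothing has been seen yet; each later folder would contain only new or changed files.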

ErikBorra commented 8 years ago

Re: Question 1 - this should be really simple: specify `showDupeCount=true` when calling the API.

Re: the third option, and in addition to Question 1: the `collapse` parameter can be used to filter further by month or day.

jsvine commented 8 years ago

Thanks again for these suggestions! Version 0.3.0, now on the develop branch, includes both these features, and moves the library away from Memento TimeMaps to the CDX search.

Along the lines of what you were hoping?

ErikBorra commented 8 years ago

@jsvine awesome!

The README should probably be updated to reflect these additions. Also, it may be good to provide some examples of the `collapse` parameter in the documentation (either as output from the script itself or in the wiki).

Cheers,

Erik

jsvine commented 8 years ago

Great! The changes haven't been merged into master yet; the new README can be found on the develop branch. It reflects the new additions and links to documentation for the collapse parameter.

jsvine commented 8 years ago

Now in master and pushed to PyPI.