jsvine / waybackpack

Download the entire Wayback Machine archive for a given URL.
MIT License

question: able to download a website historically while only saving the 1st successful page? #70

Open devinschumacher opened 7 months ago

devinschumacher commented 7 months ago

Any chance of getting a feature where we can download a site across a range of dates? For example, 2015 to today, to try every archived copy of a URL but only save the first successful download?

The use case is that I'm trying to get a website, but some pages are "blocked by Cloudflare" in certain versions on archive.org.

thanks!

jsvine commented 7 months ago

I don't think waybackpack currently supports this, but I would be open to a PR that adds it. One tricky bit might be defining the criteria for "successful", particularly if the HTTP status code does not make it clear.
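
As a rough illustration of the status-code part only, here is a minimal sketch that builds a query against the Wayback Machine's CDX API for snapshots in a date range, filtered to HTTP 200. It is not part of waybackpack; the helper names `build_cdx_query` and `snapshot_url` are hypothetical, and the sketch only constructs URLs without fetching anything.

```python
from urllib.parse import urlencode

# Public CDX endpoint of the Wayback Machine.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(url, from_date, to_date=None):
    """Return a CDX URL listing snapshots of `url` within a date range.

    Hypothetical helper, not a waybackpack function. Dates are
    timestamp prefixes such as "2015" or "20150101".
    """
    params = {
        "url": url,
        "from": from_date,
        "output": "json",
        # Keep only captures recorded with HTTP status 200.
        "filter": "statuscode:200",
        # Skip adjacent captures whose content digest is identical.
        "collapse": "digest",
    }
    if to_date:
        params["to"] = to_date
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def snapshot_url(timestamp, url):
    """Return the URL of one capture; `id_` asks for the unmodified page."""
    return f"https://web.archive.org/web/{timestamp}id_/{url}"


query = build_cdx_query("example.com", "2015")
```

A 200 filter narrows the candidates but does not settle the question: a block page served with status 200 still passes it.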

devinschumacher commented 7 months ago

Yeah, I was thinking the same thing about the criteria.

It would probably be a series of words/patterns that gets added to over time until it is reasonably comprehensive. There might be some signals in the HTML tags as well: I bet the meta title and description on pages like that would always give it away.

What I normally see are things like Cloudflare, Login, Too Many Requests, etc.
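
A minimal sketch of that idea, assuming the check only looks at the `<title>` and the meta description and uses just the three phrases mentioned above. None of this exists in waybackpack; `BLOCK_PATTERNS`, `looks_blocked`, and `first_successful` are hypothetical names, and the pattern list would need to grow over time.

```python
import re
from html.parser import HTMLParser

# Starting patterns taken from the phrases above; an assumption, not a vetted list.
BLOCK_PATTERNS = [
    re.compile(r"cloudflare", re.IGNORECASE),
    re.compile(r"\blog[\s-]?in\b", re.IGNORECASE),
    re.compile(r"too many requests", re.IGNORECASE),
]


class _HeadText(HTMLParser):
    """Collect the <title> text and the meta description of a page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            if (attr.get("name") or "").lower() == "description":
                self.parts.append(attr.get("content") or "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)


def looks_blocked(html):
    """Return True if the title or description matches a block pattern."""
    parser = _HeadText()
    parser.feed(html)
    head_text = " ".join(parser.parts)
    return any(rx.search(head_text) for rx in BLOCK_PATTERNS)


def first_successful(snapshots):
    """Return the first (timestamp, html) pair that does not look blocked.

    `snapshots` is an iterable of (timestamp, html) pairs in the order
    they should be tried. Returns None if every snapshot looks blocked.
    """
    for timestamp, html in snapshots:
        if not looks_blocked(html):
            return timestamp, html
    return None
```

Checking only the title and description keeps false positives lower than scanning the whole body, but a legitimate page whose title mentions Cloudflare or a login would still be rejected, so the patterns probably need to be configurable.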