jsvine / waybackpack

Download the entire Wayback Machine archive for a given URL.
MIT License
2.8k stars, 189 forks

Handle 503 when not from server overload #36

Closed by RayBB 3 years ago

RayBB commented 4 years ago

Currently, waybackpack cannot download all the snapshots for this URL: https://www.amazon.com/Art-Gathering-How-Meet-Matters/dp/1594634920

The problem is that some of the archived pages return a 503 (I assume because Amazon served its robot check when the page was captured), so the tool just gets stuck retrying them.

It would be good to add a flag that controls how 503s are handled: skip them, download the response anyway, or retry only X times. Any of these would help here.

Here's an example of the page that will always be 503: https://web.archive.org/web/20190506092829/https://www.amazon.com/Art-Gathering-How-Meet-Matters/dp/1594634920

Thanks for making this awesome project :)

jsvine commented 3 years ago

Thank you for raising this issue! A fix is pushed and now available in v0.3.6: it uses a maximum number of retries, which defaults to 3 but is configurable.
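
For readers who want to see what a retry cap looks like in practice, here is a minimal sketch. It is not waybackpack's actual code: `fetch_with_retries`, its `fetch` callable, and the choice to count `max_retries` as retries after the first attempt are all assumptions made for illustration.

```python
from typing import Callable, Optional, Tuple

# Hypothetical helper for illustration only; not waybackpack's implementation.
# `fetch` is any callable that takes a URL and returns (status_code, body).
def fetch_with_retries(
    fetch: Callable[[str], Tuple[int, bytes]],
    url: str,
    max_retries: int = 3,
) -> Optional[bytes]:
    """Return the body on success, or None once the retries are used up."""
    # Assumption: one initial attempt plus up to `max_retries` retries.
    for _ in range(1 + max_retries):
        status, body = fetch(url)
        if status < 400:
            return body  # success: stop retrying
    return None  # give up so the caller can skip this snapshot


# Usage with a stub that always answers 503, like the snapshot above.
def always_503(url: str) -> Tuple[int, bytes]:
    return 503, b""

result = fetch_with_retries(always_503, "https://example.invalid/snapshot")
print("skipped" if result is None else "downloaded")
```

Returning `None` instead of raising lets the caller move on to the next snapshot, which is the "skip after X tries" behaviour requested above.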