hartator / wayback-machine-downloader

Download an entire website from the Wayback Machine.

Doesn't properly work anymore #275

Open caiot5 opened 10 months ago

caiot5 commented 10 months ago

I used to use wayback-machine-downloader quite a lot, but it no longer seems to work properly. I think the reason it can't download content anymore is a connection-throttling mechanism that archive.org seems to have implemented, which you can infer from the 'connection refused' error in the log below:

http://www.ig.com.br:80/home/editorial/stories/editorial_body/0,1205,254060,00.html # Failed to open TCP connection to web.archive.org:443 (Connection refused - connect(2) for "web.archive.org" port 443)
websites/www.ig.com.br/home/editorial/stories/editorial_body/0,1205,254060,00.html was empty and was removed.

To me it looks like the individual TCP connections need to be established more slowly to avoid triggering the throttling. Is there anything we can do to delay those connections?
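One way to picture "delaying those connections" is a small retry wrapper that pauses a little longer after each refused connection. This is only a sketch: `with_throttle_retry`, `max_attempts`, `base_delay`, and `sleeper` are hypothetical names and are not part of wayback-machine-downloader, and it assumes the refused connection surfaces as `Errno::ECONNREFUSED`, as the "Connection refused - connect(2)" message in the log suggests.

```ruby
# Hypothetical helper, NOT part of wayback-machine-downloader.
# Runs the given block and, if the connection is refused, waits and tries
# again with a linearly growing pause (base_delay, 2 * base_delay, ...).
# The sleeper parameter exists only so the pause can be replaced in tests.
def with_throttle_retry(max_attempts: 5, base_delay: 3, sleeper: ->(s) { sleep(s) })
  attempts = 0
  begin
    attempts += 1
    yield
  rescue Errno::ECONNREFUSED
    # Give up and re-raise the original error once the budget is spent.
    raise if attempts >= max_attempts
    sleeper.call(base_delay * attempts)
    retry
  end
end

# Assumed usage: wrap whatever code opens the connection, for example
#   with_throttle_retry { Net::HTTP.get(URI(snapshot_url)) }
# where snapshot_url is a placeholder for the archived page URL.
```

Whether a 3-second base delay is enough for archive.org's throttling is a guess; the right values would have to be found by experiment.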

rustam commented 10 months ago

Please take a look at this thread: https://github.com/hartator/wayback-machine-downloader/issues/273#issuecomment-1886612201

caiot5 commented 10 months ago

please give a look for this thread #273 (comment)

Thanks for that. I'm using this workaround right now and it works great! I think it needs to be merged upstream 'cause (for now) wayback-machine-downloader is useless without this 'mod'.

caiot5 commented 10 months ago

It would be really nice if the workaround skipped the 'sleep 3' when the file already exists.
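The idea could look roughly like the sketch below: check for the local file first and pause only when a network request is actually about to happen. `download_with_pause`, `pause_seconds`, and `sleeper` are hypothetical names for illustration. The actual workaround in #273 may be structured differently, so this would need adapting to wherever its `sleep 3` sits.

```ruby
# Hypothetical sketch, NOT the code from the #273 workaround.
# Skips both the pause and the download when local_path already exists;
# otherwise pauses, then yields so the caller can fetch and write the file.
def download_with_pause(local_path, pause_seconds: 3, sleeper: ->(s) { sleep(s) })
  # Nothing to fetch, so there is no reason to wait.
  return :skipped if File.exist?(local_path)

  # Throttle only the requests that really reach archive.org.
  sleeper.call(pause_seconds)
  yield local_path
  :downloaded
end
```

With this ordering, re-running a partially completed download would spend the pause only on the files that are still missing.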