hartator / wayback-machine-downloader

Download an entire website from the Wayback Machine.

DO NOT USE unless you have a means of rate limiting yourself #281

Open jdimpson opened 6 months ago

jdimpson commented 6 months ago

The Wayback Machine is (rightfully) blocking bulk downloads that use too much bandwidth or make too many requests per second. As far as I can tell, this product does no rate limiting of its own, at least not by default and not in any of the README examples. As a result, the Internet Archive will soft-ban your IP address if you use this script on a website of any significant size.

It's irresponsible to leave this repository up without at least a warning in the documentation.
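
For readers wondering what "rate limiting yourself" could look like, here is a minimal Ruby sketch of a throttle that enforces a minimum delay between consecutive calls. The `Throttle` class and the interval value are hypothetical illustrations; they are not part of wayback_machine_downloader, and the thread does not state what request rate the Internet Archive actually tolerates.

```ruby
# Minimal throttle: keeps consecutive calls at least min_interval seconds apart.
# Hypothetical helper for illustration; not part of wayback_machine_downloader.
class Throttle
  def initialize(min_interval_seconds)
    @min_interval = min_interval_seconds
    @last_call = nil
  end

  # Sleeps just long enough to respect the minimum interval,
  # then runs the block and returns its value.
  def call
    now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    if @last_call
      wait = @min_interval - (now - @last_call)
      sleep(wait) if wait > 0
    end
    @last_call = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  end
end

# Demo without touching the network: each block stands in for one request.
# The 0.05 s interval is an arbitrary value chosen to keep the demo fast.
throttle = Throttle.new(0.05)
results = (1..3).map { |i| throttle.call { i * 2 } }
puts results.inspect
```

In a real download loop, the block passed to `call` would wrap each HTTP request, and the interval would likely need to be much larger than the demo value.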

tinyapps commented 6 months ago

See ShiftaDeband's fork (which contains the fixes mentioned in his PR) as well as issues #273 and #275.

Elmagenta commented 5 months ago

Sorry to bother, I'm pretty new to this. How can I actually use this fork instead of the master branch?

tinyapps commented 5 months ago

@Elmagenta: You'll need to have Ruby installed; then you can just download ShiftaDeband's fork as a ZIP file, unzip it, and run wayback_machine_downloader, which you'll find in the bin subdirectory.

flag-br commented 5 months ago

@tinyapps I'm also pretty new to this, and I couldn't follow your instructions. I have Ruby installed, and I had also installed the "original" wayback_machine_downloader via the macOS Terminal. Now, following your instructions, I downloaded the ZIP file and simply tried to run the binary file, but I get an error message:

/Users/flag/Downloads/wayback-machine-downloader-feature-httpGet/bin/wayback_machine_downloader:3:in `require_relative': cannot load such file -- /Users/flag/Downloads/wayback-machine-downloader-feature-httpGet/lib/wayback_machine_downloader (LoadError)
	from /Users/flag/Downloads/wayback-machine-downloader-feature-httpGet/bin/wayback_machine_downloader:3:in `<main>'

Could you give more details on how to proceed?

tinyapps commented 5 months ago

@flag-br: Sounds like you might've deleted (or not extracted) the included lib directory or its contents. After unzipping wayback-machine-downloader-feature-httpGet.zip, just cd into the bin subdirectory and run wayback_machine_downloader without deleting any of the other included files or folders. The directory structure should look like this:

.
├── Dockerfile
├── Gemfile
├── MIT-LICENSE.txt
├── README.md
├── Rakefile
├── bin
│   └── wayback_machine_downloader
├── lib
│   ├── wayback_machine_downloader
│   │   ├── archive_api.rb
│   │   ├── tidy_bytes.rb
│   │   └── to_regex.rb
│   └── wayback_machine_downloader.rb
├── test
│   └── test_wayback_machine_downloader.rb
└── wayback_machine_downloader.gemspec
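
The LoadError quoted earlier comes from the launcher's `require_relative`, which resolves the lib path relative to the script itself, so a missing or misplaced lib directory breaks it. As a rough sketch, the following Ruby snippet checks the current directory against the tree above and reports which files are absent. `missing_files` is a hypothetical helper, and the file list is copied from the listing above rather than from the project's own code.

```ruby
# Hypothetical helper: returns the expected files that are absent under root.
# The list mirrors the directory tree shown in this thread.
def missing_files(root)
  expected = [
    "bin/wayback_machine_downloader",
    "lib/wayback_machine_downloader.rb",
    "lib/wayback_machine_downloader/archive_api.rb",
    "lib/wayback_machine_downloader/tidy_bytes.rb",
    "lib/wayback_machine_downloader/to_regex.rb",
  ]
  expected.reject { |rel| File.file?(File.join(root, rel)) }
end

# Check the directory the script is run from.
root = Dir.pwd
missing = missing_files(root)
if missing.empty?
  puts "All expected files are present under #{root}"
else
  puts "Missing under #{root}:"
  missing.each { |rel| puts "  #{rel}" }
end
```

Run it from the top of the unzipped folder (the directory that contains bin and lib); anything it lists as missing would explain the LoadError.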

flag-br commented 5 months ago

@tinyapps Thank you very much, it worked! It ran normally, but the final product is practically the same as what I was getting before with the master branch version. The folder structure apparently reproduced correctly on my machine, but only 15 htm files were downloaded. To check, I ran wayback_machine_downloader with the --list option, and the answer is that there are 1116 htm files.

The command I'm using is (after cd to bin folder): wayback_machine_downloader https://jazzdiscogcorner.pagesperso-orange.fr/

This site is quite simple, just text and practically no images.

Am I doing something wrong?

tinyapps commented 5 months ago

@flag-br: Glad to hear it worked out. As for issues with a specific site, I'd recommend checking out the documentation and searching through the open and closed issues before posting a new issue.

eggplantedd commented 4 months ago

I'm being stupid here, but trying to run wayback_machine_downloader (type - file) in the bin directory gave me "not recognized as an internal or external command, operable program or batch file". Fresh Ruby install.

I had to run gem build wayback_machine_downloader.gemspec, then gem install the wayback_machine_downloader-2.3.2.gem file that was generated, and finally I could run wayback_machine_downloader from cmd in a working fashion. Any advice on what I was doing wrong?

CaptSolo commented 3 months ago

It would be great to have rate limiting added to this software. Without it archive.org is (rightfully) returning "Connection refused" errors.

P.S. It is good that there is a fork with fixes. Just wishing that the main repo of this software had those fixes too.

nico9julio commented 2 months ago

This patched version worked beautifully ...

For those who are on Windows and are not sure how to do it:

gem install wayback_machine_downloader

Replace the bin and lib folders in C:\Ruby33-x64\lib\ruby\gems\3.3.0\gems\wayback_machine_downloader-2.3.1 with those from the compressed file: https://github.com/ShiftaDeband/wayback-machine-downloader/archive/refs/heads/feature/httpGet.zip

irrdkwhattoput commented 1 month ago

Doesn't seem to work anymore... gives Net::ReadTimeout with #<TCPSocket:(closed)> (Net::ReadTimeout)
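
Timeouts and refused connections like the ones reported in this thread are commonly handled by retrying with an increasing delay. Below is a minimal Ruby sketch of that pattern. `with_backoff` and its parameters are hypothetical and not part of wayback_machine_downloader or the fork, and whether retrying helps at all depends on the cause: it will not lift a soft ban.

```ruby
require "net/http"

# Hypothetical helper: retries the block on transient network errors,
# doubling the wait after each failed attempt (exponential backoff).
def with_backoff(max_attempts: 5, base_delay: 1.0)
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue Net::ReadTimeout, Net::OpenTimeout, Errno::ECONNREFUSED => e
    # Give up and re-raise once the attempt budget is spent.
    raise if attempt >= max_attempts
    delay = base_delay * (2**(attempt - 1))
    warn "Attempt #{attempt} failed (#{e.class}); retrying in #{delay}s"
    sleep(delay)
    retry
  end
end

# Demo without touching the network: fail twice, then succeed.
result = with_backoff(base_delay: 0.01) do |attempt|
  raise Errno::ECONNREFUSED if attempt < 3
  "ok after #{attempt} attempts"
end
puts result # prints "ok after 3 attempts"
```

In practice the block would contain the actual HTTP request, and the base delay would be on the order of seconds rather than the short value used here to keep the demo quick.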