hartator / wayback-machine-downloader

Download an entire website from the Wayback Machine.

Downloading speed is too slow #138

Closed: Adilzade closed this issue 2 years ago

Adilzade commented 5 years ago

Hi, how can I boost the download speed? I need to download approximately 550,000 HTML files, and at the current rate it would take 13 days to finish. Is there any method that would solve this issue? Thanks.

Pikamander2 commented 5 years ago

Have you tried using the -c (concurrency) flag to change the number of simultaneous downloads? You could change it to 10 or 20 or something to see if it helps.

Example:

wayback_machine_downloader http://example.com -c 20

Eliteshare commented 5 years ago

I don't know if this is still relevant. I downloaded 675,000 mixed files, and it only took 3 hours. I also ran the same download on another system with concurrency set to 16, and it only saved me 15 minutes.
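For scale, the two figures reported in this thread can be compared with some quick arithmetic. This is only a rough sketch using the numbers quoted above (550,000 files in 13 days versus 675,000 files in 3 hours). It assumes the files are of comparable size and that both runs hit similar server conditions, neither of which the thread confirms.

```python
# Rough throughput comparison using only the figures quoted in this thread.
# Assumption: file sizes and server conditions are comparable across both runs.

SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR


def files_per_second(file_count: int, seconds: float) -> float:
    """Average download rate in files per second."""
    return file_count / seconds


# Original report: 550,000 files estimated at 13 days.
slow_rate = files_per_second(550_000, 13 * SECONDS_PER_DAY)

# This comment: 675,000 files in 3 hours.
fast_rate = files_per_second(675_000, 3 * SECONDS_PER_HOUR)

print(f"slow run: {slow_rate:.2f} files/s")
print(f"fast run: {fast_rate:.2f} files/s")
print(f"ratio:    {fast_rate / slow_rate:.0f}x")
```

That works out to roughly 0.5 files per second for the slow run against 62.5 files per second for the fast one. A gap that large suggests something other than the concurrency setting alone, which is consistent with concurrency 16 saving only 15 minutes.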

IllyaMoskvin commented 5 years ago

Generally speaking, it's good etiquette to crawl slowly. You want to avoid hurting the Internet Archive's servers by overloading them with too many requests for too much data in too little time. It could interfere with their normal operations, e.g. serving snapshots to actual humans via the Wayback Machine. If this happens too often, it might prompt them to take measures to block downloaders such as this one.

That said, I don't know the Internet Archive's stance on people mass-scraping their snapshots, nor the capabilities of their infrastructure. It might be the case that they are perfectly fine with all of this. But generally, you don't want your scrapers to put unusual load on people's servers. That kind of behavior can get you blocked, and in the long run, might contribute to anti-scraping legislation.
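To make "crawl slowly" concrete, here is a minimal sketch of one way to combine a small worker pool with a global cap on request rate. It is not how wayback_machine_downloader works internally, and the names `RateLimiter`, `polite_fetch_all`, and `fake_fetch`, as well as the chosen concurrency and interval values, are hypothetical and only for illustration. The Internet Archive's actual rate limits are not stated in this thread, so the numbers are placeholders.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


class RateLimiter:
    """Let at most one request start per `min_interval` seconds, across all threads."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = time.monotonic()

    def wait(self) -> None:
        # Reserve the next free time slot under the lock, then sleep outside it
        # so other threads can reserve their own slots in the meantime.
        with self._lock:
            now = time.monotonic()
            slot = max(now, self._next_allowed)
            self._next_allowed = slot + self.min_interval
        delay = slot - now
        if delay > 0:
            time.sleep(delay)


def polite_fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    concurrency: int = 4,       # placeholder value, not a recommendation
    min_interval: float = 0.5,  # placeholder: seconds between request starts
) -> List[str]:
    """Fetch every URL with bounded concurrency and a global start-rate cap."""
    limiter = RateLimiter(min_interval)

    def task(url: str) -> str:
        limiter.wait()
        return fetch(url)

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # pool.map preserves the input order of the URLs in its results.
        return list(pool.map(task, urls))


# Stand-in for a real HTTP request so the sketch runs without network access.
def fake_fetch(url: str) -> str:
    return f"fetched {url}"


demo_urls = [f"http://example.com/page{i}" for i in range(3)]
for line in polite_fetch_all(demo_urls, fake_fetch, concurrency=2, min_interval=0.1):
    print(line)
```

With this shape, raising `concurrency` no longer raises the load on the server beyond what `min_interval` allows. It only helps hide the latency of individual requests.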