hartator / wayback-machine-downloader

Download an entire website from the Wayback Machine.

Corrupted Image Files #189

Open pepecossio opened 3 years ago

pepecossio commented 3 years ago

I am scraping encyclopediadramatica.rs. All of the downloaded images and PDFs fail to open with the same kinds of errors: "file not supported", "can't open file", and/or "not correctly decoded". Every file is around 20-24 KB. Is anyone else having this error? What gives? BTW, the archive is quite large, so if you want to check out this issue I suggest passing the following in order to grab just the images:

--only "/\.(gif|jpg|jpeg|mp4|ogg|ogv|webp|mp3|pdf|png|mov|webm|mkv|svg)$/i"
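For anyone who wants to sanity-check an extension filter like this before starting a long download, here is a minimal sketch. It uses Python's `re` module as a stand-in for the Ruby-style regex that `--only` accepts, with the dot escaped. The sample URLs and the name `is_media_url` are made up for illustration. As far as I can tell, a filter like this looks only at the URL and not at what the server actually returned.

```python
import re

# Python stand-in for a Ruby-style filter such as /\.(gif|jpg|...)$/i.
# The extension list mirrors the one in the command above.
MEDIA_RE = re.compile(
    r"\.(gif|jpg|jpeg|mp4|ogg|ogv|webp|mp3|pdf|png|mov|webm|mkv|svg)$",
    re.IGNORECASE,
)


def is_media_url(url: str) -> bool:
    """Return True if the URL ends in one of the listed media extensions."""
    return MEDIA_RE.search(url) is not None


if __name__ == "__main__":
    # Hypothetical sample URLs, for illustration only.
    samples = [
        "https://example.org/images/a/ab/Photo.JPG",
        "https://example.org/docs/paper.pdf",
        "https://example.org/index.php",
        "https://example.org/wiki/File:Photo.jpg",
    ]
    for url in samples:
        print(is_media_url(url), url)
```

Note that the last sample, a wiki page whose URL happens to end in `.jpg`, passes the filter as well, because only the end of the URL is tested.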

pepecossio commented 3 years ago

Figured some things out. The site being scraped is a wiki, and the scraper is saving the wiki's image preview page as if it were the image file itself. When I open one of these files in a text editor, or change its extension to .html, it shows HTML code.
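To find out how many downloaded files are affected without opening them one by one, something like the following sketch could be used. It walks a download directory and flags files that have a media extension but start with HTML markup. The function names, the extension subset, and the 512-byte sniffing window are my own choices, not part of wayback_machine_downloader, and the check is only a heuristic.

```python
import sys
from pathlib import Path
from typing import List

# Assumed subset of extensions that should hold binary media, not markup.
MEDIA_EXTENSIONS = {".gif", ".jpg", ".jpeg", ".png", ".webp", ".pdf"}


def looks_like_html(data: bytes) -> bool:
    """Heuristic: True if the leading bytes look like an HTML document."""
    head = data[:512].lstrip().lower()
    return head.startswith((b"<!doctype html", b"<html"))


def find_html_masquerading_as_media(root: str) -> List[Path]:
    """Return media-named files under root whose content starts like HTML."""
    suspects = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in MEDIA_EXTENSIONS:
            with path.open("rb") as fh:
                if looks_like_html(fh.read(512)):
                    suspects.append(path)
    return suspects


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: python find_fake_media.py <download-directory>")
    for suspect in find_html_masquerading_as_media(sys.argv[1]):
        print(suspect)
```

A real PNG starts with the bytes `\x89PNG` and a PDF with `%PDF`, so neither is flagged. Only files whose first non-blank bytes are an HTML doctype or `<html` tag are reported.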

Is there a way you can add MediaWiki support so it grabs the real image files?

pabs3 commented 3 years ago

Looks like for this particular MediaWiki instance, the images are stored on a different domain (images.encyclopediadramatica.rs) so I suggest that you download the files from there instead.
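To see what the Wayback Machine actually holds for the image domain before downloading anything, one option is to ask its CDX API for a listing. The sketch below only builds the query URL and the per-capture download URL and makes no network requests. The function names are hypothetical, and the choice of filters (images only, HTTP 200 only, one row per URL) is an assumption about what is wanted here.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(host: str) -> str:
    """Build a CDX API URL that lists archived image captures for one host."""
    params = [
        ("url", host),
        ("matchType", "host"),            # this host only, any path
        ("output", "json"),
        ("fl", "timestamp,original"),     # fields to return per capture
        ("filter", "mimetype:image/.*"),  # assumed: only image captures
        ("filter", "statuscode:200"),     # assumed: skip redirects and errors
        ("collapse", "urlkey"),           # one row per distinct URL
    ]
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def raw_capture_url(timestamp: str, original: str) -> str:
    """URL of an archived capture; the id_ suffix requests the unmodified bytes."""
    return f"https://web.archive.org/web/{timestamp}id_/{original}"


if __name__ == "__main__":
    print(build_cdx_query("images.encyclopediadramatica.rs"))
```

Each row of the JSON response can then be passed to `raw_capture_url` to get a direct link to the stored file.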

Other MediaWiki instances like Wikipedia also do this, but use different domains. I think there is no standard way to figure out what the right domain is for the images, so this isn't really possible to fix in wayback_machine_downloader.

The only way to really do what you want properly is a browser plugin or proxy server that forwards all requests through archive.org and saves the results locally, plus a browser plugin to automatically load all URLs of a domain that are on archive.org.

You can do something similar to this by archiving a domain and then grepping all of the downloaded files for links, but you will still miss URLs that are constructed by JavaScript at runtime and then requested.
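A rough sketch of that grep step, assuming the downloaded files sit in a local directory: it scans every file for absolute or protocol-relative URLs and keeps those on a given domain or its subdomains. The regex is deliberately simple, the names are hypothetical, and protocol-relative links are assumed to be reachable over https. As noted above, it cannot see URLs that JavaScript assembles at runtime.

```python
import re
import sys
from pathlib import Path
from typing import Set
from urllib.parse import urlparse

# Absolute (http/https) or protocol-relative (//host/...) URLs.
# The lookbehind stops "//" from matching inside other schemes such as ftp://.
URL_RE = re.compile(r"""(?:https?:|(?<![:/]))//[^\s"'<>)]+""", re.IGNORECASE)


def extract_urls(text: str, domain: str) -> Set[str]:
    """Return URLs in text whose host is domain or one of its subdomains."""
    found = set()
    for match in URL_RE.findall(text):
        match = match.rstrip(".,;")
        # Assumption: treat protocol-relative links as https.
        url = match if match.lower().startswith("http") else "https:" + match
        host = urlparse(url).hostname or ""
        if host == domain or host.endswith("." + domain):
            found.add(url)
    return found


def collect_urls(root: str, domain: str) -> Set[str]:
    """Scan every file under root and merge the URLs found in each."""
    urls: Set[str] = set()
    for path in Path(root).rglob("*"):
        if path.is_file():
            text = path.read_text(encoding="utf-8", errors="ignore")
            urls |= extract_urls(text, domain)
    return urls


if __name__ == "__main__":
    if len(sys.argv) != 3:
        sys.exit("usage: python collect_urls.py <download-directory> <domain>")
    for url in sorted(collect_urls(sys.argv[1], sys.argv[2])):
        print(url)
```

Because subdomains are included, running this against the wiki's main domain would also surface links to a separate image host under the same domain.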