pepecossio opened this issue 3 years ago
Figured some things out. This is a wiki being scraped, and the scraper is saving the wiki's HTML image-preview page as if it were the image file itself. When I open one of the files in a text editor, or change its extension to .html, it displays HTML code.
Is there a way you could add MediaWiki support so it grabs the real image files?
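You can quickly confirm which downloaded "images" are really HTML preview pages. A minimal sketch, assuming the tool's default output directory `./websites` (adjust the path to wherever your files landed):

```shell
# Flag files saved with an image extension whose contents are actually HTML.
# grep -l lists matching files; -I skips real binary images; -i is case-insensitive.
# "|| true" keeps the command from failing when the directory or matches are absent.
find websites -type f \( -iname '*.jpg' -o -iname '*.png' -o -iname '*.gif' \) \
  -exec grep -liI '<html' {} + 2>/dev/null || true
```

Any file this prints is an HTML page masquerading as an image, which matches the symptoms above.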
Looks like for this particular MediaWiki instance, the images are stored on a different domain (images.encyclopediadramatica.rs), so I suggest downloading the files from there instead.
Other MediaWiki instances, like Wikipedia, do the same thing but use different domains. As far as I know there is no standard way to discover which domain hosts the images, so this isn't really fixable in wayback_machine_downloader.
The only way to really do what you want properly is a proxy server that forwards all requests through archive.org and saves the responses locally, plus a browser plugin that automatically loads every URL of the domain that is on archive.org.
You can approximate this by archiving a domain and then grepping all downloaded files for links, but you will still miss URLs that are constructed and requested by JavaScript at runtime.
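That grep-then-fetch approach can be sketched roughly as below. The `websites/example.com` path stands in for wherever your archive landed, and the `2id_` shortcut (the `2` prefix asks the Wayback Machine for the newest snapshot, `id_` for the original bytes rather than the rewritten page) is an assumption; a full 14-digit timestamp plus `id_` definitely works:

```shell
# Pull absolute URLs out of already-downloaded pages, de-duplicate them,
# and rewrite each into a Wayback Machine URL for fetching.
grep -rhoE 'https?://[^"'\'' <>]+' websites/example.com 2>/dev/null \
  | sort -u \
  | sed 's|^|https://web.archive.org/web/2id_/|'
# Pipe the result into e.g. `wget -x -i -` to download the files.
```

As noted, this still misses anything only ever requested by JavaScript.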
I am scraping encyclopediadramatica.rs. The downloaded images and PDFs all fail with the same errors: "file not supported", "can't open file", and/or "not correctly decoded". All the files are around 20-24 kB. Is anyone else seeing this error? What gives? BTW, the archive is quite large, so if you want to check out this issue I suggest passing:
--only "/\.(gif|jpg|jpeg|mp4|ogg|ogv|webp|mp3|pdf|png|mov|webm|mkv|svg)$/i"
in order to grab just the media files.
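One note on that filter: the dot should be escaped (`\.`) so it matches a literal period rather than any character. You can sanity-check which paths such a filter keeps with an equivalent grep (`-E -i` mirrors the Ruby `/.../i` form; the sample paths here are made up):

```shell
# Same pattern as the --only filter, as a POSIX extended regex.
pattern='\.(gif|jpg|jpeg|mp4|ogg|ogv|webp|mp3|pdf|png|mov|webm|mkv|svg)$'
printf '%s\n' /images/a/ab/Logo.PNG /index.php?title=Main_Page /skins/common/main.css \
  | grep -iE "$pattern"
# → /images/a/ab/Logo.PNG
```

Only the media path survives; wiki pages and stylesheets are filtered out.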