hartator / wayback-machine-downloader

Download an entire website from the Wayback Machine.

Doesn't download all files #115

Open Telokis opened 6 years ago

Telokis commented 6 years ago

I'd like to link to #6 because it's related.

I can confirm that downloading https://web.archive.org/web/20150814160909/http://cadavre-exquis.dragodindons.com/ doesn't fetch the file http://cadavre-exquis.dragodindons.com/wp-content/themes/CadavreExquis/style.css at all. I even tried downloading ALL timestamps of the website, but find . -type f | grep css doesn't return a single match.

When I open this link (timestamp 20150814160909), I can browse the source code and see that it uses this style.css (timestamp 20150814160909cs_). When I try to open the link to that style.css, it shows this CSS instead (timestamp 20150923222038cs_), which is a different capture from the one I clicked on. Maybe the website is doing some sort of redirection.
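One way to check which captures of the stylesheet exist, and at which timestamps, is to ask the Wayback Machine's CDX server for them directly. The sketch below only builds the query URL and makes no network request. The helper name cdx_query_url is made up for illustration, and the parameters (url, output, fl) are assumed to behave as described in the public CDX server documentation.

```python
from urllib.parse import urlencode

# Public CDX search endpoint of the Wayback Machine.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(original_url, fields=("timestamp", "original", "statuscode")):
    """Build a CDX query URL listing captures of one original URL.

    Hypothetical helper for illustration; it is not part of
    wayback-machine-downloader.
    """
    params = [
        ("url", original_url),        # the original (non-archive) URL
        ("output", "json"),           # ask for JSON rows instead of plain text
        ("fl", ",".join(fields)),     # columns to return for each capture
    ]
    return CDX_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(cdx_query_url(
        "http://cadavre-exquis.dragodindons.com/wp-content/themes/CadavreExquis/style.css"
    ))
```

Opening the printed URL in a browser should list every capture the index knows about for that file. If 20150814160909 is not among the timestamps, that would be consistent with the archive serving the nearest capture instead, although this thread does not confirm it.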

What is strange is that when I run wayback_machine_downloader http://www.cadavre-exquis.dragodindons.com -d cadavre_all -c 200 -s, it creates 11,946 directories. Yet find ./cadavre_all/ -type f | grep css doesn't return a single match, and find ./cadavre_all/ -type f | grep html | wc -l returns only 92 matches (which seems very low).

Is there something I am completely missing here?

bobbytables commented 6 years ago

Yeah, I'm seeing this too. There are files that the Wayback Machine has that this tool doesn't pick up.

GuerrillaCoder commented 6 years ago

I am having this issue as well: files that are visible in the snapshot do not download.

monty369 commented 6 years ago

I have downloaded the site. Now how can I restore my website? Do I need WP Migration to upload those files? Any help?

monty369 commented 6 years ago

It will download all the files. You just need to press the "enter" key and it will start again. It's kind of laggy.

GuerrillaCoder commented 6 years ago

No, there are some files it just doesn't download. I can run it repeatedly and it will always miss the same files. I ended up coding my own script, which gets all the files. I don't use Ruby, so unfortunately I can't help with this bug here.

ellyjonez commented 6 years ago

I think it doesn't download all files because it reconstructs the URLs with the original URLs, not the wayback URLs. So if the assets are stored on wayback, it doesn't grab them, because it's expecting the assets to still be on your server.
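To make the distinction between original URLs and Wayback URLs concrete, here is a small sketch of how an archived-copy URL can be built from an original asset URL and a capture timestamp. It is not how wayback-machine-downloader works internally, and the hypothesis above is not confirmed in this thread. The function name wayback_url is made up for illustration. The cs_ modifier is the one seen in the stylesheet links earlier in this issue; id_ is assumed here to request the capture as originally archived, without the Wayback Machine's rewritten links.

```python
def wayback_url(timestamp, original_url, modifier="id_"):
    """Return the Wayback Machine URL for one capture of `original_url`.

    Hypothetical helper. `timestamp` is a 14-digit capture time such as
    "20150814160909"; `modifier` is an optional suffix such as "cs_" or
    "id_" (pass "" for the normal replay page).
    """
    return f"https://web.archive.org/web/{timestamp}{modifier}/{original_url}"


if __name__ == "__main__":
    print(wayback_url(
        "20150814160909",
        "http://cadavre-exquis.dragodindons.com/wp-content/themes/CadavreExquis/style.css",
        modifier="cs_",
    ))
```

A downloader that requests URLs of this form fetches the copy held by the archive, so it does not depend on the asset still being available on the original server.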

@GuerrillaCoder would you mind sharing your script? Is it on GitHub anywhere, or could you post a gist or something? I have tried a bunch of things to grab all the files and haven't had any luck so far. Thanks for any help!

scimax commented 6 years ago

I have the same issue. CSS and image files in particular are missing. @GuerrillaCoder would you mind sharing your script? What did you use instead of Ruby?

bricep commented 4 years ago

@GuerrillaCoder is your "websitedataconnector" repo the script you made to get around this? If not, can you upload it to GitHub and share it?