hartator / wayback-machine-downloader

Download an entire website from the Wayback Machine.

Not downloading all files (skips filenames with [ or ] symbols) #143

Open Panzerfuhrer opened 5 years ago

Panzerfuhrer commented 5 years ago

I have seen other reports of files not being downloaded. Here is mine.

I was trying to download ftp.blizzard.com manually, running wget on the separate folders, until I found this fantastic program. First I ran it with the default options, but after comparing the result against one of my manually downloaded directories I noticed that a few files were missing. I then retried with the -s option, which downloads every timestamp, but still had no luck.

In the folder I tested, the missing files are Pass_The_Bomb_Version_4[1].2.zip and Pumpkinhunt by Laundry [P].zip. These are the only two files with bracket symbols in their filenames.

This is the output in the terminal:

http://ftp.blizzard.com:80/pub/war3/maps/spotlight/Pumpkinhunt%20by%20Laundry%20[P].zip # bad URI(is not URI?): http://web.archive.org/web/20061206145254id_/http://ftp.blizzard.com:80/pub/war3/maps/spotlight/Pumpkinhunt%20by%20Laundry%20[P].zip
websites/ftp.blizzard.com/20061206145254/pub/war3/maps/spotlight/Pumpkinhunt by Laundry [P].zip was empty and was removed.

I think something happens because of the brackets. Maybe something can be changed so that the program does not care about which symbols are in a URL?

Hope this post will help you with improving this fantastic program!

iammeat commented 5 years ago

There was a modification made on another fork that may correct the problem you saw. It involved escaping the URI. I'll try it out and report back.
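To make the "escaping the URI" idea concrete, here is a minimal sketch of what such escaping could look like. It is written in Python purely for illustration and is not the code from the fork or from wayback_machine_downloader itself. The helper name `escape_url` and the choice of safe characters are assumptions, and whether the Wayback Machine serves the file under the escaped URL has not been verified here.

```python
from urllib.parse import quote

# Characters left untouched: RFC 3986 reserved characters that are valid
# outside the host part, plus "%" so existing escapes such as %20 survive.
# "[" and "]" are deliberately NOT listed, so they get percent-encoded.
# Assumption: the URL has no bracketed IPv6 host, which this would break.
SAFE_CHARS = ":/?#@!$&'()*+,;=%"


def escape_url(url):
    """Percent-encode characters such as [, ], ^ and spaces (hypothetical helper)."""
    return quote(url, safe=SAFE_CHARS)


if __name__ == "__main__":
    original = (
        "http://ftp.blizzard.com:80/pub/war3/maps/spotlight/"
        "Pumpkinhunt%20by%20Laundry%20[P].zip"
    )
    # The brackets become %5B and %5D; the existing %20 is left alone.
    print(escape_url(original))
```

With the URL from the report, the brackets are rewritten to %5B and %5D while the rest of the URL stays unchanged, which is the form a strict URI parser is more likely to accept.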

Panzerfuhrer commented 5 years ago

Great, thanks. Can you maybe link the fork?

adrgru commented 2 years ago

This issue still persists 2½ years later (tested on version 2.3.1). Trying the following simple command still yields the same result:

wayback_machine_downloader "http://ftp.blizzard.com:80/pub/war3/maps/spotlight/Pumpkinhunt%20by%20Laundry%20[P].zip"

Output:

Downloading http://ftp.blizzard.com:80/pub/war3/maps/spotlight/Pumpkinhunt%20by%20Laundry%20[P].zip to websites/ftp.blizzard.com:80/ from Wayback Machine archives.

Getting snapshot pages. found 1 snaphots to consider.

1 files to download:
http://ftp.blizzard.com:80/pub/war3/maps/spotlight/Pumpkinhunt%20by%20Laundry%20[P].zip # bad URI(is not URI?): "https://web.archive.org/web/20061206145254id_/http://ftp.blizzard.com:80/pub/war3/maps/spotlight/Pumpkinhunt%20by%20Laundry%20[P].zip"
websites/ftp.blizzard.com%3a80/pub/war3/maps/spotlight/Pumpkinhunt by Laundry [P].zip was empty and was removed.
http://ftp.blizzard.com:80/pub/war3/maps/spotlight/Pumpkinhunt%20by%20Laundry%20[P].zip -> websites/ftp.blizzard.com%3a80/pub/war3/maps/spotlight/Pumpkinhunt by Laundry [P].zip (1/1)

Download completed in 1.96s, saved in websites/ftp.blizzard.com:80/ (1 files)

The folder, however, remains empty.

The same issue also appears at least with URLs containing "^".
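Since brackets and "^" both trigger the failure, one way to estimate which files in a listing are at risk is to scan each URL's path for characters that RFC 3986 does not permit there. The following Python sketch does that. It rests on the assumption that the "bad URI" rejection follows RFC 3986's path rules, and the helper name `offending_chars` is hypothetical, not part of wayback_machine_downloader.

```python
import string
from urllib.parse import urlsplit

# Characters RFC 3986 allows in a path: unreserved, sub-delims, ":" and "@",
# plus "%" (start of an escape) and "/" (segment separator).
ALLOWED_IN_PATH = set(
    string.ascii_letters + string.digits + "-._~" + "!$&'()*+,;=" + ":@" + "%/"
)


def offending_chars(url):
    """Return the sorted, de-duplicated path characters outside the allowed set.

    Only the path is inspected; the query string and fragment are ignored.
    """
    path = urlsplit(url).path
    return sorted({ch for ch in path if ch not in ALLOWED_IN_PATH})


if __name__ == "__main__":
    url = (
        "http://ftp.blizzard.com:80/pub/war3/maps/spotlight/"
        "Pumpkinhunt%20by%20Laundry%20[P].zip"
    )
    print(offending_chars(url))
```

An empty result means the path contains nothing this check would flag; any returned characters are candidates for percent-encoding before the URL is handed to a strict parser.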

LostAccount commented 2 years ago

Related to wayback_machine_downloader version 2.3.1.

I think the square brackets [] are somehow treated as null by wayback_machine_downloader. A quick Google search turned up other cases, including one related to WebDAV, where software struggled to list files and directories with square brackets in their names.

This should be considered a bug.

As the output below shows, the tool reports that the file "was empty and was removed".

$ wayback_machine_downloader -d zzzz http://ftp.blizzard.com:80/pub/war3/maps/spotlight/Pumpkinhunt%20by%20Laundry%20[P].zip
Downloading http://ftp.blizzard.com:80/pub/war3/maps/spotlight/Pumpkinhunt%20by%20Laundry%20[P].zip to zzzz/ from Wayback Machine archives.

Getting snapshot pages. found 1 snaphots to consider.

1 files to download:
http://ftp.blizzard.com:80/pub/war3/maps/spotlight/Pumpkinhunt%20by%20Laundry%20[P].zip # bad URI(is not URI?): "https://web.archive.org/web/20061206145254id_/http://ftp.blizzard.com:80/pub/war3/maps/spotlight/Pumpkinhunt%20by%20Laundry%20[P].zip"
zzzz/pub/war3/maps/spotlight/Pumpkinhunt by Laundry [P].zip was empty and was removed.
http://ftp.blizzard.com:80/pub/war3/maps/spotlight/Pumpkinhunt%20by%20Laundry%20[P].zip -> zzzz/pub/war3/maps/spotlight/Pumpkinhunt by Laundry [P].zip (1/1)

The file should be downloadable, because I can download it manually, as the screenshot shows: [image]


I also tried the following, thinking that a regex matching only zip files might work as a workaround, but it did not:

wayback_machine_downloader -l -c 10 --only "/\.(zip)$/i" -d zzzz http://ftp.blizzard.com:80/pub/war3/maps/spotlight/ > log.txt

Attached is the log file: log.txt