Doesn't grab every directory

hartator / wayback-machine-downloader

Download an entire website from the Wayback Machine.


Doesn't grab every directory #36

Closed leilei- closed 8 years ago

leilei- commented 8 years ago

For purely personal nostalgia-browsing reasons, I ran this to grab a 97/2000 site (with an appropriate timestamp) that had lots and lots of files and many directories (and no dynamic content like a forum). Later versions of the site had their dynamic content captured excessively (as did the squatter's pages), so the downloaded directories seem to stop after a certain letter, even though I know there are directories after that letter.

Is there a workaround for this?

hartator commented 8 years ago

Do you mind sharing the url of the website?

If it's not on the Wayback Machine, though, it unfortunately won't be possible to retrieve.

(Sent by email in reply to leilei-'s comment of Friday, March 25, 2016.)

leilei- commented 8 years ago

Techtv.com, for example (tons of dynamic pages), makes it as far as "netcamu" when given a 2001 timestamp. The later directories like screensavers, siliconspin, youmadeit etc. have definitely been archived but aren't retrieved by the downloader unless explicitly given.

TechTV's a nice example because of the changes it went through that broke everything, in 2003 with the G4 merger and much later with the 2015 Esquire change, so there's a load of irrelevant dynamic kludge archived.

hartator commented 8 years ago

I'm not finding anything specific that would break the downloader. I am seeing a lot of redirections to http://www.g4techtv.com/, but that shouldn't be an issue with a supplied timestamp.

Can you share the exact command you are running and the https://web.archive.org URLs of some of the missing content?

leilei- commented 8 years ago

Try March 2001, which would be their third month as TechTV, with much of the content being imported from their ZDTV incarnation.

wayback_machine_downloader http://techtv.com --only screensavers --timestamp 20010325225358

This doesn't grab screensavers/ itself, only CGI scripts that refer to it as well as some graphics/ folder for it.

wayback_machine_downloader http://techtv.com/screensavers --timestamp 20010325225358

Grabs 3362 files in screensavers/
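The difference between the two commands suggests running the downloader once per subdirectory. Below is a minimal sketch that only builds those command lines; the helper name `build_commands` is hypothetical, and the only flag it uses is `--timestamp`, as shown in the commands above. It does not run anything itself.

```python
import shlex


def build_commands(base_url, directories, timestamp):
    """Return one downloader command line per subdirectory.

    Hypothetical helper: it only formats commands in the same shape as
    the second command in this thread (URL of the subfolder plus
    --timestamp); it does not execute them.
    """
    commands = []
    for directory in directories:
        url = f"{base_url.rstrip('/')}/{directory.strip('/')}"
        argv = ["wayback_machine_downloader", url, "--timestamp", str(timestamp)]
        commands.append(shlex.join(argv))
    return commands


if __name__ == "__main__":
    # Directory names are the ones mentioned earlier in this thread.
    for line in build_commands(
        "http://techtv.com",
        ["screensavers", "siliconspin", "youmadeit"],
        20010325225358,
    ):
        print(line)
```

Each printed line can then be run by hand, one directory at a time.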

hartator commented 8 years ago

Thanks for the feedback :)

I did a bit of digging. I've found that the Wayback Machine API we are using (http://web.archive.org/cdx/) does not report every file if the archive list is too long. About 500,000 files seems to be the maximum it can report.

http://web.archive.org/cdx/search/xd?url=http://techtv.com/ is ~89.1M and contains 472,946 files. However, it stops at the letter C (callforhelp/features/story/0) and contains only 1,708 screensaver files, whereas http://web.archive.org/cdx/search/xd?url=http://techtv.com/screensavers/\ is ~34.9M and contains 180,591 screensaver files.
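One rough way to notice this kind of truncation is to compare the top-level directories that appear in a listing against directories you already know exist. The sketch below assumes the listing has already been reduced to a list of original URLs; the function names are hypothetical and the sample input is illustrative, not real API output.

```python
from urllib.parse import urlsplit


def top_level_dir(url):
    """Return the first path segment of a URL, or '' for the site root."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return segments[0].lower() if segments else ""


def dirs_missing_from_listing(listed_urls, expected_dirs):
    """Return the expected directories that never show up in the listing."""
    seen = {top_level_dir(u) for u in listed_urls}
    return sorted(d for d in expected_dirs if d.lower() not in seen)


if __name__ == "__main__":
    # Illustrative listing that stops early, in the spirit of the report above.
    listing = [
        "http://techtv.com/",
        "http://techtv.com/callforhelp/features/story/0",
    ]
    expected = ["callforhelp", "screensavers", "siliconspin", "youmadeit"]
    print(dirs_missing_from_listing(listing, expected))
```

Any directory this reports is a candidate for a separate, subfolder-specific request.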

A workaround is to do what you have done: work directly with the subfolder in the URL, http://techtv.com/screensavers.

I don't know what the best approach to solving this is. We could probably make repeated archive list requests to get around the restriction, but that's a bit aggressive towards Archive.org.
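To make the idea of repeated requests a little gentler, one option is to request one subfolder at a time with a pause in between. This is only a sketch, not how the downloader works: `fetch_listings` and `listing_url` are hypothetical names, the trailing `/*` on the query URL is an assumption about the query form, the five-second delay is an arbitrary choice, and the HTTP call is passed in as `fetch` so the pacing logic stays separate from any network code.

```python
import time

# Endpoint as quoted in this thread.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/xd"


def listing_url(site, directory):
    """Build a per-directory listing URL.

    The trailing "/*" (match everything under the directory) is an
    assumption, not something confirmed in this thread.
    """
    return f"{CDX_ENDPOINT}?url={site.rstrip('/')}/{directory.strip('/')}/*"


def fetch_listings(site, directories, fetch, delay_seconds=5.0, sleep=time.sleep):
    """Fetch one listing per directory, pausing between requests.

    `fetch` is any callable that takes a URL and returns the response
    body; `sleep` is injectable so the pacing can be tested without waiting.
    """
    listings = {}
    for index, directory in enumerate(directories):
        if index > 0:
            sleep(delay_seconds)  # space out requests to the archive
        listings[directory] = fetch(listing_url(site, directory))
    return listings
```

In practice `fetch` could wrap `urllib.request.urlopen`, and the delay could be raised if the listings are large.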