sangaline / wayback-machine-scraper

A command-line utility and Scrapy middleware for scraping time series data from Archive.org's Wayback Machine.
http://sangaline.com/post/wayback-machine-scraper/
ISC License
423 stars 74 forks

snapshot functionality for a full site at a given time? #15

Open DOSull opened 3 years ago

DOSull commented 3 years ago

Hi, thanks for an interesting and useful project, which has helped me make a start on reconstructing a site that would be valuable for a research project. I'm new to Scrapy, so this has been an interesting way to start learning about it.

I've been trying to make a snapshot of the whole site (or as much of it as the Wayback Machine holds) at a particular time, following the instructions here to set the from and to timestamps to the same value. However, when I do this, I only get a very incomplete snapshot of the site. If I widen the from/to range I get many more pages (but also a lot of snapshots I'm not interested in!).

I've looked at the logic in the filter_snapshots function and it all makes sense: essentially it keeps each snapshot before time_range[0] in a holding variable, initial_snapshot, and if the filtered_snapshots list is still empty when time_range[1] is reached, that held snapshot goes into the filtered_snapshots list as the only snapshot.
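To make sure I've understood it, here's roughly what I take that logic to be doing, as a simplified sketch. (The name filter_snapshots and the time_range argument come from the middleware; the snapshot format is reduced here to bare timestamp strings, so this is my reading of the behaviour, not the actual code.)

```python
def filter_snapshots(snapshots, time_range):
    """Sketch of my understanding: snapshots sorted by timestamp,
    time_range = (from_timestamp, to_timestamp)."""
    filtered_snapshots = []
    initial_snapshot = None
    for snapshot in snapshots:
        if snapshot < time_range[0]:
            # remember the most recent snapshot before the window opens
            initial_snapshot = snapshot
        elif snapshot <= time_range[1]:
            # snapshot falls inside the requested window
            filtered_snapshots.append(snapshot)
        else:
            break
    # if nothing fell inside the window, fall back to the last
    # snapshot taken before it
    if not filtered_snapshots and initial_snapshot is not None:
        filtered_snapshots.append(initial_snapshot)
    return filtered_snapshots
```

So with from == to, a URL should still yield the latest pre-window capture as a fallback, which is why I'm surprised the snapshot comes out so incomplete.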

Have you seen problems like this before? Possibly related: even if I expand the time range, some pages don't get picked up, and I have to re-run with a more specific URL to retrieve some subfolders of the site. The behaviour is consistent between runs, so I don't think it's timing out; it's just not crawling to those pages for some reason. I've tried setting DEPTH_LIMIT in __main__.py, and when I run the command line it echoes the setting back to me, but that doesn't seem to make any difference.
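For reference, DEPTH_LIMIT is a standard Scrapy setting (0 means unlimited), so I'd expect it to be honoured if it reaches the crawler. This is the shape of what I've been editing in __main__.py; the other key here is a placeholder of my own, not the project's actual config:

```python
# Sketch of the settings dict I've been modifying; only DEPTH_LIMIT is
# the change I care about, the rest is illustrative.
settings = {
    'DEPTH_LIMIT': 3,         # cap link-following at 3 hops from the start URL
    'ROBOTSTXT_OBEY': False,  # placeholder key, not from the project
}

# My understanding is that this dict is then handed to Scrapy, roughly:
#   from scrapy.crawler import CrawlerProcess
#   CrawlerProcess(settings).crawl(...)  # spider class/args elided
```

As I said, the value is echoed back at startup, so it seems to be reaching the settings; it just doesn't change which pages get crawled.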