sangaline / wayback-machine-scraper

A command-line utility and Scrapy middleware for scraping time series data from Archive.org's Wayback Machine.
http://sangaline.com/post/wayback-machine-scraper/
ISC License
423 stars, 74 forks

[Question] How to get latest crawl? #13

Open santoshbs opened 3 years ago

santoshbs commented 3 years ago

Is there a way to get only the most recent version (a single snapshot) of a full site crawl for a list of URLs?

sangaline commented 3 years ago

This isn't supported by the current API, but adding an -l/--latest option seems like a good feature request. In the meantime, you might be interested in scrapy-wayback-machine. That project provides the Scrapy middleware that this project uses under the hood, and it offers much more flexibility for customizing behavior.

I'll leave this open as a feature request.
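Until something like that exists, one possible workaround is to ask the Wayback Machine's CDX server for the most recent capture of each URL directly. The sketch below is not part of wayback-machine-scraper or scrapy-wayback-machine. The helper names (`build_latest_query`, `snapshot_url_from_rows`, `latest_snapshot_url`) are made up for illustration, and it assumes that filtering on status 200 is what you want. It resolves one snapshot per URL rather than crawling a whole site.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public CDX endpoint of the Wayback Machine.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_latest_query(url):
    """Build a CDX query asking for only the last capture of `url`."""
    params = {
        "url": url,
        "output": "json",            # rows as JSON arrays, header row first
        "fl": "timestamp,original",  # only the fields needed to build a link
        "filter": "statuscode:200",  # assumption: skip redirects and errors
        "limit": "-1",               # a negative limit returns the last N rows
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def snapshot_url_from_rows(rows):
    """Turn parsed CDX rows into a snapshot URL, or None if there are no captures."""
    if len(rows) < 2:  # empty response, or a header row with no data
        return None
    record = dict(zip(rows[0], rows[-1]))
    return "http://web.archive.org/web/{timestamp}/{original}".format(**record)


def latest_snapshot_url(url, timeout=30):
    """Hypothetical helper: fetch the CDX response and resolve the latest snapshot."""
    with urlopen(build_latest_query(url), timeout=timeout) as response:
        body = response.read().decode("utf-8")
    rows = json.loads(body) if body.strip() else []
    return snapshot_url_from_rows(rows)
```

Calling `latest_snapshot_url(u)` for each `u` in your list gives one archived URL per page (or `None` when nothing was captured). Those URLs could then be fed to an ordinary Scrapy spider as start URLs.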