sangaline / wayback-machine-scraper

A command-line utility and Scrapy middleware for scraping time series data from Archive.org's Wayback Machine.
http://sangaline.com/post/wayback-machine-scraper/
ISC License

Crashes (includes fix) #4

Closed: Cerno-b closed this issue 3 years ago

Cerno-b commented 6 years ago

Since I can't commit to your project, here are two fixes that I had to make in order to get the scraper to run:

In mirror_spider.py, line 50, there is no check whether the output path is valid. The URL can contain ? characters, which cause the script to crash.

Here's my solution. It's just a quick fix and may need extending for other characters and for Linux/Windows compatibility:

url_parts = response.url.split('://')[1].split('/')
parent_directory = os.path.join(self.directory, *url_parts)
parent_directory = parent_directory.replace("?", "_q_")  # normalize path

os.makedirs(parent_directory, exist_ok=True)
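For other characters, the same idea could be generalized along these lines. This is only a sketch: the helper name url_to_directory and the set of characters in UNSAFE_CHARS are assumptions for illustration, not part of the project, and it substitutes a plain underscore rather than the "_q_" marker used above.

```python
import os
import re

# Characters that Windows rejects in file names; '?' is the one reported
# above. This set is an assumption, not taken from the project.
UNSAFE_CHARS = re.compile(r'[<>:"|?*\\]')


def url_to_directory(base_directory, url):
    """Map a URL onto a directory path under base_directory (hypothetical helper)."""
    # Drop the scheme, then split the remainder into path components.
    url_parts = url.split('://', 1)[-1].split('/')
    # Replace unsafe characters in each component and skip empty components.
    safe_parts = [UNSAFE_CHARS.sub('_', part) for part in url_parts if part]
    return os.path.join(base_directory, *safe_parts)
```

Sanitizing each component before joining keeps the replacement away from the separators that os.path.join inserts, so the base directory itself is never altered.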

There is another bug that causes a crash in your other project, scrapy_wayback_machine, which is imported here.

It's in __init__.py, line 91:

cdx_url = self.cdx_url_template.format(url=pathname2url(request.url))

At this point, request.url is something like http://website.com. But pathname2url looks for a colon (:) and requires that anything before it is a single letter, since it expects regular file paths such as C:\mypath.

When I removed the call to pathname2url, it worked for me, but I don't know which other cases may break:

cdx_url = self.cdx_url_template.format(url=request.url)
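If the URL still needs to be escaped before it goes into the query string, one possible alternative is urllib.parse.quote, which has no drive-letter handling. This is a sketch only: the template string and the build_cdx_url name are assumptions for illustration, and this is not necessarily how the middleware handles it.

```python
from urllib.parse import quote

# Hypothetical template for illustration; the real one lives in the middleware.
CDX_URL_TEMPLATE = 'https://web.archive.org/cdx/search/cdx?url={url}&output=json'


def build_cdx_url(request_url):
    """Percent-encode request_url and insert it into the template."""
    # With safe='', quote() also encodes ':' and '/', and it never treats
    # the text before ':' as a drive letter the way pathname2url can.
    return CDX_URL_TEMPLATE.format(url=quote(request_url, safe=''))
```

With this approach, http://website.com is inserted as http%3A%2F%2Fwebsite.com on every platform.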

sangaline commented 3 years ago

This should be resolved between #14 and sangaline/scrapy-wayback-machine#2. Please let me know if you're still running into any issues.