Closed — derklempner closed this 4 years ago
This introduces a `cache_enabled` option to the crawler config.
You can test this with a simple crawler config such as:
```yaml
---
domain: http://cleanenergyregulator.gov.au
options:
  follow_redirects: true   # Allow internal redirects.
  ignore_robotstxt: true   # Ignore robots.txt rules around crawlability.
  maximum_total: 50        # Restrict total number of crawled URLs.
  concurrency: 5           # Restrict concurrent crawlers.
  rewrite_domain: true     # Standardises base domain.
  delay: 250               # Pause in ms.
  path_only: true          # Return only the path from the crawled URL.
  cache_enabled: true      # Caches crawled content and uses cache to build results.
```
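The behaviour behind `cache_enabled` can be sketched roughly as follows. This is a minimal illustrative sketch, not the project's actual implementation: the `FileCache` class, the `fetch` function, and all names are hypothetical, assuming a simple file-based cache keyed by a hash of the URL.

```python
import hashlib
import os
import tempfile


class FileCache:
    """Hypothetical file-based cache keyed by a hash of the URL."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, url):
        # One file per URL, named by its SHA-256 digest.
        return os.path.join(self.directory, hashlib.sha256(url.encode()).hexdigest())

    def get(self, url):
        path = self._path(url)
        if os.path.exists(path):
            with open(path) as f:
                return f.read()
        return None

    def put(self, url, content):
        with open(self._path(url), "w") as f:
            f.write(content)


def fetch(url, cache=None, downloader=None):
    """Return page content, consulting the cache first when caching is enabled.

    `cache=None` models `cache_enabled: false`; a FileCache models `true`.
    `downloader` stands in for the real HTTP fetch.
    """
    if cache is not None:
        cached = cache.get(url)
        if cached is not None:
            return cached  # Cache hit: skip the network entirely.
    content = downloader(url)
    if cache is not None:
        cache.put(url, content)  # Populate the cache for later migration runs.
    return content
```

With this shape, a crawl run with `cache_enabled: true` both serves repeat URLs from disk and leaves the cache populated for subsequent migration runs.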
Fixed by #85
**Description**
The Spider should also be cache-enabled, so that trawling the site populates the cache ready for migration runs.

**Proposed solution**
Update the Spider to incorporate the cache class and the unique/duplicate URL check (#68).
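The unique/duplicate URL check mentioned above (#68) might look something like this. Again a hedged sketch: the `SeenUrls` class and `normalise` helper are illustrative names, assuming duplicates should be detected after basic URL normalisation (lowercased host, trailing slash stripped, fragment dropped).

```python
from urllib.parse import urlparse, urlunparse


def normalise(url):
    """Normalise a URL so trivially different forms count as duplicates."""
    parts = urlparse(url)
    return urlunparse((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path.rstrip("/") or "/",  # treat /page and /page/ as the same
        "",
        parts.query,
        "",  # drop the fragment
    ))


class SeenUrls:
    """Hypothetical tracker the Spider could use to skip already-crawled URLs."""

    def __init__(self):
        self._seen = set()

    def is_new(self, url):
        """Return True the first time a (normalised) URL is seen, False after."""
        key = normalise(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```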