salsadigitalauorg / merlin-framework

Merlin - migration framework
GNU General Public License v3.0

Add cache layer to Spider #74

Closed derklempner closed 4 years ago

derklempner commented 5 years ago

Description: The Spider should also be cache-enabled, so that trawling the site populates the cache ready for migration runs.

Proposed solution: Update the Spider to incorporate the cache classes (and the unique/duplicate URL check from #68).

derklempner commented 5 years ago

This introduces a cache_enabled option to the crawler config.

You can test this with a simple crawler config such as:

---
domain: http://cleanenergyregulator.gov.au

options:
  follow_redirects: true  # Allow internal redirects.
  ignore_robotstxt: true  # Ignore robots.txt rules around crawlability.
  maximum_total: 50       # Restrict total number of crawled URLs.
  concurrency: 5          # Restrict concurrent crawlers.
  rewrite_domain: true    # Standardises base domain.
  delay: 250              # Pause in ms.
  path_only: true         # Return only the path from the crawled URL.
  cache_enabled: true     # Caches crawled content and uses cache to build results.
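
To illustrate what cache_enabled is meant to achieve, here is a minimal sketch in Python of a cache-aware crawl loop. It is not Merlin's implementation. `MemoryCache`, `crawl`, and the `fetch` callable are hypothetical names, and the in-memory store and SHA-1 key scheme are assumptions made only for the example. The `cache_enabled` and `maximum_total` parameters mirror the option names in the config above, and the `seen` set stands in for the unique/duplicate URL check.

```python
import hashlib
from typing import Callable, Dict, Iterable, List, Optional


class MemoryCache:
    """Hypothetical in-memory cache keyed by a hash of the URL (assumption)."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    @staticmethod
    def _key(url: str) -> str:
        # Hash the URL so keys have a fixed length and no path separators.
        return hashlib.sha1(url.encode("utf-8")).hexdigest()

    def get(self, url: str) -> Optional[str]:
        return self._store.get(self._key(url))

    def put(self, url: str, content: str) -> None:
        self._store[self._key(url)] = content


def crawl(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    cache: MemoryCache,
    cache_enabled: bool = True,
    maximum_total: int = 50,
) -> List[str]:
    """Visit each unique URL once, reading from the cache when allowed."""
    seen = set()            # Duplicate check: skip URLs already visited.
    results: List[str] = []
    for url in urls:
        if len(results) >= maximum_total:
            break           # Respect the cap on total crawled URLs.
        if url in seen:
            continue
        seen.add(url)
        content = cache.get(url) if cache_enabled else None
        if content is None:
            content = fetch(url)          # Cache miss: fetch the page.
            if cache_enabled:
                cache.put(url, content)   # Populate the cache for later runs.
        results.append(url)
    return results
```

Under these assumptions, a second call to `crawl` with the same cache makes no further `fetch` calls, which is the behaviour the issue asks for: the crawl populates the cache so that later migration runs can reuse it.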

stooit commented 4 years ago

Fixed by #85