Closed derklempner closed 5 years ago
The latest changes in the branch feature/spatie-fetch-content introduce customisable Fetchers for content retrieval, a local disk-based cache, and the new options that go with them.
This is still WIP. Here is a working config that shows the new fetch_options. If you want to try out the JS stuff, you will need to update/install the node modules to get puppeteer, which drives headless Chrome.
---
domain: http://cleanenergyregulator.gov.au
urls:
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=659'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=667'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=669'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=669#something-something'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=666'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=668#anchor-me'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=665#Anchor-Me'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=665#Anchor-Me-Duplicate'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=665#Anchor-Me-Duplicate-Another'
url_options:
  # Default false
  include_query: true
  # Default false
  include_fragment: true
  # Default true
  find_content_duplicates: true
  # Default '//body'
  hash_selector: '//body'
  # Default script, comment, style, input, head
  hash_exclude_nodes:
    - '//script'
    - '//comment()'
    - '//style'
    - '//input'
    - '//head'
fetch_options:
  # Default 10
  concurrency: 10
  # Delay between requests, default 100 milliseconds
  delay: 100
  # Cache content (and use previously cached content), default true
  cache_enabled: false
  # Cache storage root dir (path created if doesn't exist), default /tmp/merlin_cache
  cache_dir: '/tmp/merlin_cache'
  # Fetcher class, default \Migrate\Fetcher\Fetchers\SpatieCrawler\FetcherSpatieCrawler
  fetcher_class: '\Migrate\Fetcher\Fetchers\SpatieCrawler\FetcherSpatieCrawler'
  #fetcher_class: '\Migrate\Fetcher\Fetchers\Curl\FetcherCurl'
  # Deprecated:
  # fetcher_class: '\Migrate\Fetcher\Fetchers\RollingCurl\FetcherRollingCurl'
  # Execute on-load JS, default false.
  # Currently only available if using the FetcherSpatieCrawler fetcher class
  execute_js: false
  # Follow redirects, default true.
  follow_redirects: true
  # Ignore SSL errors, default false.
  ignore_ssl_errors: true
  # User Agent, default: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36"
  user_agent: 'Merlin'
  # Timeouts. When using execute_js, you want to have reasonably long timeouts.
  # Not all timeouts are applicable to all Fetchers.
  timeouts:
    connect_timeout: 15
    timeout: 60
    # FetcherSpatieCrawler only
    read_timeout: 30
entity_type: news
mappings:
  -
    field: alias
    type: alias
  -
    field: title
    selector: '.contentPage-content-area h1'
    # selector: '.this-class-isnt-found-eh'
    type: text
    processors:
      whitespace: { }
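To make the effect of include_query and include_fragment on the URL list above easier to picture, here is a rough sketch in Python (not Merlin's PHP code). It assumes the two options decide whether the query string and the fragment are kept when URLs are compared; the function names normalise_url and unique_urls are made up for this illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def normalise_url(url: str, include_query: bool = False,
                  include_fragment: bool = False) -> str:
    """Drop the query string and/or fragment unless the matching option is on.

    Assumption: this mirrors what include_query / include_fragment are
    meant to do; it is not taken from the Merlin source.
    """
    parts = urlsplit(url)
    query = parts.query if include_query else ""
    fragment = parts.fragment if include_fragment else ""
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, fragment))


def unique_urls(urls, include_query=False, include_fragment=False):
    """Return normalised URLs in first-seen order, without repeats."""
    seen = {}
    for url in urls:
        seen.setdefault(normalise_url(url, include_query, include_fragment), None)
    return list(seen)
```

Under that assumption, the three ItemId=665 entries in the config collapse to a single URL unless include_fragment is on, and every NewsItem.aspx entry collapses to the bare path unless include_query is on.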
Description
Replacing RollingCurl with Spatie Crawler facilitates the implementation of caching (#41) and enables JavaScript execution on fetch. This means that cached versions of content can contain the on-load rendered view of any JS-enabled pages rather than just the plain HTML DOM.
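As a rough picture of the disk cache idea, here is a minimal sketch in Python (not the project's PHP). It assumes cached bodies are stored as files named by a hash of the URL under cache_dir; DiskCache, fetch_with_cache and the two-character shard directory are hypothetical choices for this example, not Merlin's actual layout.

```python
import hashlib
from pathlib import Path
from typing import Callable, Optional


class DiskCache:
    """Store fetched bodies as files named by the SHA-256 of their URL."""

    def __init__(self, cache_dir: str, enabled: bool = True) -> None:
        self.root = Path(cache_dir)
        self.enabled = enabled  # plays the role of cache_enabled

    def _path(self, url: str) -> Path:
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        # A two-character shard directory keeps any one folder from growing huge.
        return self.root / digest[:2] / f"{digest}.html"

    def get(self, url: str) -> Optional[str]:
        path = self._path(url)
        if self.enabled and path.is_file():
            return path.read_text(encoding="utf-8")
        return None

    def put(self, url: str, body: str) -> None:
        if not self.enabled:
            return
        path = self._path(url)
        path.parent.mkdir(parents=True, exist_ok=True)  # create the path if missing
        path.write_text(body, encoding="utf-8")


def fetch_with_cache(url: str, cache: DiskCache,
                     fetch: Callable[[str], str]) -> str:
    """Return the cached body if present, otherwise fetch it and store it."""
    cached = cache.get(url)
    if cached is not None:
        return cached
    body = fetch(url)
    cache.put(url, body)
    return body
```

Because the cache stores whatever the fetch callable returns, a fetcher that runs on-load JS would end up caching the rendered markup, which is the behaviour described above.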
Proposed solution
New Fetcher classes that handle the crawler version of fetching and pass the fetched content into the parser. The ContentHash class could also be moved into these classes, or removed from GenerateOutput.
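One possible shape for this, sketched in Python rather than the project's PHP: a base Fetcher owns the loop, the content hashing and the hand-off to the parser, and each backend only implements fetch. All names here (Fetcher, StaticFetcher, on_success, duplicates) are hypothetical, and the sketch hashes the whole body instead of applying hash_selector and hash_exclude_nodes.

```python
import hashlib
from abc import ABC, abstractmethod
from typing import Callable, Dict, Iterable, List


class Fetcher(ABC):
    """Hypothetical base class: fetch each URL and pass the body to a parser callback."""

    def __init__(self, on_success: Callable[[str, str], None]) -> None:
        self.on_success = on_success
        self.duplicates: Dict[str, List[str]] = {}  # first URL -> later URLs with the same body
        self._seen: Dict[str, str] = {}             # content hash -> first URL

    @abstractmethod
    def fetch(self, url: str) -> str:
        """Return the body for one URL; each backend implements this."""

    def run(self, urls: Iterable[str]) -> None:
        for url in urls:
            body = self.fetch(url)
            # Assumption: hash the whole body; the real ContentHash may differ.
            digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
            if digest in self._seen:
                # Same content already parsed under another URL: record and skip.
                self.duplicates.setdefault(self._seen[digest], []).append(url)
                continue
            self._seen[digest] = url
            self.on_success(url, body)


class StaticFetcher(Fetcher):
    """Stand-in backend that serves bodies from a dict instead of the network."""

    def __init__(self, pages: Dict[str, str],
                 on_success: Callable[[str, str], None]) -> None:
        super().__init__(on_success)
        self.pages = pages

    def fetch(self, url: str) -> str:
        return self.pages[url]
```

With the hashing inside the fetcher, the output stage would only ever see content that has already been de-duplicated, which is one way to take ContentHash out of GenerateOutput.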