Replace RollingCurl with Spatie Crawler

The latest changes in the branch feature/spatie-fetch-content introduce customisable Fetchers for content retrieval, a local disk-based cache, and the associated new options.

This is still WIP. Here is a working config that shows the new fetch_options. If you want to try out the JS execution, you will need to update/install the node modules to get Puppeteer, which drives headless Chrome.

---
domain: http://cleanenergyregulator.gov.au

urls:
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=659'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=667'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=669'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=669#something-something'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=666'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=668#anchor-me'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=665#Anchor-Me'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=665#Anchor-Me-Duplicate'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=665#Anchor-Me-Duplicate-Another'

url_options:
  # Default false
  include_query: true       

  # Default false
  include_fragment: true

  # Default true
  find_content_duplicates: true

  # Default '//body'
  hash_selector: '//body' 

  # Default script, comment, style, input, head  
  hash_exclude_nodes:      
    - '//script'  
    - '//comment()'
    - '//style'
    - '//input'
    - '//head'

fetch_options:  
  # Default 10   
  concurrency: 10

  # Delay between requests, default 100 milliseconds
  delay: 100

  # Cache content (and use previously cached content), default true
  cache_enabled: false

  # Cache storage root dir (path created if doesn't exist), default /tmp/merlin_cache
  cache_dir: '/tmp/merlin_cache'

  # Fetcher class, default \Migrate\Fetcher\Fetchers\SpatieCrawler\FetcherSpatieCrawler
  fetcher_class: '\Migrate\Fetcher\Fetchers\SpatieCrawler\FetcherSpatieCrawler'
  #fetcher_class: '\Migrate\Fetcher\Fetchers\Curl\FetcherCurl'
  # Deprecated:
  # fetcher_class: '\Migrate\Fetcher\Fetchers\RollingCurl\FetcherRollingCurl'

  # Execute on-load JS, default false.
  # Currently only available if using the FetcherSpatieCrawler fetcher class
  execute_js: false

  # Follow redirects, default true.
  follow_redirects: true

  # Ignore SSL errors, default false.
  ignore_ssl_errors: true

  # User Agent, default: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36"
  user_agent: 'Merlin'

  # Timeouts.  When using execute_js, you want to have reasonably long timeouts.
  # Not all timeouts are applicable to all Fetchers.
  timeouts:
    connect_timeout: 15
    timeout: 60
    # FetcherSpatieCrawler only
    read_timeout: 30

entity_type: news
mappings:
  -
    field: alias
    type: alias
  -
    field: title
    selector: '.contentPage-content-area h1'
    # selector: '.this-class-isnt-found-eh' 
    type: text
    processors:
      whitespace: { }
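
For trying the JS path specifically, a fetch_options block along these lines might be a reasonable starting point. It uses only keys that appear in the config above. The timeout values are illustrative guesses, not tested recommendations.

```yaml
fetch_options:
  # execute_js is currently only available with the FetcherSpatieCrawler fetcher class
  fetcher_class: '\Migrate\Fetcher\Fetchers\SpatieCrawler\FetcherSpatieCrawler'
  execute_js: true

  # Illustrative values only (assumption): longer than in the sample above,
  # to give pages time to run their on-load JS
  timeouts:
    connect_timeout: 15
    timeout: 120
    # FetcherSpatieCrawler only
    read_timeout: 60
```

Any option left out of the block should fall back to the default listed in the comments of the full config.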

salsadigitalauorg / merlin-framework

Replace RollingCurl with Spatie Crawler #71