salsadigitalauorg / merlin-framework

Merlin - migration framework
GNU General Public License v3.0

Replace RollingCurl with Spatie Crawler #71

Closed derklempner closed 5 years ago

derklempner commented 5 years ago

Description: Replacing RollingCurl with Spatie Crawler would facilitate the implementation of caching (#41) and enable JavaScript execution on fetch. Cached versions of content could then contain the rendered, post-load view of any JS-enabled pages rather than the plain HTML DOM.

Proposed solution: New Fetcher classes that handle the crawler-based fetching and pass the results into the parser. The ContentHash class could also be moved into these classes, which would remove it from GenerateOutput.
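To make the proposed split concrete, here is a minimal sketch of a pluggable fetcher wrapped by a disk cache. It is written in Python purely for illustration (Merlin itself is PHP), and every name in it (`Fetcher`, `StaticFetcher`, `CachingFetcher`) is hypothetical rather than taken from the branch. The cache key (SHA-256 of the URL) is also an assumption, not necessarily what the real cache does.

```python
import hashlib
import os
from abc import ABC, abstractmethod


class Fetcher(ABC):
    """Hypothetical base class: turns a URL into an HTML string."""

    @abstractmethod
    def fetch(self, url: str) -> str:
        ...


class StaticFetcher(Fetcher):
    """Stand-in for a real crawler: serves canned pages and counts calls."""

    def __init__(self, pages: dict):
        self.pages = pages
        self.calls = 0

    def fetch(self, url: str) -> str:
        self.calls += 1
        return self.pages[url]


class CachingFetcher(Fetcher):
    """Wraps any Fetcher with a disk cache keyed by the SHA-256 of the URL."""

    def __init__(self, inner: Fetcher, cache_dir: str):
        self.inner = inner
        self.cache_dir = cache_dir
        # Create the cache root if it does not exist yet.
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, url: str) -> str:
        key = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, key + ".html")

    def fetch(self, url: str) -> str:
        path = self._path(url)
        if os.path.exists(path):
            # Cache hit: return the stored content without touching the network.
            with open(path, encoding="utf-8") as fh:
                return fh.read()
        # Cache miss: delegate to the wrapped fetcher, then store the result.
        html = self.inner.fetch(url)
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(html)
        return html
```

Because the cache only sees the final string returned by the wrapped fetcher, a JS-executing fetcher would cache the rendered page with no extra work, which is the benefit described above.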

derklempner commented 5 years ago

The latest changes in the branch feature/spatie-fetch-content introduce customisable Fetchers for content retrieval, a local disk-based cache, and the associated new options.

This is still WIP, but here is a working config that shows the new fetch_options. If you want to try out the JS execution, you will need to install or update the node modules to get Puppeteer, which drives headless Chrome.

---
domain: http://cleanenergyregulator.gov.au

urls:
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=659'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=667'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=669'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=669#something-something'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=666'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=668#anchor-me'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=665#Anchor-Me'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=665#Anchor-Me-Duplicate'
  - '/About/Pages/News%20and%20updates/NewsItem.aspx?ListId=19b4efbb-6f5d-4637-94c4-121c1f96fcfe&ItemId=665#Anchor-Me-Duplicate-Another'

url_options:
  # Default false
  include_query: true       

  # Default false
  include_fragment: true

  # Default true
  find_content_duplicates: true

  # Default '//body'
  hash_selector: '//body' 

  # Default script, comment, style, input, head  
  hash_exclude_nodes:      
    - '//script'  
    - '//comment()'
    - '//style'
    - '//input'
    - '//head'

fetch_options:  
  # Default 10   
  concurrency: 10

  # Delay between requests, default 100 milliseconds
  delay: 100

  # Cache content (and use previously cached content), default true
  cache_enabled: false

  # Cache storage root dir (path created if doesn't exist), default /tmp/merlin_cache
  cache_dir: '/tmp/merlin_cache'

  # Fetcher class, default \Migrate\Fetcher\Fetchers\SpatieCrawler\FetcherSpatieCrawler
  fetcher_class: '\Migrate\Fetcher\Fetchers\SpatieCrawler\FetcherSpatieCrawler'
  #fetcher_class: '\Migrate\Fetcher\Fetchers\Curl\FetcherCurl'
  # Deprecated:
  # fetcher_class: '\Migrate\Fetcher\Fetchers\RollingCurl\FetcherRollingCurl'

  # Execute on-load JS, default false.
  # Currently only available if using the FetcherSpatieCrawler fetcher class
  execute_js: false

  # Follow redirects, default true.
  follow_redirects: true

  # Ignore SSL errors, default false.
  ignore_ssl_errors: true

  # User Agent, default: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36"
  user_agent: 'Merlin'

  # Timeouts.  When using execute_js, you want to have reasonably long timeouts.
  # Not all timeouts are applicable to all Fetchers.
  timeouts:
    connect_timeout: 15
    timeout: 60
    # FetcherSpatieCrawler only
    read_timeout: 30

entity_type: news
mappings:
  -
    field: alias
    type: alias
  -
    field: title
    selector: '.contentPage-content-area h1'
    # selector: '.this-class-isnt-found-eh' 
    type: text
    processors:
      whitespace: { }