salsadigitalauorg / merlin-framework

Merlin - migration framework
GNU General Public License v3.0

Add a new group plugin for crawler. #137

Closed steveworley closed 3 years ago

steveworley commented 3 years ago

Adds a new crawler group plugin that helps identify page types by matching an element's attribute value against a regular expression. The main idea is to use Merlin's crawler to quickly get indicative counts of page types through element and attribute filtering.

The example below identifies content types from the node-type-* classes that standard Drupal output places on the body element.

Example config:

groups:
    -
      id: node
      type: element_filter
      options:
        selector: body
        filter_attr: class
        pattern: '/node-type-\w+/'

Output:

Generating /tmp/test/crawled-urls-node_node.yml Done!
Generating /tmp/test/crawled-urls-node_node-node-type-page.yml Done!
Generating /tmp/test/crawled-urls-node_node-node-type-landing.yml Done!
Generating /tmp/test/crawled-urls-node_node-node-type-webform.yml Done!
Generating /tmp/test/crawled-urls-node_node-node-type-news.yml Done!
Generating /tmp/test/crawled-urls-node_node-node-type-footer.yml Done!
Generating /tmp/test/crawled-urls-node_redirects.yml Done!
Generating /tmp/test/crawled-urls-node_default.yml Done!
Generating /tmp/test/crawled-urls-node_node-node-type-consultation.yml Done!
Generating /tmp/test/crawled-urls-node_node-node-type-literature.yml Done!
Generating /tmp/test/crawl-error-node.yml Done!
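
For readers who want to see the idea in isolation, here is a minimal Python sketch of the grouping logic the config describes: read one attribute from the selected element, run the regular expression over it, and bucket each URL under the matched text. It is not Merlin's implementation (Merlin is written in PHP). The names `AttrGrabber`, `classify` and `group_urls` are hypothetical, the selector is simplified to a bare tag name, and the pattern is written without the PHP-style `/` delimiters used in the config. Putting unmatched pages under a `default` key is an assumption made only for this illustration.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class AttrGrabber(HTMLParser):
    """Collects one attribute from the first element with a given tag name."""

    def __init__(self, tag, attr):
        super().__init__()
        self.tag = tag
        self.attr = attr
        self.value = None

    def handle_starttag(self, tag, attrs):
        # Only the first matching element is considered.
        if tag == self.tag and self.value is None:
            self.value = dict(attrs).get(self.attr) or ""


def classify(html, selector="body", filter_attr="class", pattern=r"node-type-\w+"):
    """Return the text matched by `pattern` in the attribute, or None."""
    grabber = AttrGrabber(selector, filter_attr)
    grabber.feed(html)
    match = re.search(pattern, grabber.value or "")
    return match.group(0) if match else None


def group_urls(pages, **options):
    """Bucket URLs by matched attribute text; unmatched URLs go to 'default'."""
    groups = defaultdict(list)
    for url, html in pages.items():
        groups[classify(html, **options) or "default"].append(url)
    return dict(groups)


# Hypothetical sample pages, keyed by URL.
sample_pages = {
    "/about": '<html><body class="html node-type-page"></body></html>',
    "/news/1": '<body class="node-type-news front">',
    "/misc": "<body>",
}
print(group_urls(sample_pages))
```

Each key of the returned dictionary plays the role of one generated file in the output above, and the length of each list gives the indicative count for that page type.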