sminnee / silverstripe-staticsiteconnector

Connector plugin for the SilverStripe External Content module that uses web scraping to import content.

SilverStripe Static Site Connector

This connector extracts content from another site by crawling its HTML rather than connecting to an internal API. The disadvantage is that it cannot extract any information or structure that is not represented in the site's rendered HTML. In return, it requires no special access and does not rely on particular back-end systems, which makes it suited to experimental site imports and to connections with more obscure CMSs.

It works in the following way:

Installation

This module requires the PHP semaphore functions. These are installed by default on Debian PHP distributions, but if you're using MacPorts you'll need to add the +ipc flag when installing php5:

sudo port install php5 +apache2 +ipc

If compiling PHP from source you need to pass three additional flags to the configure script:

./configure <usual flags> '--enable-sysvsem' '--enable-sysvshm' '--enable-sysvmsg'

Once that's done, you can use Composer to add the module to your SilverStripe project:

composer require silverstripe/staticsiteconnector

Finally, visit /dev/build on your site to update the database schema.

Migration

That's it! There are quite a few steps, but it's easier than copying and pasting all those pages.

Schema

Schema is the name given to the collection of rules that determine how a crawled website's markup is formatted and stored in SilverStripe's DataObjects during import.

Each rule in a schema pairs a CSS selector, which identifies the content area on a specific page of the crawled site, with the DataObject field within SilverStripe where that content should be stored.

Example Rules:

Note: This example assumes your import uses a subclass of SiteTree.

Title

This rule takes the content of the crawled site's <h1> element and imports it into the SiteTree.Title field, which forms your imported page's <title> element.

MenuTitle

This rule takes the content of the crawled site's <h1> element and imports it into the SiteTree.MenuTitle field, which is used in the CMS's SiteTree list.

Content

This rule takes the crawled site's main body content (excluding any <h1> elements); in this example we pretend it is all wrapped in a div#content element. This content is then stored in the SiteTree.Content field.

Meta - Description

This rule collects the contents of a crawled page's <meta> (description) element and imports them into the SiteTree.MetaDescription field. You can adapt this to suit other <meta> elements you wish to import.
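
To make the selector-to-field idea concrete, here is a minimal, language-neutral sketch in Python. It is not the module's own code and does not reflect how the module stores or applies its schema. The names SCHEMA, SchemaExtractor, extract and SAMPLE_PAGE are hypothetical, as is the sample page itself, and the tiny selector matcher only understands the three selector shapes used in the example rules (tag, tag#id and tag[attr=value]).

```python
import re
from html.parser import HTMLParser

# Hypothetical schema: (SiteTree field, CSS selector, attribute to read).
# An attribute of None means "take the element's text content".
SCHEMA = [
    ("Title", "h1", None),
    ("MenuTitle", "h1", None),
    ("Content", "div#content", None),
    ("MetaDescription", "meta[name=description]", "content"),
]

# Only supports: tag, tag#id, tag[attr=value]. Real CSS is far richer.
SELECTOR = re.compile(r"^(\w+)(?:#([\w-]+))?(?:\[([\w-]+)=([\w-]+)\])?$")


def matches(selector, tag, attrs):
    """Return True if an element (tag + attribute dict) matches the selector."""
    parsed = SELECTOR.match(selector)
    if parsed is None:
        raise ValueError("unsupported selector: " + selector)
    want_tag, want_id, attr_name, attr_value = parsed.groups()
    if tag != want_tag:
        return False
    if want_id and attrs.get("id") != want_id:
        return False
    if attr_name and attrs.get(attr_name) != attr_value:
        return False
    return True


class SchemaExtractor(HTMLParser):
    """Applies each rule to the first matching element in the page."""

    def __init__(self, schema):
        super().__init__()
        self.schema = schema
        self.result = {}  # field name -> extracted value
        self.active = {}  # field name -> [tag, nesting depth, text parts]

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Count nested same-named tags so a capture ends at the right close tag.
        for capture in self.active.values():
            if capture[0] == tag:
                capture[1] += 1
        for field, selector, attribute in self.schema:
            if field in self.result or field in self.active:
                continue
            if not matches(selector, tag, attrs):
                continue
            if attribute is not None:
                self.result[field] = attrs.get(attribute, "")
            else:
                self.active[field] = [tag, 1, []]

    def handle_data(self, data):
        for capture in self.active.values():
            capture[2].append(data)

    def handle_endtag(self, tag):
        for field in list(self.active):
            capture = self.active[field]
            if capture[0] == tag:
                capture[1] -= 1
                if capture[1] == 0:
                    # Join the text pieces and collapse runs of whitespace.
                    self.result[field] = " ".join(" ".join(capture[2]).split())
                    del self.active[field]


def extract(html, schema):
    """Return a dict of field name -> value for one crawled page."""
    parser = SchemaExtractor(schema)
    parser.feed(html)
    parser.close()
    return parser.result


# Hypothetical crawled page used only for illustration.
SAMPLE_PAGE = """
<html><head>
<title>Old site</title>
<meta name="description" content="About our company">
</head><body>
<h1>About Us</h1>
<div id="content"><div class="intro">We make widgets.</div><p>Est. 1999.</p></div>
</body></html>
"""

if __name__ == "__main__":
    for field, value in extract(SAMPLE_PAGE, SCHEMA).items():
        print(field, "=", value)
```

The sketch keeps only plain text, whereas a real import into SiteTree.Content would normally keep the HTML markup. It also does not implement the exclusion of <h1> elements from the content area; the sample page simply places the <h1> outside div#content.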

License

This code is available under the BSD license, with the exception of the PHPCrawl library bundled with this module, which is licensed under GPL version 2.