localnerve / html-snapshots

A selector-based html snapshot tool using Puppeteer or PhantomJS that sources sitemap.xml, sitemap-index, robots.txt, or arbitrary input
MIT License

html-snapshots

Takes html snapshots of your site's crawlable pages when an element you select is rendered.

Overview

html-snapshots is a flexible html snapshot library that uses a headless browser to take html snapshots of the webpages served from your site. A snapshot is only taken when a specified selector is detected as visible in the output html. This tool is useful when your site is largely ajax content, or an SPA, and you want your dynamic content indexed by search engines.

html-snapshots gets the urls to process from a robots.txt, sitemap.xml, or sitemap-index.xml. Alternatively, you can supply an array of completely arbitrary urls, or a line-delimited text file of arbitrary host-relative paths.
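
As a rough sketch of the array input, the helper below builds an options object for an explicit list of urls. The `buildArrayOptions` helper is hypothetical, and the `input: 'array'` value is an assumption that should be checked against the options reference for your installed version. `source`, `selector`, and `outputDir` are the options used in the Quick Example in this document.

```javascript
// Hypothetical helper: builds run() options for an explicit list of urls.
// ASSUMPTION: `input: 'array'` tells html-snapshots that `source` is an array.
function buildArrayOptions(urls, selector, outputDir) {
  if (!Array.isArray(urls) || urls.length === 0) {
    throw new TypeError('urls must be a non-empty array');
  }
  return {
    input: 'array',
    source: urls.slice(), // copy so later changes to the caller's array have no effect
    selector,
    outputDir
  };
}

const options = buildArrayOptions(
  ['https://host.domain/', 'https://host.domain/about'],
  '#dynamic-content',
  './snapshots'
);
// `options` can then be passed to htmlSnapshots.run(options).
```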

Getting Started

Installation

The simplest way to install html-snapshots is with npm: running npm install html-snapshots downloads html-snapshots and all of its dependencies.

Gulp Task

This is a node library that just works with gulp as-is.
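
One way this might look in practice: gulp accepts a task function that returns a Promise, and run returns a Promise, so a task can simply return the result of run. The `makeSnapshotsTask` factory below is a hypothetical name used for illustration, and the module is passed in as an argument so the sketch does not depend on how it is loaded.

```javascript
// Hypothetical factory: returns a gulp-compatible task function.
// `htmlSnapshots` is the module from require('html-snapshots').
function makeSnapshotsTask(htmlSnapshots, options) {
  return function snapshots() {
    // gulp treats a returned Promise as the task's completion signal.
    return htmlSnapshots.run(options);
  };
}

// In a gulpfile.js this might be used as:
//   exports.snapshots = makeSnapshotsTask(require('html-snapshots'), {
//     source: 'https://host.domain/robots.txt',
//     selector: '#dynamic-content',
//     outputDir: './snapshots'
//   });
```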

Grunt Task

If you are interested in the grunt task that uses this library, check out grunt-html-snapshots.

More Information

Here are some background and other notes regarding this project.

Process Model

html-snapshots takes snapshots in parallel, with each page getting its own browser process. Each browser process exits after snapshotting one page. The processLimit option caps the number of browser processes that can run at once, which effectively creates a process pool for browser instances. The default processLimit is 4. When a browser process exits and more snapshots remain, a new browser process is spawned to fill the vacant slot, so at most processLimit processes run at any one time.
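
The pooling idea can be illustrated with a generic concurrency limiter. This is a sketch of the general technique only, not the library's internal code: `runWithLimit` and the simulated worker are hypothetical, and timers stand in for browser processes. It assumes a limit of at least 1.

```javascript
// Illustration only: a generic concurrency limiter, not html-snapshots internals.
// `worker(item)` returns a Promise; here it stands in for one browser process
// that exits after a single snapshot.
function runWithLimit(items, limit, worker) {
  return new Promise((resolve, reject) => {
    const results = new Array(items.length);
    let next = 0;     // index of the next item to start
    let running = 0;  // workers currently in flight
    let finished = 0; // workers that have completed

    if (items.length === 0) {
      resolve(results);
      return;
    }

    function startNext() {
      // Fill vacant slots until the limit is reached or no items remain.
      while (running < limit && next < items.length) {
        const index = next++;
        running++;
        Promise.resolve()
          .then(() => worker(items[index]))
          .then(result => {
            results[index] = result;
            running--;
            finished++;
            if (finished === items.length) {
              resolve(results);
            } else {
              startNext();
            }
          }, reject);
      }
    }

    startNext();
  });
}

// Simulated run: six pages, at most four "processes" at once.
let active = 0;
let peak = 0;
const pages = ['home', 'about', 'blog', 'contact', 'faq', 'team'];
const done = runWithLimit(pages, 4, page => new Promise(res => {
  active++;
  peak = Math.max(peak, active);
  setTimeout(() => { active--; res(page + '.html'); }, 10);
}));
```

Once `done` resolves, `peak` holds the highest number of simulated processes that were ever active together, which never exceeds the limit passed in.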

API

The API is a single run method that returns a Promise.

Promise run (options[, callback])

A method that takes options and an optional callback. Returns a Promise.
Syntax:

const htmlSnapshots = require('html-snapshots');

htmlSnapshots.run(options[, callback])
.then(completed => {
  // `completed` is an array of paths to the completed snapshots.
})
.catch(errorObject => {
  // `errorObject` is an instance of Error
  // `errorObject.completed` is an array of paths to the snapshots that did successfully complete.
  // `errorObject.notCompleted` is an array of paths to files that DID NOT successfully complete.
});

Callback

The callback is optional because the run method returns a Promise that resolves on completion. If you supply a callback, it will be called, but the Promise will ALSO resolve. Callback usage is deprecated and remains available only for compatibility with older versions.

Signature of the optional callback:

callback (errorObject, arrayOfPathsToCompletedSnapshots)

For the callback, in the error case, the errorObject does not have the new extra properties completed and notCompleted. However, arrayOfPathsToCompletedSnapshots is supplied, and contains the paths to the snapshots that successfully completed.
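
The difference between the two error shapes can be sketched with a pair of small helpers that normalize each one into the same structure. The helper names, the error message, and the file paths below are all made up for illustration; only the completed and notCompleted properties and the callback signature come from the description above.

```javascript
// Hypothetical helpers that put both error shapes into one structure.

// Promise style: the rejected Error carries `completed` and `notCompleted`.
function fromRejection(error) {
  return {
    error,
    completed: error.completed || [],
    notCompleted: error.notCompleted || []
  };
}

// Callback style: completed paths arrive as the second argument, and the
// list of failed paths is not available, so it is reported as null.
function fromCallback(error, completedPaths) {
  return {
    error: error || null,
    completed: completedPaths || [],
    notCompleted: null
  };
}

// Simulated failure (made-up message and paths) to show the difference.
const failure = Object.assign(new Error('simulated failure'), {
  completed: ['snapshots/index.html'],
  notCompleted: ['snapshots/about/index.html']
});
const viaPromise = fromRejection(failure);
const viaCallback = fromCallback(new Error('simulated failure'), ['snapshots/index.html']);
```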

Example Usage

This example reads the pages from a mix of sitemap or sitemap-index files found in the robots.txt and produces snapshots in the ./snapshots directory. In this example, a selector named "#dynamic-content" appears in all pages across the site. Once this selector is visible in a page, the html snapshot is taken and saved to ./snapshots.

Quick Example

const htmlSnapshots = require('html-snapshots');
htmlSnapshots.run({
  source: 'https://host.domain/robots.txt',
  selector: '#dynamic-content',
  outputDir: './snapshots',
  outputDirClean: true
})
.then(completed => {
  // completed is an array of full file paths to the completed snapshots.
})
.catch(error => {
  // error is an Error instance.
  // error.completed is an array of snapshot file paths that did complete.
  // error.notCompleted is an array of file paths that did NOT complete.
});

More examples can be found in this document. Also, a showcase of runnable examples can be found here.

An older (version 0.13.2), more in-depth usage example is located in this article, which includes an explanation and the code of a real usage featuring dynamic app routes, ExpressJS, Heroku, and more.

Options

Every option has a default value except outputDir.

Input Control Options

Sitemap Only Input Options

Options that apply to robots.txt with Sitemap directives, sitemap, and sitemap-index input.

Origin Options

Origin options are only useful for the textfile input type and for robots.txt files that use Allow directives.

Output Control Options

Snapshot Control Options

Process Control Options

Example Rewrite Rule

Here is an example Apache rewrite rule for rewriting _escaped_fragment_ requests to the snapshots directory on your server.

<ifModule mod_rewrite.c>
  RewriteCond %{QUERY_STRING} ^_escaped_fragment_=(.*)$
  RewriteCond %{REQUEST_URI} !^/snapshots [NC]
  RewriteRule ^(.*)/?$ /snapshots/$1 [L]
</ifModule>

This rewrites any _escaped_fragment_ request for a url (perhaps found by a bot in your robots.txt or sitemap.xml) to the snapshot output directory. No translation is done in this example; the request path is used as-is to locate its corresponding snapshot. So a request for http://mysite.com/?_escaped_fragment_= serves the mysite.com homepage snapshot.

Connect-modrewrite

You can also route _escaped_fragment_ requests to your snapshots in ExpressJS using a similar method with the connect-modrewrite middleware. Here is an analogous example of a connect-modrewrite rule:

  '^(.*)\\?_escaped_fragment_=.*$ /snapshots/$1 [NC L]'
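
To see what the pattern in this rule matches, it can be tried out with a plain JavaScript RegExp. This sketch models only the pattern and replacement parts of the rule with String.prototype.replace and does not load or run connect-modrewrite; the `toSnapshotPath` helper is hypothetical, and the 'i' flag is used here as a stand-in for NC.

```javascript
// Models only the pattern and replacement of the rule with a plain RegExp;
// connect-modrewrite itself is not loaded or run here.
const rule = '^(.*)\\?_escaped_fragment_=.*$ /snapshots/$1 [NC L]';
const [patternSource, replacement] = rule.split(' ');

// The 'i' flag stands in for NC (no case).
const pattern = new RegExp(patternSource, 'i');

// Hypothetical helper: applies the replacement only when the pattern matches.
function toSnapshotPath(url) {
  return pattern.test(url) ? url.replace(pattern, replacement) : url;
}

const rewritten = toSnapshotPath('/about?_escaped_fragment_=');
const untouched = toSnapshotPath('/about');
// rewritten is '/snapshots//about' (the captured path keeps its leading slash)
// untouched is '/about' (no _escaped_fragment_ query, so no rewrite)
```

How the doubled slash is treated afterwards depends on the middleware and the static file server, which is outside this sketch.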

Middleware Example

An ExpressJS middleware example using html-snapshots can be found at wpspa/server/middleware/snapshots.js.
Here is the article on how this middleware works with html-snapshots.

License

This software is free to use under the LocalNerve, LLC MIT license. See the LICENSE file for license text and copyright information.

Third-party open source dependencies are listed in the package.json file.