veliovgroup / jazeee-meteor-spiderable

Fork of Meteor Spiderable with longer timeout, caching, better server handling
https://atmospherejs.com/jazeee/spiderable-longer-timeout

spiderable-longer-timeout

About

This is a fork of the standard Meteor spiderable package, with some code merged from the ongoworks:spiderable package. Primarily, it lengthens the timeout to 30 seconds and raises the size limit to 10MB. All results are cached in a Mongo collection, by default for 3 hours (180 minutes).

This package ignores all SSL errors in favor of page fetching.

This package supports "real response-code" and "real headers": if your route returns a 301 response code with some headers, the package returns the same code and headers. This package also supports JavaScript redirects.

PhantomJS, and consequently this package, does not support ES6 (ECMAScript 2015). If you are not compiling ES6 to ES5, or you use NPM packages written in ES6 (Meteor doesn't compile NPM packages), rendering will result in blank pages. There is no easy drop-in package or solution for this. We recommend prerendering with ostr.io, which supports ES6 (ECMAScript 2015) and can be installed with one command.

This package is tested with iron-router, flow-router, and flow-router-extra, with and without the following packages:

This package has a built-in caching mechanism. By default it stores results for 3 hours; to change the storage period, set Spiderable.cacheLifetimeInMinutes to another value in minutes.

Installation

meteor add jazeee:spiderable-longer-timeout

ES6 import

import { Spiderable } from 'meteor/jazeee:spiderable-longer-timeout';

Setup:

SPIDERABLE_FLAGS environment variable

Issues like select: Invalid argument can usually be solved by passing additional flags (arguments) to the phantomjs process. Default flags:

phantomjs --load-images=no --ssl-protocol=TLSv1 --ignore-ssl-errors=true --web-security=false

SSL/TLS issues:

SPIDERABLE_FLAGS="--ssl-protocol=any"

Caching - minor speed increase (make sure the /data/phantomjs directory exists and is writable):

SPIDERABLE_FLAGS="--disk-cache=true --disk-cache-path=/data/phantomjs"

Cookies and localStorage (make sure the /data/phantomjs directory exists and is writable):

SPIDERABLE_FLAGS="--cookies-file=/data/phantomjs/cookies.txt --local-storage-path=/data/phantomjs"

AppCache (make sure the /data/phantomjs directory exists and is writable):

SPIDERABLE_FLAGS="--offline-storage-path=/data/phantomjs"

XHR and parent <-> child window access:

SPIDERABLE_FLAGS="--local-to-remote-url-access=true"

All flags (make sure the /data/phantomjs directory and the /data/phantomjs/cookies.txt file exist and are writable):

SPIDERABLE_FLAGS="--load-images=false --ssl-protocol=any --ignore-ssl-errors=true --disk-cache=true --disk-cache-path=/data/phantomjs --cookies-file=/data/phantomjs/cookies.txt --local-storage-path=/data/phantomjs --local-to-remote-url-access=true --offline-storage-path=/data/phantomjs --web-security=false"

Usage:

# To start process with env.var
SPIDERABLE_FLAGS="--load-images=false --ssl-protocol=any --ignore-ssl-errors=true" meteor

# Set temporary env.var
export SPIDERABLE_FLAGS="--load-images=false --ssl-protocol=any --ignore-ssl-errors=true"

Within Phusion Passenger:

server {
  passenger_env_var SPIDERABLE_FLAGS "--load-images=false --ssl-protocol=any --ignore-ssl-errors=true";
}
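
To illustrate how a SPIDERABLE_FLAGS value relates to the separate arguments phantomjs receives, here is a minimal sketch. It is not this package's implementation: parsePhantomFlags is a hypothetical helper, and it assumes that a non-empty variable replaces the default flags rather than extending them, which this README does not state.

```javascript
// Default flags as listed earlier in this README.
const DEFAULT_FLAGS = [
  '--load-images=no',
  '--ssl-protocol=TLSv1',
  '--ignore-ssl-errors=true',
  '--web-security=false',
];

// Hypothetical helper (not part of the package): split a
// SPIDERABLE_FLAGS-style string on whitespace into separate process
// arguments. Assumption: an empty or unset value falls back to the defaults.
function parsePhantomFlags(flagString) {
  const flags = (flagString || '').trim().split(/\s+/).filter(Boolean);
  return flags.length > 0 ? flags : DEFAULT_FLAGS.slice();
}

// Prints the flags that would be used for the current environment.
console.log(parsePhantomFlags(process.env.SPIDERABLE_FLAGS));
```

Splitting on whitespace means a flag value that itself contains spaces (for example a path) would not survive this sketch; the paths used in this README do not contain any.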

isReadyForSpiderable {Boolean}

On both server and client, this tells Spiderable that everything is ready. Spiderable waits for Meteor.isReadyForSpiderable to be true, which gives finer control over when content is ready to be published.

Router.onAfterAction( function () {
  if (this.ready()) {
    Meteor.isReadyForSpiderable = true;
  }
});

Options

userAgentRegExps {[RegExp]}

An array of regular expressions matching the user agents of bots that we want to serve statically, but that do not obey the _escaped_fragment_ protocol. Optionally set or extend the Spiderable.userAgentRegExps list.

Spiderable.userAgentRegExps.push(/^vkShare/i);
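
As a rough sketch of how such a list can be applied to an incoming request, the snippet below tests a User-Agent string against an array of regular expressions. The array and the matchesAnyBot helper are hypothetical stand-ins; apart from /^vkShare/i, the patterns are illustrative and are not the package's default list.

```javascript
// Hypothetical stand-in for Spiderable.userAgentRegExps. Only /^vkShare/i
// comes from this README; the other patterns are illustrative assumptions.
const userAgentRegExps = [/^facebookexternalhit/i, /^twitterbot/i];
userAgentRegExps.push(/^vkShare/i);

// Hypothetical helper: true when the User-Agent matches any pattern.
// The patterns carry no "g" flag, so RegExp.prototype.test keeps no state
// between calls.
function matchesAnyBot(userAgent, regExps) {
  return regExps.some((re) => re.test(userAgent || ''));
}

console.log(matchesAnyBot('vkShare; +http://vk.com/dev/Share', userAgentRegExps));
console.log(matchesAnyBot('Mozilla/5.0 (X11; Linux x86_64)', userAgentRegExps));
```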

Default Bots:

cacheLifetimeInMinutes (Cache TTL) {Number}

How long cached Spiderable results should be stored (in minutes). Note:

Spiderable.cacheLifetimeInMinutes = 60; // 1 hour in minutes

To change the cache lifetime, first drop the cache index. To do so, run in the Mongo console:

db.SpiderableCacheCollection.dropIndex('createdAt_1');
/* or */
db.SpiderableCacheCollection.dropIndexes();
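
The index name createdAt_1 suggests that expiry relies on a MongoDB TTL index on createdAt; this README does not confirm that, so treat the following as a sketch under that assumption. A TTL index stores its lifetime in seconds in the expireAfterSeconds option, and MongoDB refuses to create an index with the same keys but different options, which would explain why the old index has to be dropped first. ttlIndexOptions is a hypothetical helper, not part of the package.

```javascript
// Hypothetical helper: convert a lifetime in minutes into the options
// object a MongoDB TTL index expects (expireAfterSeconds is in seconds).
function ttlIndexOptions(cacheLifetimeInMinutes) {
  return { expireAfterSeconds: cacheLifetimeInMinutes * 60 };
}

console.log(ttlIndexOptions(180)); // default lifetime of 3 hours
console.log(ttlIndexOptions(60));  // 1 hour
```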

ignoredRoutes {[String]}

Spiderable.ignoredRoutes is an array of strings: routes that we want to serve statically, but that do not obey the _escaped_fragment_ protocol. This is a server-only parameter. For more info see this thread.

Spiderable.ignoredRoutes.push('/cdn/storage/Files/');

customQuery {Boolean|String}

Spiderable.customQuery adds an extra GET query parameter to the HTTP request. This option can help you build different client logic for requests coming from phantomjs and from normal users.

Spiderable.customQuery = true;
// or
Spiderable.customQuery = '_fromPhantom_';

// Usage:
Router.onAfterAction(function () {
  if (Meteor.isClient && _.has(this.params.query, '___isRunningPhantomJS___')) {
    Session.set('___isRunningPhantomJS___', true);
  }
});

debug {Boolean}

Shows or hides the server's console messages. Set Spiderable.debug to true to show them.

Spiderable.debug = true;

bufferSize {Number}

Memory allocation for PhantomJS (in bytes). Note:

Spiderable.bufferSize = 10 * 1024 * 1024; // 10MB in bytes

requestTimeout {Number}

Request timeout length (in milliseconds). Note:

Spiderable.requestTimeout = 30 * 1000; // 30 seconds in milliseconds

Response statuses

You can send any response status from phantomjs. This behavior is controlled via a special HTML/JADE comment:

<!--response:status-code=404-->

This directive accepts any 3-digit value, so you may return any standard or custom response code.
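
To make the directive concrete, here is a minimal sketch of how a rendered page could be scanned for it. extractStatusCode is a hypothetical helper and the fallback of 200 is an assumption; the package's own parsing may differ.

```javascript
// Hypothetical parser: look for a "response:status-code=NNN" HTML comment
// (NNN being exactly three digits) and return NNN as a number.
// Assumption: when no directive is present, fall back to 200.
function extractStatusCode(html, fallback = 200) {
  const match = /<!--\s*response:status-code=(\d{3})\s*-->/.exec(html);
  return match ? Number(match[1]) : fallback;
}

console.log(extractStatusCode('<h1>404</h1><!--response:status-code=404-->'));
console.log(extractStatusCode('<p>Regular page</p>'));
```

The optional whitespace in the pattern accepts both the HTML form and the output of a Jade comment, which typically carries a leading space.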

Enable default 404 response if you're using Iron-Router

Router.configure({
  notFoundTemplate: '_404'
});

Router.plugin('dataNotFound', {
  notFoundTemplate: Router.options.notFoundTemplate
});

template(name="_404")
  // response:status-code=404
  h1 404
  h3 Oops, page not found
  p Sorry, the page you requested does not exist or was deleted

<template name="_404">
  <!--response:status-code=404-->
  <h1>404</h1>
  <h3>Oops, page not found</h3>
  <p>Sorry, the page you requested does not exist or was deleted</p>
</template>

Enable default 404 response if you're using Flow-Router

// With layout
FlowRouter.notFound = {
  action() {
    BlazeLayout.render('_layout', {content: '_404'});
  }
}

// Without layout
FlowRouter.notFound = {
  action() {
    BlazeLayout.render('_404');
  }
}

template(name="_404")
  // response:status-code=404
  h1 404
  h3 Oops, page not found
  p Sorry, the page you requested does not exist or was deleted

<template name="_404">
  <!--response:status-code=404-->
  <h1>404</h1>
  <h3>Oops, page not found</h3>
  <p>Sorry, the page you requested does not exist or was deleted</p>
</template>

Supported redirects

window.location.href = 'http://example.com/another/page';
window.location.replace('http://example.com/another/page');

Router.go('/another/page');
Router.current().redirect('/another/page');
Router.route('/one', function () {
  this.redirect('/another/page');
});

Important

Set Meteor.isReadyForSpiderable to true when your route is finished, so that the page can be published. Meteor.isRouteComplete=true is deprecated, but it will work until at least 2015-12-31, after which I'll remove it. See the code for details.

Install PhantomJS on your server

If you deploy your application with meteor bundle, you must install phantomjs (http://phantomjs.org) somewhere in your $PATH. If you use Meteor Up, then meteor deploy can do this for you.

Spiderable.originalRequest is also set to the http request. See issue 1.

Testing

Test your site by appending a query to your URLs: URL?_escaped_fragment_=, as in http://your.site.com/path?_escaped_fragment_=
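
If you want to generate such test URLs in a script, the following sketch uses the standard URL API, which picks ? or & depending on whether the URL already has a query string. toEscapedFragmentUrl is a hypothetical helper name, not something this package provides.

```javascript
// Hypothetical helper: append an empty _escaped_fragment_ parameter to a
// URL, keeping any query parameters that are already present.
function toEscapedFragmentUrl(url) {
  const parsed = new URL(url);
  parsed.searchParams.append('_escaped_fragment_', '');
  return parsed.toString();
}

console.log(toEscapedFragmentUrl('http://localhost:3000/'));
console.log(toEscapedFragmentUrl('http://your.site.com/path?a=1'));
```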

curl

Run curl against localhost, or against your host name if you are in production, like:

curl http://localhost:3000/?_escaped_fragment_=
curl http://localhost:3000/ -A googlebot

Google Tools: Fetch as Google

Use the Fetch as Google tool to scan your site. Tips:

Testing PhantomJS

PhantomJS can be temperamental, and can be a challenge to work with.

If PhantomJS is failing on your server, you can try running it directly to help debug what is broken.

On the server console, try running phantomjs --version

You can also run this package's PhantomJS script directly. To do so, you first need to find the phantom_script.js file.

# Find phantom_script.js
PHANTOM_SCRIPT=$(find /opt/YOUR_WEB_APP/app/ -name phantom_script.js)
# Verify that you found just one
echo ${PHANTOM_SCRIPT}
# Try running phantomjs with that script
phantomjs --load-images=no --ssl-protocol=TLSv1 --ignore-ssl-errors=true --web-security=false ${PHANTOM_SCRIPT} http://localhost
# Verify that it succeeded (should return 0)
echo $?

From Meteor's original Spiderable documentation. See notes specific to this branch (above).

spiderable is part of Webapp. It's one possible way to allow web search engines to index a Meteor application. It uses the AJAX Crawling specification published by Google to serve HTML to compatible spiders (Google, Bing, Yandex, and more).

When a spider requests an HTML snapshot of a page the Meteor server runs the client half of the application inside phantomjs, a headless browser, and returns the full HTML generated by the client code.

In order to have links between multiple pages on a site visible to spiders, apps must use real links (e.g. <a href="/about">) rather than simply re-rendering portions of the page when an element is clicked. Apps should render their content based on the URL of the page and can use HTML5 pushState to alter the URL on the client without triggering a page reload. See the Todos example for a demonstration.

When running your page, spiderable will wait for all publications to be ready. Make sure that all of your publish functions either return a cursor (or an array of cursors), or eventually call this.ready(). Otherwise, the phantomjs executions will fail.