This is a fork of the standard Meteor spiderable package, with some code merged from the ongoworks:spiderable package. Primarily, it lengthens the timeout to 30 seconds and raises the size limit to 10MB. All results are cached in a Mongo collection, by default for 3 hours (180 minutes).
This package ignores all SSL errors in favor of page fetching.
This package supports "real response-code" and "real headers": if your route returns a 301 response code with some headers, the package will return the same code and headers. This package also supports JavaScript redirects.
phantomjs, and consequently this package, doesn't support ES6 (ECMAScript 2015). If you're not compiling ES6 to ES5, or you're using NPM packages written in ES6 (Meteor doesn't compile NPM packages), rendering will result in blank pages. There is no easy way to solve this with a drop-in package. We recommend prerendering by ostr.io, which supports ES6 (ECMAScript 2015) and can be installed with one command.
This package has been tested with iron-router, flow-router, and flow-router-extra, with and without the following packages:
This package has a built-in caching mechanism. By default it stores results for 3 hours; to change the storage period, set Spiderable.cacheLifetimeInMinutes to another value in minutes.
meteor add jazeee:spiderable-longer-timeout
import { Spiderable } from 'meteor/jazeee:spiderable-longer-timeout';
SPIDERABLE_FLAGS environment variable
Issues like select: Invalid argument can be easily solved with additional phantomjs process flags (arguments).
Default flags:
phantomjs --load-images=no --ssl-protocol=TLSv1 --ignore-ssl-errors=true --web-security=false
SSL/TLS issues:
SPIDERABLE_FLAGS="--ssl-protocol=any"
Caching - minor speed increase (make sure the /data/phantomjs directory exists and is writable):
SPIDERABLE_FLAGS="--disk-cache=true --disk-cache-path=/data/phantomjs"
Cookies and localStorage (make sure the /data/phantomjs directory exists and is writable):
SPIDERABLE_FLAGS="--cookies-file=/data/phantomjs/cookies.txt --local-storage-path=/data/phantomjs"
AppCache (make sure the /data/phantomjs directory exists and is writable):
SPIDERABLE_FLAGS="--offline-storage-path=/data/phantomjs"
XHR and parent <-> child window access:
SPIDERABLE_FLAGS="--local-to-remote-url-access=true"
All flags (make sure the /data/phantomjs directory and the /data/phantomjs/cookies.txt file exist and are writable):
SPIDERABLE_FLAGS="--load-images=false --ssl-protocol=any --ignore-ssl-errors=true --disk-cache=true --disk-cache-path=/data/phantomjs --cookies-file=/data/phantomjs/cookies.txt --local-storage-path=/data/phantomjs --local-to-remote-url-access=true --offline-storage-path=/data/phantomjs --web-security=false"
Usage:
# To start process with env.var
SPIDERABLE_FLAGS="--load-images=false --ssl-protocol=any --ignore-ssl-errors=true" meteor
# Set temporary env.var
export SPIDERABLE_FLAGS="--load-images=false --ssl-protocol=any --ignore-ssl-errors=true"
Within Phusion Passenger:
server {
passenger_env_var SPIDERABLE_FLAGS "--load-images=false --ssl-protocol=any --ignore-ssl-errors=true";
}
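To illustrate how a flag string like the ones above could become PhantomJS arguments, here is a minimal sketch. `buildPhantomArgs` and `DEFAULT_FLAGS` are hypothetical names, and whether the package splits the variable exactly this way is an assumption. The sketch splits on whitespace and falls back to the default flags listed above when the variable is unset or empty.

```javascript
// Default flags as listed in this README.
const DEFAULT_FLAGS =
  '--load-images=no --ssl-protocol=TLSv1 --ignore-ssl-errors=true --web-security=false';

// Hypothetical helper: turn a SPIDERABLE_FLAGS-style string into an argv array.
// Assumes individual flags contain no quoted whitespace.
function buildPhantomArgs(envValue) {
  const flags = (envValue && envValue.trim()) || DEFAULT_FLAGS;
  return flags.split(/\s+/);
}

// Reads the real environment variable when it is set.
console.log(buildPhantomArgs(process.env.SPIDERABLE_FLAGS));
```

Passing an array of arguments, rather than one concatenated command string, avoids shell-quoting surprises when a path contains special characters.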
On server and client, this instructs Spiderable that everything is ready. Spiderable will wait for Meteor.isReadyForSpiderable to be true, which allows finer control over when content is ready to be published.
Router.onAfterAction( function () {
if (this.ready()) {
Meteor.isReadyForSpiderable = true;
}
});
An array of regular expressions matching the user agents of bots that we want to serve statically but that do not obey the _escaped_fragment_ protocol. Optionally set or extend the Spiderable.userAgentRegExps list.
Spiderable.userAgentRegExps.push(/^vkShare/i);
Default Bots:
/360spider/i
/adsbot-google/i
/ahrefsbot/i
/applebot/i
/baiduspider/i
/bingbot/i
/duckduckbot/i
/facebookbot/i
/facebookexternalhit/i
/google-structured-data-testing-tool/i
/googlebot/i
/instagram/i
/kaz\.kz_bot/i
/linkedinbot/i
/mail\.ru_bot/i
/mediapartners-google/i
/mj12bot/i
/msnbot/i
/msrbot/i
/oovoo/i
/orangebot/i
/pinterest/i
/redditbot/i
/sitelockspider/i
/skypeuripreview/i
/slackbot/i
/sputnikbot/i
/tweetmemebot/i
/twitterbot/i
/viber/i
/vkshare/i
/whatsapp/i
/yahoo/i
/yandex/
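As a rough illustration of how such a list can be used, the sketch below checks a User-Agent header against a few of the patterns above. `isKnownBot` is a hypothetical helper, not part of the package's API, and the package's actual matching code may differ.

```javascript
// A few of the default patterns listed above, plus one custom addition.
const userAgentRegExps = [/googlebot/i, /bingbot/i, /twitterbot/i];
userAgentRegExps.push(/^vkShare/i);

// Hypothetical helper: true when any pattern matches the User-Agent header.
function isKnownBot(userAgent, patterns) {
  return patterns.some((re) => re.test(userAgent || ''));
}

console.log(isKnownBot('Twitterbot/1.0', userAgentRegExps));
```

Note that a pattern anchored with `^`, such as `/^vkShare/i`, only matches when the user agent string starts with that text, while unanchored patterns match anywhere in the string.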
How long cached Spiderable results should be stored (in minutes). Note:
Meteor.startup
createdAt_1
Spiderable.cacheLifetimeInMinutes = 60; // 1 hour in minutes
If you want to change your cache lifetime, first drop the cache index. To drop the cache index, run in the Mongo console:
db.SpiderableCacheCollection.dropIndex('createdAt_1');
/* or */
db.SpiderableCacheCollection.dropIndexes();
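The index name createdAt_1 suggests an ascending index on a createdAt field. Assuming it is a MongoDB TTL index (an assumption, not confirmed here), the lifetime in minutes would map to the index's `expireAfterSeconds` option as sketched below. `ttlIndexFor` is a hypothetical helper. MongoDB refuses to create an index with the same keys but different options, which would explain why the old index has to be dropped before a new lifetime takes effect.

```javascript
// Hypothetical helper: describe a TTL index for a cache lifetime given in minutes.
function ttlIndexFor(cacheLifetimeInMinutes) {
  return {
    keys: { createdAt: 1 },
    // TTL indexes are configured in seconds, so convert from minutes.
    options: { expireAfterSeconds: cacheLifetimeInMinutes * 60 },
  };
}

const defaultSpec = ttlIndexFor(180);
console.log(defaultSpec.options.expireAfterSeconds); // 10800
```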
Spiderable.ignoredRoutes is an array of strings: routes that we want to serve statically, but that do not obey the _escaped_fragment_ protocol. This is a server-only parameter.
For more info see this thread.
Spiderable.ignoredRoutes.push('/cdn/storage/Files/');
Spiderable.customQuery - an additional GET query that will be appended to the HTTP request.
This option may help to build different client logic for requests from phantomjs and from normal users.
true - Spiderable will append ___isRunningPhantomJS___=true to the query
String - Spiderable will append String=true to the query
Spiderable.customQuery = true;
// or
Spiderable.customQuery = '_fromPhantom_'
// Usage:
Router.onAfterAction( function () {
if (Meteor.isClient && _.has(this.params.query, '___isRunningPhantomJS___')) {
Session.set('___isRunningPhantomJS___', true);
}
});
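The effect of the two customQuery forms on the requested URL can be sketched as follows. `appendCustomQuery` is a hypothetical helper built on the standard `URL` class; it illustrates the described behavior and is not the package's own code.

```javascript
// Hypothetical helper: append the marker parameter described above.
//   customQuery === true    -> ___isRunningPhantomJS___=true
//   customQuery is a string -> <string>=true
//   anything else           -> URL is left unchanged
function appendCustomQuery(urlString, customQuery) {
  const url = new URL(urlString);
  if (customQuery === true) {
    url.searchParams.set('___isRunningPhantomJS___', 'true');
  } else if (typeof customQuery === 'string' && customQuery.length > 0) {
    url.searchParams.set(customQuery, 'true');
  }
  return url.toString();
}

console.log(appendCustomQuery('http://localhost:3000/blogs', '_fromPhantom_'));
// http://localhost:3000/blogs?_fromPhantom_=true
```

On the client, checking for the presence of the chosen parameter name (as in the router hook above) is then enough to tell PhantomJS requests apart from normal ones.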
Show or hide the server's console messages. Set Spiderable.debug to true to show them.
Default: false
Spiderable.debug = true;
Memory allocation for PhantomJS (in bytes). Note:
Meteor.startup
Spiderable.bufferSize = 10 * 1024 * 1024; // 10MB in bytes
Request timeout length. Note:
Meteor.startup
Spiderable.requestTimeout = 30 * 1000; // 30 seconds in milliseconds
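To make the units of the two settings above concrete: Node's `child_process.execFile` accepts a `timeout` option in milliseconds and a `maxBuffer` option in bytes. The sketch below assumes the package passes its settings to the PhantomJS child process in roughly this shape, which is not confirmed here; `phantomExecOptions` is a hypothetical helper.

```javascript
// Values from the examples above.
const bufferSize = 10 * 1024 * 1024; // 10MB, in bytes
const requestTimeout = 30 * 1000; // 30 seconds, in milliseconds

// Hypothetical helper: an options object in the shape accepted by
// child_process.execFile (timeout in ms, maxBuffer in bytes).
function phantomExecOptions(timeoutMs, maxBufferBytes) {
  return { timeout: timeoutMs, maxBuffer: maxBufferBytes };
}

console.log(phantomExecOptions(requestTimeout, bufferSize));
```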
You are able to send any response status from phantomjs. This behavior is controlled via a special HTML/JADE comment:
201 - <!-- response:status-code=201 -->
401 - <!-- response:status-code=401 -->
403 - <!-- response:status-code=403 -->
500 - <!-- response:status-code=500 -->
This directive accepts any 3-digit value, so you may return any standard or custom response code.
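A directive like this can be read back out of rendered HTML with a small regular expression. The sketch below is an illustration only: `statusFromHtml` is a hypothetical helper, the pattern is not necessarily the one the package uses, and falling back to 200 when no directive is present is an assumption.

```javascript
// Hypothetical helper: read the status directive from rendered HTML.
// Returns the 3-digit code as a number, or 200 when no directive is found.
function statusFromHtml(html) {
  const match = /<!--\s*response:status-code=(\d{3})\s*-->/.exec(html);
  return match ? parseInt(match[1], 10) : 200;
}

console.log(statusFromHtml('<!-- response:status-code=404 --><h1>404</h1>')); // 404
```

The `\s*` on both sides lets the sketch accept the comment with or without surrounding spaces, which covers both spellings used in this document.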
404 response if you're using Iron-Router
Set a notFoundTemplate.
Add <!-- response:status-code=404 --> to your template. This way, we can ensure spiderable sends a 404 status code in the response headers.
Use the dataNotFound plugin. See below or read more about iron-router plugins.
Router.configure({
notFoundTemplate: '_404'
});
Router.plugin('dataNotFound', {
notFoundTemplate: Router.options.notFoundTemplate
});
template(name="_404")
// response:status-code=404
h1 404
h3 Oops, page not found
p Sorry, the page you requested does not exist or was deleted
<template name="_404">
<!--response:status-code=404-->
<h1>404</h1>
<h3>Oops, page not found</h3>
<p>Sorry, the page you requested does not exist or was deleted</p>
</template>
404 response if you're using Flow-Router
Add <!-- response:status-code=404 --> to your template. This way, we can ensure spiderable sends a 404 status code in the response headers.
Use the notFound property. See below or read more about flow-router not found routes.
// With layout
FlowRouter.notFound = {
action() {
BlazeLayout.render('_layout', {content: '_404'});
}
}
// Without layout
FlowRouter.notFound = {
action() {
BlazeLayout.render('_404');
}
}
template(name="_404")
// response:status-code=404
h1 404
h3 Oops, page not found
p Sorry, the page you requested does not exist or was deleted
<template name="_404">
<!--response:status-code=404-->
<h1>404</h1>
<h3>Oops, page not found</h3>
<p>Sorry, the page you requested does not exist or was deleted</p>
</template>
window.location.href = 'http://example.com/another/page';
window.location.replace('http://example.com/another/page');
Router.go('/another/page');
Router.current().redirect('/another/page');
Router.route('/one', function () {
this.redirect('/another/page');
});
Set Meteor.isReadyForSpiderable to true when your route is finished, in order to publish.
Meteor.isRouteComplete=true is deprecated, but it will work until at least 2015-12-31, after which I'll remove it...
See code for details
If you deploy your application with meteor bundle, you must install phantomjs (http://phantomjs.org) somewhere in your $PATH. If you use Meteor Up, then meteor deploy can do this for you.
Spiderable.originalRequest is also set to the HTTP request. See issue 1.
Test your site by appending a query to your URLs: URL?_escaped_fragment_=, as in http://your.site.com/path?_escaped_fragment_=
curl your localhost or, if you are in production, your host name, like:
curl http://localhost:3000/?_escaped_fragment_=
curl http://localhost:3000/ -A googlebot
Use the Fetch as Google tools to scan your site. Tips:
Run Fetch as Google and observe that it takes 3-5 minutes before displaying results.
# Simple test with test=1 query
curl "http://localhost:3002/blogs?_escaped_fragment_=&test=1"
# Set the date in the query, which will show up in Meteor logs, with a unique date. (Turn on `Spiderable.debug=true`)
TEST=`date "+%Y%m%d-%H%M%S"`; echo $TEST; curl "http://localhost:3000/blogs?_escaped_fragment_=&test=${TEST}"
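The test URLs used in the curl commands above can also be built programmatically. The sketch below adds the empty _escaped_fragment_ parameter to any URL using the standard `URL` class; `toEscapedFragmentUrl` is a hypothetical helper introduced only for illustration.

```javascript
// Hypothetical helper: build the URL a crawler would request under the
// AJAX crawling scheme by adding an empty _escaped_fragment_ parameter.
function toEscapedFragmentUrl(urlString) {
  const url = new URL(urlString);
  url.searchParams.set('_escaped_fragment_', '');
  return url.toString();
}

console.log(toEscapedFragmentUrl('http://localhost:3000/blogs'));
// http://localhost:3000/blogs?_escaped_fragment_=
```

Using `searchParams` rather than string concatenation keeps any existing query parameters intact and inserts the `?` or `&` separator correctly.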
Interpreting Fetch as Google results:
?_escaped_fragment_= component.
Spiderable successfully completed.
site:your.site.com
PhantomJS can be temperamental, and can be a challenge to work with.
If PhantomJS is failing on your server, you can try running it directly to help debug what is broken.
On the server console, try running phantomjs --version
Also, you can run this package's PhantomJS script. In order to do so, you'd need to find the phantom_script.js file.
# Find phantom_script.js
PHANTOM_SCRIPT=$(find /opt/YOUR_WEB_APP/app/ -name phantom_script.js)
# Verify that you found just one
echo ${PHANTOM_SCRIPT}
# Try running phantomjs with that script
phantomjs --load-images=no --ssl-protocol=TLSv1 --ignore-ssl-errors=true --web-security=false ${PHANTOM_SCRIPT} http://localhost
# Verify that it succeeded (should return 0)
echo $?
spiderable is part of Webapp. It's one possible way to allow web search engines to index a Meteor application. It uses the AJAX Crawling specification published by Google to serve HTML to compatible spiders (Google, Bing, Yandex, and more).
When a spider requests an HTML snapshot of a page the Meteor server runs the client half of the application inside phantomjs, a headless browser, and returns the full HTML generated by the client code.
In order to have links between multiple pages on a site visible to spiders, apps must use real links (e.g. <a href="https://github.com/veliovgroup/jazeee-meteor-spiderable/blob/master/about">) rather than simply re-rendering portions of the page when an element is clicked. Apps should render their content based on the URL of the page and can use HTML5 pushState to alter the URL on the client without triggering a page reload. See the Todos example for a demonstration.
When running your page, spiderable will wait for all publications to be ready. Make sure that all of your publish functions either return a cursor (or an array of cursors), or eventually call this.ready(). Otherwise, the phantomjs executions will fail.