
Scraperjs


Scraperjs is a web scraper module that makes scraping the web an easy job.

Installing

npm install scraperjs

If you would like to run the tests (this is optional and requires installing with the --save-dev flag),

grunt test

To use some features you'll need to install phantomjs, if you haven't already.
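
One way to get it, if you'd rather install through npm than grab the binary from phantomjs.org, is the prebuilt package (not a scraperjs requirement, just one option):

npm install phantomjs-prebuilt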

Getting started

Scraperjs exposes two different scrapers: a StaticScraper, which uses cheerio behind the scenes and is fast and light but doesn't execute the page's JavaScript, and a DynamicScraper, which runs the page in PhantomJS (with jQuery injected) so dynamic content can be scraped.

Let's scrape Hacker News with both scrapers.

Try to spot the differences.

Static Scraper

var scraperjs = require('scraperjs');
scraperjs.StaticScraper.create('https://news.ycombinator.com/')
    .scrape(function($) {
        return $(".title a").map(function() {
            return $(this).text();
        }).get();
    })
    .then(function(news) {
        console.log(news);
    })

The scrape promise receives a function that will scrape the page and return the result; it only receives jQuery as a parameter to scrape the page with. Still, very powerful. It uses cheerio to do the magic behind the scenes.

Dynamic Scraper

var scraperjs = require('scraperjs');
scraperjs.DynamicScraper.create('https://news.ycombinator.com/')
    .scrape(function($) {
        return $(".title a").map(function() {
            return $(this).text();
        }).get();
    })
    .then(function(news) {
        console.log(news);
    })

Again, the scrape promise receives a function to scrape the page. The only difference is that, because we're using a dynamic scraper, the scraping function is sandboxed with the page's scope only, so no closures! This means that in this (and only in this) scraper you can't call a function that has not been defined inside the scraping function. Also, the result of the scraping function must be JSON-serializable. We use phantom and phantomjs to make it happen, and we also inject jQuery for you.
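
To make the "no closures" rule concrete, here's a minimal sketch of what not to do; cleanTitle is a hypothetical helper defined in the Node process, so the page sandbox can't see it:

var scraperjs = require('scraperjs');

// This helper lives in the Node process, not inside the PhantomJS page,
// so the dynamic scraping function below can't see it.
function cleanTitle(text) {
    return text.trim();
}

scraperjs.DynamicScraper.create('https://news.ycombinator.com/')
    .scrape(function($) {
        return $(".title a").map(function() {
            // Fails inside the page: cleanTitle is not defined there.
            // Define helpers inside the scraping function (or inject them) instead.
            return cleanTitle($(this).text());
        }).get();
    })
    .then(function(news) {
        console.log(news);
    });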

However, it's possible to pass JSON-serializable data to any scraper.

The $ variable received by the scraping function is, only for the dynamic scraper, hardcoded.

Show me the way! (aka Routes)

For more flexible scraping and crawling of the web we sometimes need to go through multiple web sites, and we don't want to map every possible url format. For that scraperjs provides the Router class.

Example

var scraperjs = require('scraperjs'),
    router = new scraperjs.Router();

router
    .otherwise(function(url) {
        console.log("Url '"+url+"' couldn't be routed.");
    });

var path = {};

router.on('https?://(www.)?youtube.com/watch/:id')
    .createStatic()
    .scrape(function($) {
        return $("a").map(function() {
            return $(this).attr("href");
        }).get();
    })
    .then(function(links, utils) {
        path[utils.params.id] = links;
    });

router.route("https://www.youtube.com/watch/YE7VzlLtp-4", function() {
    console.log("i'm done");
});

The code that allows for parameters in paths comes from the Routes.js project; information about the path formatting is there too.

API overview

Scraperjs uses promises whenever possible.

StaticScraper, DynamicScraper and ScraperPromise

So, the scrapers should be used through a ScraperPromise. You get one by creating a scraper,

var scraperPromise = scraperjs.StaticScraper.create() // or DynamicScraper

The following promises can be made over it; they all return a scraper promise,

All callback functions receive a utils object as their last parameter; through it the parameters of a url matched by a router can be accessed, and the chain can be stopped.

DynamicScraper.create()
    .get("http://news.ycombinator.com")
    .then(function(_, utils) {
        utils.stop();
        // utils.params.paramName
    });

The promise chain is fired in the same sequence it was declared, with the exception of the promises get and request, which fire the chain when they've received a valid response, and the promises done and catch, which were explained above.

You can also waterfall values between promises by returning them (with the exception of the promise timeout, which will always return undefined), and the value can be accessed through utils.lastReturn.
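
For example, a minimal sketch of waterfalling, reusing the Hacker News page from above (the callback signatures here follow the examples in this document):

var scraperjs = require('scraperjs');
scraperjs.StaticScraper.create()
    .get('https://news.ycombinator.com/') // fires the chain once a valid response arrives
    .scrape(function($) {
        return $(".title a").map(function() {
            return $(this).text();
        }).get();
    })
    .then(function(titles, utils) {
        // the value returned by the previous promise is passed in
        // and is also available as utils.lastReturn
        return titles.length;
    })
    .then(function(count, utils) {
        console.log(utils.lastReturn + " titles scraped");
    });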

The utils object

You've seen the utils object that is passed to promises; it provides useful information and methods to your promises. Here's what you can do with it: access the parameters of a routed url through utils.params, stop the promise chain with utils.stop(), read the value returned by the previous promise through utils.lastReturn, and reach the underlying scraper through utils.scraper (used below to inject code into the page).

A more powerful DynamicScraper.

When lots of instances of DynamicScraper are needed, their creation gets really heavy on resources and takes a lot of time. To make this lighter you can use a factory, which will create only one PhantomJS instance, and every DynamicScraper will request a page to work with. To use it you must start the factory before any DynamicScraper is created, with scraperjs.DynamicScraper.startFactory(), and then close the factory after the execution of your program, with scraperjs.DynamicScraper.closeFactory(). To make the scraping function more robust you can inject code into the page,

var ds = scraperjs.DynamicScraper
    .create('http://news.ycombinator.com')
    .async(function(_, done, utils) {
        utils.scraper.inject(__dirname+'/path/to/code.js', function(err) {
            // in this case, if there was an error, it won't fire the catch promise.
            if(err) {
                done(err);
            } else {
                done();
            }
        });
    })
    .scrape(function() {
        return functionInTheCodeInjected();
    })
    .then(function(result) {
        console.log(result);
    });

Router

The router should be initialized like a class

var router = new scraperjs.Router(options);

The options object is optional, and these are the options:

The following promises can be made over it,

Notes

More

Check the examples, the tests, or just dig into the code; it's well documented and simple to understand.

Dependencies

As mentioned above, scraperjs uses some dependencies to do the heavy work, such as cheerio, phantom and phantomjs, and Routes.js.

License

This project is under the MIT license.