ruipgil / scraperjs

A complete and versatile web scraper.
MIT License
3.71k stars 188 forks

how to scrape a list of urls #11

Closed gabrielflorit closed 10 years ago

gabrielflorit commented 10 years ago

Hi there,

How would I go about scraping a list of urls? I'm a bit stuck.

Thanks,

Gabriel

Prinzhorn commented 10 years ago

I'm sure you can combine these promises in an elegant way, but I'd use async as usual.

Untested and lacks proper error handling (I just randomly came across this repo)

var async = require('async');
var scraperjs = require('scraperjs');
var StaticScraper = scraperjs.StaticScraper;

var urls = [
    ['https://news.ycombinator.com/', function($) {
        return $('a').map(function() {
            return $(this).text();
        }).get();
    }],
    ['https://www.google.com/', function($) {
        return $('input').map(function() {
            return $(this).val();
        }).get();
    }]
];

async.mapSeries(urls, function(url, callback) {
    // url is a [url, scrapeFn] pair
    StaticScraper.create(url[0]).scrape(url[1], function(content) {
        callback(null, content);
    });
}, function(err, contents) {
    console.log(contents);
});

You could also use mapLimit to run a fixed number of requests in parallel in a deterministic manner (I wouldn't use map with an unknown number of urls).
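For reference, mapLimit behaves like mapSeries except that it runs up to `limit` tasks at once while still delivering results in input order. A minimal self-contained sketch of that behavior (illustrative only, not the async library's actual implementation):

```js
// Minimal sketch of mapLimit semantics: run at most `limit` workers at
// once, keep results in input order, call `done` exactly once.
function mapLimit(items, limit, worker, done) {
    var results = new Array(items.length);
    var next = 0, running = 0, finished = 0, failed = false;

    function onDone(i) {
        return function (err, result) {
            running--;
            finished++;
            if (failed) return;           // a previous task already errored
            if (err) { failed = true; return done(err); }
            results[i] = result;          // slot by index, not completion order
            if (finished === items.length) return done(null, results);
            launch();                     // refill the pool
        };
    }

    function launch() {
        while (!failed && running < limit && next < items.length) {
            var i = next++;
            running++;
            worker(items[i], onDone(i));
        }
    }
    launch();
}

// Usage: double each number, with at most 2 tasks "in flight" at a time.
mapLimit([1, 2, 3, 4, 5], 2, function (n, cb) {
    cb(null, n * 2);
}, function (err, results) {
    console.log(results); // results arrive in input order
});
```

Swapping mapSeries for async.mapLimit in the snippet above is a one-line change: `async.mapLimit(urls, 5, worker, finalCallback)`.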

ruipgil commented 10 years ago

Use a router, then just iterate through a list of URLs calling the route method (you can use async to keep the calls sequential). You can then use otherwise to store the URLs that didn't match a path, for later use.
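An untested sketch of that approach, assuming the Router API described in the README (`on`, `createStatic`, `scrape`, `otherwise`, `route`) — check the exact signatures against the current docs before relying on it:

```js
var async = require('async');
var scraperjs = require('scraperjs');

var router = new scraperjs.Router();
var unrouted = [];

// Route: pick a scraping function per URL pattern.
router.on('https://news.ycombinator.com/*')
    .createStatic()
    .scrape(function($) {
        return $('a').map(function() { return $(this).text(); }).get();
    }, function(titles) {
        console.log(titles);
    });

// URLs with no matching path get stored for later use.
router.otherwise(function(url) {
    unrouted.push(url);
});

var urls = ['https://news.ycombinator.com/', 'https://example.com/'];

// Route one URL at a time; async.eachSeries keeps the iteration sequential.
async.eachSeries(urls, function(url, next) {
    router.route(url, function(found) {
        next();
    });
}, function() {
    console.log('URLs without a route:', unrouted);
});
```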