medialab / sandcrawler

sandcrawler.js - the server-side scraping companion.
http://medialab.github.io/sandcrawler/
GNU Lesser General Public License v3.0

Can you do child requests? #184

Closed kevinrademan closed 4 years ago

kevinrademan commented 8 years ago

I was wondering if you had considered allowing "sub-requests" per url.

I hit page A to get all the product data, including the product id. Then I need to hit page B to get the product availability data, using the product id from page A.

Ideally I'd like the availability to be included in the response from page A.

Would something like this be possible?

Yomguithereal commented 8 years ago

You cannot do it per se, but you can still emulate it by building your data in the result callback. Typically, people use an external data variable to hold the scraped data and build it up while crawling. Say this variable is an object whose keys are your product ids, based on urls or on arbitrary data you pass to your jobs; you could then complete your data within the result callback.

I am not sure I make sense. Tell me if you understand what I mean. If not, I'll write an example :smile:.
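To make the idea concrete, here is a minimal sketch of the pattern described above: an external object keyed by product id that successive crawls fill in. The two `handle*` functions below are stand-ins for sandcrawler result callbacks (in a real spider this logic would live inside `.result(function(err, req, res) { ... })`); the simulated calls at the end only illustrate the data flow.

```javascript
// External data variable, keyed by product id, shared by both crawls.
var products = {};

// First crawl (page A): record each product under its id.
// Stand-in for the result callback of the product spider.
function handleProductResult(product) {
    products[product.id] = { name: product.name, offers: null };
}

// Second crawl (page B): look the product up by id and complete it.
// Stand-in for the result callback of the availability spider.
function handleAvailabilityResult(id, markets) {
    products[id].offers = markets;
}

// Simulated results, as if the two spiders had run in sequence.
handleProductResult({ id: '42', name: 'Example product' });
handleAvailabilityResult('42', ['market-1', 'market-2']);

console.log(JSON.stringify(products));
```

The point is that neither callback needs to know about the other's request; they only share the `products` object and agree on the id as the key.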

kevinrademan commented 8 years ago

Yep that makes perfect sense. I'm busy changing my code now. I did also find a dirty workaround (for testing only).

The idea is basically that you create a nested spider inside the scrape callback. This spider then hits the 2nd url and calls the parent scraper's "done" method in its result callback.

var sandcrawler = require('sandcrawler')

var spider = sandcrawler.spider()
    //.use(dashboard())
    .url('http://urlgoeshere/de/product/product.html')
    .scraper(function($, done) {
        var data = {
            id: $("[name=somename]").data("catentryid"),
            attributes: $('#features section').scrape({
                group: {
                    sel: 'h2',
                    method: 'html'
                },
                items: function() {
                    return $(this).find('dt').scrape({
                        title: 'text',
                        value: function() {
                            return $(this).next().html()
                        }
                    });
                }
            })
        };

        // Nested spider: fetch the availability data, then call the
        // outer scraper's done() with the combined result.
        sandcrawler.spider()
            //.use(dashboard())
            .url({
                url: 'http://urlgoeshere/avdata?catEntryId=' + data.id
            })
            .result(function(err, req, res) {
                data.offers = JSON.parse(res.body).markets;
                done(null, data);
            })
            .run(function(err, remains) {
                console.log('And we are done!');
            });

    })
    .result(function(err, req, res) {
        console.log('Scraped data:', JSON.stringify(res.data));
    })
    .run(function(err, remains) {
        //console.log('And we are done!');
    });
Yomguithereal commented 8 years ago

Yes, this works as well, though I'll admit it feels a bit convoluted. Note, however, that it won't work if you are using a phantom spider.

Yomguithereal commented 8 years ago

May I ask you, if it is not indiscreet, how you found this library?

kevinrademan commented 8 years ago

I'm currently looking into a few different scraping frameworks and found this one on https://www.npmjs.com/package/sandcrawler