Closed by kevinrademan 4 years ago.
You cannot do it per se, but you can emulate it by building your data in the result callback. Typically, people use an external data variable that holds the scraped data and is built while crawling. Say this variable is an object keyed by your product ids: based on the urls, or on arbitrary data you pass to your jobs, you can very well complete your data within the result callback.
I am not sure I make sense. Tell me if you understand what I mean. If not, I'll write an example :smile:.
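For what it's worth, here is a rough sketch of what I mean, with made-up urls standing in for your pages A and B. I am also assuming here that .url() accepts an array of feeds and that the job's url is exposed as req.url in the result callback:

var sandcrawler = require('sandcrawler');

// External data variable, keyed by product id, built while crawling.
var products = {};

sandcrawler.spider()
  .url('http://example.com/pageA') // made-up product page url
  .scraper(function($, done) {
    // Extract whatever identifies the product on page A.
    done(null, {id: $('.product').data('id')});
  })
  .result(function(err, req, res) {
    if (!err) products[res.data.id] = res.data;
  })
  .run(function(err, remains) {

    // Second spider: one page B request per collected product id.
    var feeds = Object.keys(products).map(function(id) {
      return {url: 'http://example.com/pageB?id=' + id};
    });

    sandcrawler.spider()
      .url(feeds)
      .result(function(err, req, res) {
        if (err) return;

        // Recover the id from the job's url to complete its entry.
        var id = req.url.split('id=')[1];
        products[id].availability = JSON.parse(res.body);
      })
      .run(function(err, remains) {
        console.log('Complete data:', JSON.stringify(products));
      });
  });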
Yep that makes perfect sense. I'm busy changing my code now. I did also find a dirty workaround (for testing only).
The idea is basically that you create a nested spider in the scraper callback. This nested spider then hits the second url and calls the parent scraper's "done" method from its result callback.
var sandcrawler = require('sandcrawler');

var spider = sandcrawler.spider()
  //.use(dashboard())
  .url('http://urlgoeshere/de/product/product.html')
  .scraper(function($, done) {

    // First hit: scrape the product page for its id and attributes.
    var data = {
      id: $('[name=somename]').data('catentryid'),
      attributes: $('#features section').scrape({
        group: {
          sel: 'h2',
          method: 'html'
        },
        items: function() {
          return $(this).find('dt').scrape({
            title: 'text',
            value: function() {
              return $(this).next().html();
            }
          });
        }
      })
    };

    // Second hit: a nested spider fetches the availability data and
    // calls the parent scraper's done from its result callback.
    sandcrawler.spider()
      //.use(dashboard())
      .url({
        url: 'http://urlgoeshere/avdata?catEntryId=' + data.id
      })
      .result(function(err, req, res) {
        data.offers = JSON.parse(res.body).markets;
        done(null, data);
      })
      .run(function(err, remains) {
        console.log('And we are done!');
      });
  })
  .result(function(err, req, res) {
    console.log('Scraped data:', JSON.stringify(res.data));
  })
  .run(function(err, remains) {
    //console.log('And we are done!');
  });
Yes, this works too, though I'll admit it feels a bit convoluted. It won't work, however, if you are using a phantom spider.
May I ask you, if it is not indiscreet, how you found this library?
I'm currently looking into a few different scraping frameworks and found it on https://www.npmjs.com/package/sandcrawler.
I was wondering if you had considered allowing "sub-requests" per url.
I hit page A to get all product data, including the product id. Then I need to hit page B to get the product availability data, using the product id from page A.
Ideally I'd like the availability to be included in the response from page A.
Would something like this be possible?
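For example, something along these lines (purely illustrative: subRequest is a name I just made up, no such helper exists today):

var sandcrawler = require('sandcrawler');

sandcrawler.spider()
  .url('http://example.com/pageA')
  .scraper(function($, done) {
    var data = {id: $('.product').data('id')};

    // Hypothetical sub-request tied to this job: done is only called
    // once page B's response has been merged into page A's data.
    this.subRequest('http://example.com/pageB?id=' + data.id, function(err, res) {
      data.availability = JSON.parse(res.body);
      done(null, data);
    });
  })
  .result(function(err, req, res) {
    // res.data would then contain both product and availability data.
    console.log(res.data);
  })
  .run(function(err, remains) {});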