medialab / artoo

artoo.js - the client-side scraping companion.
http://medialab.github.io/artoo/
MIT License
1.1k stars 93 forks

Spider with recursive scrape not crawling as expected #245

Open danielelodola opened 8 years ago

danielelodola commented 8 years ago

I have the following spider:

var scraper = {
    iterator: 'body',
    data: {
        company_name: {
            sel: '.container > .row > .col-xs-12 > .row > .col-xs-12 > h1.pull-left',
            method: function($) {
                return $(this).text().trim().replace(/[\(\)]/g, '').replace('XYZ', '');
            }
        },
        company_details: {scrape: {iterator: 'ul.list-unstyled .wrap-1', data: 'text'}},
        details_labels: {scrape: {iterator: '#home > dl:not(#shipping > dl) dt', data: 'text'}},
        details_values: {scrape: {iterator: '#home > dl:not(#shipping > dl) dd', data: {
            method: function($) {
                return $(this).text().trim().replace(/[\(\)]/g, '').replace('XYZ', '');
            }
        }}}
    }
};

artoo.log.debug('Starting the scraper...');
var initialList = artoo.scrape(scraper);

artoo.ajaxSpider(
    ['URL1', 'URL2'],
    {
        scrape: scraper,
        concat: true,
        done: function(data) {
            artoo.log.debug('Finished retrieving data. Downloading...');
            artoo.savePrettyJson(initialList.concat(data), {filename: 'output_file.json'});
        }
    }
);

It extracts the expected data (company_name, company_details, details_labels and details_values) as required, but only on ONE page. The spider does not actually crawl the list of URLs I give it.

Where am I going wrong?

Thanks a bunch for your help!

Dan
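
(Editorial aside: the repeated `.trim().replace(...)` cleanup chain in the scraper above could be factored into one small helper. This is just a sketch; the `cleanText` name is illustrative and not part of artoo's API, and `'XYZ'` stands in for whatever site-specific token the author is stripping.)

```javascript
// Sketch of a cleanup helper mirroring the chained calls in the scraper
// above. The name cleanText and the 'XYZ' token are illustrative only.
function cleanText(text) {
    // Trim whitespace, strip parentheses, then drop the site-specific token.
    return text.trim().replace(/[()]/g, '').replace('XYZ', '');
}

console.log(cleanText('(XYZ)Acme Corp')); // -> 'Acme Corp'
```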

Yomguithereal commented 8 years ago

Hello @danielelodola. Can you tell me which page and URLs you are trying to scrape, so I can assess the root of the problem, please?

danielelodola commented 8 years ago

Hello; the pages require authentication in order to be accessed. How can I go about sharing the info with you?

Yomguithereal commented 8 years ago

Are you able to reproduce the issue on a different site? Another quick question: does the content you are crawling with the spider require JS execution on the page?

danielelodola commented 8 years ago

I have not tried to replicate the issue on other sites. However, I have been able to crawl the site I'm trying to extract data from using a simpler scraper model (retrieving only one data element, without embedded recursive iterators). The content I'm trying to retrieve is not JS-generated; it is plain HTML.

Yomguithereal commented 8 years ago

And so the problem does not occur when not using recursive scrapers?

danielelodola commented 8 years ago

No, it does not. I'm able to retrieve

company_name: {sel: '.container > .row > .col-xs-12 > .row > .col-xs-12 > h1.pull-left', method: function($) {return $(this).text().trim().replace(/[\(\)]/g, '').replace('XYZ','')}},

on multiple pages for example.

BTW, I have downloaded a local copy of a page, if that can help.

Yomguithereal commented 8 years ago

Can you try replacing the recursive scrape directives with a function that performs the scrape:

// Something along this:
{
  field: function($) { return $(this).scrape(...); }
}

and tell me whether this works or not?

danielelodola commented 8 years ago

Noted, I will try this and keep you posted. Thanks for taking the time to look into this issue ;-).

danielelodola commented 8 years ago

Hi @Yomguithereal, no luck whatsoever with the function($) approach! I just can't wrap my brain around it.

Yomguithereal commented 8 years ago

For instance, instead of

company_details: {scrape: {iterator: 'ul.list-unstyled .wrap-1', data: 'text'}}

you can write

company_details: function($) {
  return $('ul.list-unstyled .wrap-1').scrape();
}
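
(Editorial aside, for readers following along: the original scraper could be rewritten in this function-based style roughly as follows. This is only a sketch of the maintainer's suggestion; the exact options accepted by `$.fn.scrape` are not confirmed in this thread, so `details_values` falls back to plain jQuery `map()`/`get()` rather than assuming a `scrape()` signature.)

```javascript
// Sketch: the issue author's scraper with the recursive `scrape`
// directives replaced by functions, per Yomguithereal's suggestion.
// `$` here is artoo's bundled jQuery, passed in when the scrape runs.
var scraper = {
    iterator: 'body',
    data: {
        company_name: {
            sel: '.container > .row > .col-xs-12 > .row > .col-xs-12 > h1.pull-left',
            method: function($) {
                return $(this).text().trim().replace(/[\(\)]/g, '').replace('XYZ', '');
            }
        },
        company_details: function($) {
            return $('ul.list-unstyled .wrap-1').scrape();
        },
        details_labels: function($) {
            return $('#home > dl:not(#shipping > dl) dt').scrape();
        },
        details_values: function($) {
            // Plain jQuery mapping, to avoid assuming scrape() options here.
            return $('#home > dl:not(#shipping > dl) dd').map(function() {
                return $(this).text().trim().replace(/[\(\)]/g, '').replace('XYZ', '');
            }).get();
        }
    }
};
```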