nrabinowitz / pjscrape

A web-scraping framework written in Javascript, using PhantomJS and jQuery
http://nrabinowitz.github.io/pjscrape/
MIT License
995 stars 158 forks

Nested Scrapes #38

nodeGarden closed 11 years ago

nodeGarden commented 11 years ago

I'm trying to figure out how to do a nested scrape which relies on data from the first scraper in the second.

I'm pulling the artist names from: http://www.billboard.com/artists/top-100?page=0

This part works:

pjs.addSuite({
    url: 'http://www.billboard.com/artists/top-100?page=0',
    scraper: function() {
        var artists=[]; 
        $(".artist-top-100 h1 > a").each(function(i,el){ 
            artists.push( {'name':$(el).text(), 'url':$(el).attr("href")} ); 
        }); 
        return artists;  
    }
});

I then want to go into the individual Artist's page and grab the top songs: http://www.billboard.com/artist/371422/taylor-swift

Individually this works too:

pjs.addSuite({
    url: 'http://www.billboard.com/artist/371422/taylor-swift',
    scraper: function() {
        var songs=[]; 
        $(".module_chart_position b").each(function(i,el){ 
            songs.push( $(el).text() ); 
        }); 
        return songs;  
    }
});

But what I want is for the return value of scrape #2 to be part of the return value of scrape #1, so that the output looks more like:

[
    {
      name: 'Taylor Swift',
      url: '/artist/371422/taylor-swift',
      songs: ['I Knew You Were Trouble', '22']
    }
    ...
]

When I try to nest the two suites, it says _name and _url are undefined.

pjs.addSuite({
    url: 'http://www.billboard.com/artists/top-100?page=0',
    scraper: function() {
        var artists=[]; 
        $(".artist-top-100 h1 > a:eq(0)").each(function(i,el){ 
            var _name = $(el).text();
            var _url = $(el).attr("href");
            artists.push( {'name':_name, 'url':_url} ); 
        });

        (function(_name,_url){
            pjs.addSuite({
                url: _url,
                scraper: function() {
                    var songs=[]; 
                    $(".module_chart_position b").each(function(i,el){ 
                        songs.push( $(el).text() ); 
                    }); 
                    return songs;  
                }
            });
        })(_name, _url);

        return artists;  
    }
});


I see the note on the Documentation page about the private scope, but I don't quite understand how to apply the suggested evaluate approach. I guess the question is: is there a workaround for this, or is there another way to accomplish the above?

nrabinowitz commented 11 years ago

Yeah, you won't be able to invoke anything involving Pjscrape from inside a scraper function - it's completely sandboxed, with no access to the PhantomJS environment.

You want the moreUrls option - see any of the "recursive" tests for examples (https://github.com/nrabinowitz/pjscrape/tree/master/tests)

chrisribe commented 11 years ago

Sorry to bump this issue but I too am having issues running a nested scrape. I have looked at the moreUrls option and the recursive tests examples but still cannot seem to get it to work.

I don't get how you pipe the URLs from the first scrape to the moreUrls parameter. Is it possible at all?

Thanks Chris

nrabinowitz commented 11 years ago

moreUrls takes a function, which is executed in the remote context and should return a list of URL strings - this isn't working?

nrabinowitz commented 11 years ago

(It can also just take a selector, in the simple case.)
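
For the Billboard example from the original question, the two forms described above might look roughly like the sketch below. The link and song selectors are copied from the question, and the map/_pjs.toFullUrl pattern is the one used elsewhere in this thread. The 'h1' selector for the artist name on the artist page is an assumption, and the guard around pjs is there only so the snippet can also be loaded outside Pjscrape.

```javascript
// Sketch only, not a tested Pjscrape config.
var artistSuite = {
    url: 'http://www.billboard.com/artists/top-100?page=0',
    // Function form: runs in the page and returns a list of URL strings.
    moreUrls: function() {
        return $('.artist-top-100 h1 > a').map(function() {
            return _pjs.toFullUrl($(this).attr('href'));
        }).toArray();
    },
    // Runs in each loaded page, so every artist page yields one item.
    scraper: function() {
        var songs = [];
        $('.module_chart_position b').each(function(i, el) {
            songs.push($(el).text());
        });
        return {
            name: $('h1').first().text(),   // assumed selector for the artist name
            url: window.location.pathname,
            songs: songs
        };
    }
};

// Selector form, for the simple case: the same links, given as a selector.
var artistSuiteSimple = {
    url: artistSuite.url,
    moreUrls: '.artist-top-100 h1 > a',
    scraper: artistSuite.scraper
};

// Register with Pjscrape only when running under it.
if (typeof pjs !== 'undefined') {
    pjs.addSuite(artistSuite);
}
```

Two caveats with this sketch: the scraper would presumably also run on the listing page itself, giving one item with an empty songs list, and the output is a flat list of per-artist items rather than songs nested inside the first scrape's results.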

chrisribe commented 11 years ago

Hi,

I got the moreUrls part extracting the URLs, but it seems there is some AJAX/JavaScript happening, because I cannot open them manually in a browser. (That's probably why it's not working.)

The page loads the result dynamically. Can I trigger a click event for each link element, then extract the data and continue with the next one? (Page.goBack()??)

Thank you for your help. Chris

nrabinowitz commented 11 years ago

If they aren't really URLs (i.e. PhantomJS can't load them), then you'll likely need to handle the entire thing within your scraper function, triggering the click and return events there. Hard to offer more help w/o seeing the site in question.
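
As a rough illustration of handling the whole thing within the scraper function, the click-then-read loop could be sketched as below. clickThrough is a hypothetical helper, not part of Pjscrape; the fixed delay is a crude stand-in for waiting until the page has actually updated, and how the collected results get handed back to Pjscrape from an asynchronous scraper is not shown here.

```javascript
// Hypothetical helper (not a Pjscrape API): for each item, trigger a click,
// wait delayMs for the page to update, read a result, then move to the next.
// Calls done(results) once every item has been processed.
function clickThrough(items, triggerClick, readResult, delayMs, done) {
    var results = [];
    function next(i) {
        if (i >= items.length) {
            done(results);
            return;
        }
        triggerClick(items[i]);
        setTimeout(function() {
            results.push(readResult());
            next(i + 1);
        }, delayMs);
    }
    next(0);
}
```

Inside a page with jQuery, items might be $('a.some-link').toArray(), triggerClick might be function(el) { $(el).trigger('click'); }, and readResult might return the text of whatever element the click populates. All of those selectors are placeholders that depend on the site.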

chrisribe commented 11 years ago

Hi thanks for the reply,

I am trying to extract the items from the Google Trends page. I can get all the categories fine like this:

pjs.addSuite({
    url: 'http://www.google.ca/trends/topcharts',
    scraper: function() 
    {
        return $('.topcharts-category-charts-container').children().map(function() 
        {
            return _pjs.toFullUrl($(this).find("a.topcharts-smallchart-title-link").attr("href"));
        }).toArray();
    }
});

But if I try to run it with moreUrls to get the category details:

pjs.addSuite({
    url: 'http://www.google.ca/trends/topcharts',
    moreUrls: function() {
        return $('.topcharts-category-charts-container').children().map(function() 
        {
            return _pjs.toFullUrl($(this).find("a.topcharts-smallchart-title-link").attr("href"));
        }).toArray();
    },
    scraper : function(){
        return $('.common-title-text').first().text();
    }
});

I get "page did not load" errors, which is expected, since Google does not let you load the URLs directly. Any ideas or suggestions? Thanks