nrabinowitz / pjscrape

A web-scraping framework written in Javascript, using PhantomJS and jQuery
http://nrabinowitz.github.io/pjscrape/
MIT License

Hash fragments don't seem to be supported by PhantomJS #25

Closed by simonexmachina 12 years ago

simonexmachina commented 12 years ago

I've posted this on the PhantomJS Google Group, but I thought I'd ask here as well in case you knew the answer.

Run the following script using: phantomjs test.coffee

page = require('webpage').create()
page.onResourceReceived = (response) ->
  console.log('Received ' + response.url)
url = 'http://fiddle.jshell.net/simonvwade/wpstb/11/show/'
page.open url, (status) ->
  console.log 'Finished loading'
  document.location.href = '#!foo'

The fiddle that is loaded (see http://jsfiddle.net/simonvwade/wpstb/11/) makes an AJAX request whenever the hash fragment changes (try clicking the link). However, this doesn't happen when document.location.href is set in PhantomJS.

I would expect to see "Received ...normalize.css" showing after "Finished loading"

nrabinowitz commented 12 years ago

I don't think setting document.location.href is the right way to change the hash fragment. You may be able to set document.location.hash directly, or you can page.open the URL with its hash. Hash fragments do work in PhantomJS: I've tested several Backbone apps, and they work fine.
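
A minimal sketch of both options, for illustration only. It assumes `page` is a PhantomJS webpage object (from require('webpage').create()) and a PhantomJS version where page.evaluate accepts extra arguments; the helper names are made up. Note that the script itself runs outside the loaded page, so the page's own document is reached through page.evaluate:

```javascript
// Sketch only: helper names are hypothetical, not part of PhantomJS.

// Build a URL carrying the given hash fragment, replacing any existing one.
function withHash(url, fragment) {
  var base = url.split('#')[0];
  return base + '#' + fragment.replace(/^#/, '');
}

// Option 1: open the URL with the hash already in it.
function openWithHash(page, url, fragment, done) {
  page.open(withHash(url, fragment), done);
}

// Option 2: load the page first, then set location.hash inside the page.
// page.evaluate runs the function in the page's own context, which is
// where the page's `document` lives.
function openThenSetHash(page, url, fragment, done) {
  page.open(url, function (status) {
    if (status === 'success') {
      page.evaluate(function (frag) {
        document.location.hash = frag;
      }, fragment);
    }
    done(status);
  });
}
```

Either helper would be called with the page object and a callback, e.g. openWithHash(page, url, '!foo', function (status) { ... }).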

simonexmachina commented 12 years ago

Thanks Nick, appreciate the response. Two things:

  1. Changing to use document.location.hash or including the hash in the call to page.open doesn't fix the problem:

    page = require('webpage').create()
    page.onResourceReceived = (response) ->
      console.log('Received ' + response.url)
    url = 'http://fiddle.jshell.net/simonvwade/wpstb/11/show/'
    page.open url, (status) ->
      console.log 'Finished loading, setting document.location.hash to "!foo"'
      document.location.hash = '!foo'
    page.open url, (status) ->
      console.log 'Finished loading, opening ' + url + '#!foo'
      document.open url + '#!foo', (status) ->
        console.log 'Finished loading with #!foo in call to page.open'

  2. If that's the case then pjscrape doesn't seem to be handling the #! links in my test page. Here's my test configuration:

    var url = 'http://fiddle.jshell.net/simonvwade/wpstb/12/show/'; // 'http://staging.eaauctions.com.au/test.html';

    pjs.config({
      // write each item to a file
      writer: 'itemfile',
      // just write the content that is returned from the scraper
      format: 'raw'
    });

    pjs.addSuite({
      // url to start at
      url: url,
      // selector to find more urls to spider
      moreUrls: function() {
        var urls = [];
        $('a[href*="#!"]').each(function() {
          // use this.href to get an absolute link
          urls.push(this.href);
        });
        return urls;
      },
      // no limit to depth
      maxDepth: null,
      // function to get some data
      scraper: function() {
        var url = document.location.hash.replace(/^#!/, '') || 'index'
          , file = 'static-cache/' + escape(url) + '.html';
        return {
          filename: file,
          content: document.documentElement.outerHTML
        };
      }
    });

nrabinowitz commented 12 years ago

It looks like you're trying to do an async operation (loading something via AJAX, based on the hash) in a synchronous way: the AJAX call won't have completed by the time you're trying to get the page content. The problem isn't support for the hash, it's the async update.
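
To illustrate the timing issue in plain JavaScript (setTimeout here is only a stand-in for the AJAX request, not what the fiddle actually does):

```javascript
// Stand-in for a page whose content is filled in by an AJAX callback.
// setTimeout is only a placeholder for the real request.
var content = 'placeholder markup';
setTimeout(function () {
  content = 'markup loaded via AJAX';
}, 0);

// A synchronous scraper reads the page right away, before the callback runs.
var snapshot = content;
console.log(snapshot); // still the placeholder
```

The read happens before the callback gets a chance to run, so the scraper has to wait for the update instead of returning immediately.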

Pjscrape has support for this with async scrapers. E.g.:

pjs.addSuite({
    url: 'http://gap.alexandriaarchive.org/gapvis/index.html#book/17',
    scraper: {
        async: true,
        scraper: function() {
            _pjs.waitForElement('h2.book-title', function() {
                _pjs.items = _pjs.getText('h2.book-title');
            });
        }
    }
});
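
Applied to the suite posted earlier in the thread, that might look roughly like the sketch below. The '#content' selector is hypothetical (use whatever element the page fills in once its AJAX request has completed), and wrapping the item in an array is an assumption made here to match the list-style result in the example above:

```javascript
// Sketch only: the earlier suite with the scraper made async.
var suite = {
  url: 'http://fiddle.jshell.net/simonvwade/wpstb/12/show/',
  // collect the #! links to spider, as in the original configuration
  moreUrls: function () {
    var urls = [];
    $('a[href*="#!"]').each(function () {
      urls.push(this.href);
    });
    return urls;
  },
  maxDepth: null,
  scraper: {
    async: true,
    scraper: function () {
      // '#content' is a hypothetical selector, not taken from the fiddle
      _pjs.waitForElement('#content', function () {
        var name = document.location.hash.replace(/^#!/, '') || 'index';
        // assumption: hand the single item over as a one-element array
        _pjs.items = [{
          filename: 'static-cache/' + escape(name) + '.html',
          content: document.documentElement.outerHTML
        }];
      });
    }
  }
};

// `pjs` is only defined when the file is run by pjscrape.
if (typeof pjs !== 'undefined') {
  pjs.addSuite(suite);
}
```

The pjs.config call from the original configuration (writer: 'itemfile', format: 'raw') would stay as it was.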

simonexmachina commented 12 years ago

Great, thanks. I'll check that out.