Hash fragments don't seem to be supported by PhantomJS

simonexmachina commented 12 years ago

I've posted this on the PhantomJS Google Group, but I thought I'd ask here as well in case you knew the answer.

Run the following script using: phantomjs test.coffee

page = require('webpage').create()
page.onResourceReceived = (response) ->
  console.log('Received ' + response.url)
url = 'http://fiddle.jshell.net/simonvwade/wpstb/11/show/'
page.open url, (status) ->
  console.log 'Finished loading'
  document.location.href = '#!foo'

The fiddle that is loaded (see http://jsfiddle.net/simonvwade/wpstb/11/) makes an AJAX request whenever the hash fragment changes (try clicking the link). However, this doesn't happen when document.location.href is set in PhantomJS.

I would expect to see "Received ...normalize.css" logged after "Finished loading".

nrabinowitz commented 12 years ago

I don't think document.location.href is the right way to switch hash fragments. You may be able to set document.location.hash directly, or you can page.open the URL with its hash. Hash fragments do work in PhantomJS - I've tested several Backbone apps, and they work fine.
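
A minimal sketch of the first suggestion, under one assumption worth stating loudly: in a PhantomJS script, the top-level `document` is not the document of the loaded page, so the assignment has to run inside the page via page.evaluate. `setHashInPage` is a hypothetical helper name, and the 2-second wait before exiting is an arbitrary choice, not a recommended value.

```javascript
// Sketch only: setHashInPage is a hypothetical helper, not part of PhantomJS.
// page.evaluate() runs the function inside the loaded page, where `window`
// refers to that page rather than to the PhantomJS script itself.
function setHashInPage(page, fragment) {
  return page.evaluate(function (f) {
    window.location.hash = f;
    return window.location.hash;
  }, fragment);
}

// This part only runs under PhantomJS, where the global `phantom` exists.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  page.onResourceReceived = function (response) {
    console.log('Received ' + response.url);
  };
  page.open('http://fiddle.jshell.net/simonvwade/wpstb/11/show/', function (status) {
    console.log('Finished loading: ' + status);
    setHashInPage(page, '!foo');
    // Arbitrary delay so the page's AJAX request has a chance to fire.
    setTimeout(function () { phantom.exit(); }, 2000);
  });
}
```

The helper takes the page as an argument so the same function can be reused for any webpage object created in the script.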

simonexmachina commented 12 years ago

Thanks Nick, appreciate the response. Two things:

First, switching to document.location.hash, or including the hash in the call to page.open, doesn't fix the problem:

page = require('webpage').create()
page.onResourceReceived = (response) ->
 console.log('Received ' + response.url)
url = 'http://fiddle.jshell.net/simonvwade/wpstb/11/show/'
page.open url, (status) ->
 console.log 'Finished loading, setting document.location.hash to "!foo"'
 document.location.hash = '!foo'
page.open url, (status) ->
 console.log 'Finished loading, opening ' + url + '#!foo'
 document.open url + '#!foo', (status) ->
   console.log 'Finished loading with #!foo in call to page.open'

Second, if hash fragments do work in PhantomJS, then pjscrape doesn't seem to be handling the #! links in my test page. Here's my test configuration:

var url = 'http://fiddle.jshell.net/simonvwade/wpstb/12/show/'; // 'http://staging.eaauctions.com.au/test.html';

pjs.config({
 // write each item to a file
 writer: 'itemfile',
 // just write the content that is returned from the scraper
 format: 'raw'
});

pjs.addSuite({
 // url to start at
 url: url,
 // selector to find more urls to spider
 moreUrls: function() {
   var urls = [];
   $('a[href*="#!"]').each(function() {
     // use this.href to get an absolute link
     urls.push(this.href);
   });
   return urls;
 },
 // no limit to depth
 maxDepth: null,
 // function to get some data
 scraper: function() {
   var url = document.location.hash.replace(/^#!/, '') || 'index'
     , file = 'static-cache/' + escape(url) + '.html';
   return {
     filename: file,
     content: document.documentElement.outerHTML
   }
 }
});

nrabinowitz commented 12 years ago

It looks like you're trying to do an async operation (loading something via AJAX, based on the hash) in a synchronous way - the AJAX call won't have completed by the time you're trying to get the page content. It doesn't have to do with support for the hash, but with the async update.
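
The timing issue can be illustrated outside of pjscrape with a plain JavaScript sketch. Everything here is hypothetical and only simulates the situation: `fakePage` stands in for the page, `loadTitleAsync` stands in for the AJAX request, and `waitFor` is a simple polling helper, not pjscrape's implementation.

```javascript
// Hypothetical stand-in for a page whose content is filled in by AJAX.
var fakePage = { title: null };

// Simulates the AJAX request: the title only arrives after delayMs.
function loadTitleAsync(page, delayMs) {
  setTimeout(function () { page.title = 'Example title'; }, delayMs);
}

// Hypothetical polling helper: calls callback once test() returns true.
function waitFor(test, callback, intervalMs) {
  var timer = setInterval(function () {
    if (test()) {
      clearInterval(timer);
      callback();
    }
  }, intervalMs);
}

loadTitleAsync(fakePage, 50);

// Synchronous read: the simulated request has not completed yet.
var syncResult = fakePage.title;
console.log('sync scrape:', syncResult);

// Asynchronous read: poll until the content exists, then read it.
waitFor(function () { return fakePage.title !== null; }, function () {
  console.log('async scrape:', fakePage.title);
}, 10);
```

The synchronous read sees `null` because the simulated request is still pending, which is the same reason a synchronous scraper captures the page before the hash-driven update lands.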

Pjscrape has support for this with async scrapers. E.g.:

pjs.addSuite({
    url: 'http://gap.alexandriaarchive.org/gapvis/index.html#book/17',
    scraper: {
        async: true,
        scraper: function() {
            _pjs.waitForElement('h2.book-title', function() {
                _pjs.items = _pjs.getText('h2.book-title');
            });
        }
    }
});
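
Applied to the configuration posted earlier in this thread, the async pattern might look something like the following. This is an untested sketch, not a confirmed fix: '#content' is a placeholder selector for whatever element the AJAX response fills in on the test page, and it assumes `_pjs.items` accepts the same object that the synchronous scraper returned.

```javascript
pjs.addSuite({
    url: 'http://fiddle.jshell.net/simonvwade/wpstb/12/show/',
    maxDepth: null,
    moreUrls: function() {
        var urls = [];
        $('a[href*="#!"]').each(function() {
            urls.push(this.href);
        });
        return urls;
    },
    scraper: {
        async: true,
        scraper: function() {
            // '#content' is a placeholder: wait for the element that the
            // hash-driven AJAX call actually populates.
            _pjs.waitForElement('#content', function() {
                var route = document.location.hash.replace(/^#!/, '') || 'index';
                // Assumption: items may be an object, as in the sync scraper.
                _pjs.items = {
                    filename: 'static-cache/' + escape(route) + '.html',
                    content: document.documentElement.outerHTML
                };
            });
        }
    }
});
```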

simonexmachina commented 12 years ago

Great, thanks. I'll check that out.

nrabinowitz / pjscrape

Hash fragments don't seem to be supported by PhantomJS #25