nrabinowitz / pjscrape

A web-scraping framework written in Javascript, using PhantomJS and jQuery
http://nrabinowitz.github.io/pjscrape/
MIT License

Failing on some multiple URLs #13

Closed justnom closed 12 years ago

justnom commented 12 years ago

Overview

I am trying to create a very basic scraper with PhantomJS and the pjscrape framework.

My Code

pjs.config({
    // timeout settings (ms) for page loads and async scrapes
    timeoutInterval: 6000,
    timeoutLimit: 10000,
    // write results as CSV to a file
    format: 'csv',
    csvFields: ['productTitle', 'price'],
    writer: 'file',
    outFile: 'D:\\prod_details.csv'
});

pjs.addSuite({
    title: 'ChainReactionCycles Scraper',
    url: productURLs, // an array of URLs; two example arrays are defined below
    scrapers: [
        function() {
            var results = [];
            // _pjs.getText returns an array of text contents for the matched elements
            var linkTitle = _pjs.getText('#ModelsDisplayStyle4_LblTitle');
            var linkPrice = _pjs.getText('#ModelsDisplayStyle4_LblMinPrice');
            results.push([linkTitle[0], linkPrice[0]]);
            return results;
        }
    ]
});

URL Arrays Used

This first array DOES NOT WORK and fails after the 3rd or 4th URL.

var productURLs = ["8649","17374","7327","7325","14892","8650","8651","14893","18090","51318"];
for(var i=0;i<productURLs.length;++i){
  productURLs[i] = 'http://www.chainreactioncycles.com/Models.aspx?ModelID=' + productURLs[i];
}

This second array WORKS and does not fail, even though it is from the same site.

var categoriesURLs = ["304","2420","965","518","514","1667","521","1302","1138","510"];
for(var i=0;i<categoriesURLs.length;++i){
  categoriesURLs[i] = 'http://www.chainreactioncycles.com/Categories.aspx?CategoryID=' + categoriesURLs[i];
}

Problem

When iterating through productURLs, the PhantomJS page.open callback reports a failure even when the page hasn't finished loading.

I know this because I started the script while running an HTTP debugger, and the HTTP requests were still running even after PhantomJS had reported a page load failure.

However, the code works fine when running with categoriesURLs.
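
To rule pjscrape out, a minimal standalone PhantomJS script along the lines below should show what page.open itself reports for these URLs. This is just a test sketch of my own, not pjscrape code; the queue logic and filename are mine.

// repro.js -- run with: phantomjs repro.js
// Opens the product URLs one at a time and logs the status that
// page.open reports ('success' or 'fail') for each.
var page = require('webpage').create();
page.settings.loadImages = false; // skip images, as in the pjscrape attempt

var ids = ['8649', '17374', '7327', '7325', '14892'];
var urls = [];
for (var i = 0; i < ids.length; i++) {
    urls.push('http://www.chainreactioncycles.com/Models.aspx?ModelID=' + ids[i]);
}

function next() {
    if (urls.length === 0) {
        phantom.exit();
        return;
    }
    var url = urls.shift();
    page.open(url, function(status) {
        console.log(status + ': ' + url);
        next();
    });
}
next();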

Assumptions

  1. All the URLs listed above are VALID
  2. I have the latest versions of both PhantomJS and pjscrape

Possible Solutions

These are solutions I have tried thus far.

  1. Disabling image loading with page.options.loadImages = false
  2. Setting a larger timeoutInterval in pjs.config. This was not useful, apparently, as the error generated was a page.open failure and NOT a timeout failure. (A sketch of both attempts follows this list.)
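
For reference, a rough sketch of what those two attempts looked like. Note that loadImages is a raw PhantomJS page setting; whether pjscrape exposes it anywhere is an assumption on my part, and it may need a patch to the framework itself:

// Attempt 1: disable image loading. This is the raw PhantomJS API
// (page.settings.loadImages); I'm assuming it has to be set on the
// PhantomJS page object, since I don't see a documented pjscrape
// option for it.
var page = require('webpage').create();
page.settings.loadImages = false;

// Attempt 2: raise the timeouts in pjs.config (values here are
// arbitrary). This didn't help: the error was a page.open failure,
// NOT a timeout failure.
pjs.config({
    timeoutInterval: 6000,
    timeoutLimit: 30000
});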

Any ideas?

justnom commented 12 years ago

The same question is on Stack Overflow as well: http://stackoverflow.com/questions/9647277/phantomjs-and-pjscrape-failing-on-some-multiple-urls

nrabinowitz commented 12 years ago

Could not reproduce. I just ran this several times, and was able to retrieve all of the URLs in the productURLs list.

justnom commented 12 years ago

Okay - thank you for trying. I might have to code this using another framework.

nrabinowitz commented 12 years ago

Up to you :). Honestly, the weakness of pjscrape is that it depends on the stability of PhantomJS, which is still a work in progress. This sounds much more likely to be a PhantomJS issue.

justnom commented 12 years ago

I was thinking about modding your framework to work with a custom Chromium build, if that would be alright?

nrabinowitz commented 12 years ago

It's Github! Fork away. But it sounds like you're aiming to recreate PhantomJS by building on Chromium, which might be a heck of a project, especially if you want it to be headless.

justnom commented 12 years ago

Yeah, that's basically what I would be doing, but with a "preview" window so I can visually watch the scraping. I don't need pictures to load, which should cut down render time, but I will do a few speed tests first just to confirm it's going to be okay. I honestly don't think it would be that bad: just registering a Chromium extension with the current context and binding that back to some native code to run the external JS when rendering has finished. I say that rather optimistically, however!


pjgoncalves commented 11 years ago

Where is the URL array being executed? I've tried to put it inside pjs.addSuite, but that didn't work out. Any tips?