nrabinowitz / pjscrape

A web-scraping framework written in Javascript, using PhantomJS and jQuery
http://nrabinowitz.github.io/pjscrape/
MIT License

Scraping event called multiple times #28

Open dmarafetti opened 11 years ago

dmarafetti commented 11 years ago

Due to a bug in PhantomJS (issue 353; I've tested 1.7.0 on both Mac OS X and Debian 6), the page.open() callback is called multiple times for some URLs. This is related to iframes being created within the page (you can find more details in the open issue).

pjscrape.js (master branch)

line 680: // run the scrape
line 681: page.open(url, function(status) {

Below is an example of the log output when the scrape is invoked multiple times for the same page:

xxxxx@ip-xxxxxxxxx:~/crawler$ phantomjs   --web-security=no --load-images=no --ignore-ssl-errors=yes ./pjscrape-600e20a/pjscrape.js  ./bin/pjscrape-600e20a/pjscrape.js ./config.js

Using config file: src/main/resources/com/apicube/crawler/pjscrape/config.js
* Suite 0 starting
* Opening http://www.lemonde.fr/planete/article/2012/11/29/eolien-ou-gaz-de-schiste-le-debat-sur-l-energie-debute-aujourd-hui_1797825_3244.html
* Scraping http://www.lemonde.fr/planete/article/2012/11/29/eolien-ou-gaz-de-schiste-le-debat-sur-l-energie-debute-aujourd-hui_1797825_3244.html
* Scraping http://www.lemonde.fr/planete/article/2012/11/29/eolien-ou-gaz-de-schiste-le-debat-sur-l-energie-debute-aujourd-hui_1797825_3244.html
* Scraping http://www.lemonde.fr/planete/article/2012/11/29/eolien-ou-gaz-de-schiste-le-debat-sur-l-energie-debute-aujourd-hui_1797825_3244.html
* Scraping http://www.lemonde.fr/planete/article/2012/11/29/eolien-ou-gaz-de-schiste-le-debat-sur-l-energie-debute-aujourd-hui_1797825_3244.html
* Scraping http://www.lemonde.fr/planete/article/2012/11/29/eolien-ou-gaz-de-schiste-le-debat-sur-l-energie-debute-aujourd-hui_1797825_3244.html
* Scraping http://www.lemonde.fr/planete/article/2012/11/29/eolien-ou-gaz-de-schiste-le-debat-sur-l-energie-debute-aujourd-hui_1797825_3244.html
* Scraping http://www.lemonde.fr/planete/article/2012/11/29/eolien-ou-gaz-de-schiste-le-debat-sur-l-energie-debute-aujourd-hui_1797825_3244.html
* Suite 0 complete
* Writing 1 items
* Saved 1 items
* Suite 0 complete
* Writing 1 items
* Saved 1 items
* Suite 0 complete
* Writing 1 items
* Saved 1 items
* Suite 0 complete
* Writing 1 items
* Saved 1 items
* Suite 0 complete
* Writing 1 items
* Saved 1 items

In the meantime I've applied a workaround to stop the duplicate events. I use the visited array to check whether the page has already been visited, and return early if it has. I added this condition before line 700, as you can see below:

pjscrape.js (master branch) line 700

    // skip repeated load events for a page that was already scraped
    if (visited[url]) {
        log.msg('Page recalled: ' + url);
        return;
    }

    // mark as visited
    visited[url] = true;
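
To show the idea in isolation, here is a minimal sketch of the same guard that runs without PhantomJS. The helper name `oncePerUrl`, its `onLoad`/`onRepeat` parameters, and the example URL are hypothetical and not part of pjscrape; the repeated calls at the bottom only simulate the load callback firing several times for one page.

```javascript
// Hypothetical helper (not part of pjscrape): wraps a load callback so that
// it runs at most once per URL, mirroring the visited[url] check above.
function oncePerUrl(visited, onLoad, onRepeat) {
    return function (url, status) {
        if (visited[url]) {
            // repeated load event for a page we already handled: skip it
            if (onRepeat) {
                onRepeat(url);
            }
            return false;
        }
        // mark as visited before doing any work
        visited[url] = true;
        onLoad(url, status);
        return true;
    };
}

// Simulate the callback firing three times for the same page
var visited = {};
var scraped = [];
var recalled = [];
var guarded = oncePerUrl(
    visited,
    function (url) { scraped.push(url); },   // stands in for the real scrape
    function (url) { recalled.push(url); }   // stands in for log.msg('Page recalled: ...')
);

var url = 'http://example.com/article';      // placeholder URL
guarded(url, 'success');
guarded(url, 'success');
guarded(url, 'success');

console.log('scraped: ' + scraped.length + ', recalled: ' + recalled.length);
```

Marking the URL as visited before running the scrape matters here: if the flag were set only after the scrape finished, a second load event arriving mid-scrape would still get through.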

Hope this helps fix the bug. Diego