ruipgil / scraperjs

A complete and versatile web scraper.
MIT License
3.7k stars 188 forks

ECONNRESET issue in Windows 7 - More frequently occurring #34

Closed vdraceil closed 9 years ago

vdraceil commented 9 years ago

I have written some Node.js code with ScraperJS to scrape a website. It runs perfectly on OS X, but on Windows it throws an ECONNRESET error (at random times) almost every time I run it.

D:\Documents\_Wealth\_PROJECTS\Hampers in London\_Data\famousbirthdays\node_modules\scraperjs\src\ScraperPromise.js:37
                throw err;
                      ^
Error: read ECONNRESET
    at exports._errnoException (util.js:746:11)
    at TCP.onread (net.js:559:26)

I even tried wrapping my router.route(...) call in async.eachLimit(urls, 2, function() {..}). With that, I no longer get the ECONNRESET error, but the scraper exits quietly without completing the task. Again, this works perfectly on OS X. Any idea why this is happening? Ideally it should only have limited the number of iterations running in parallel, right?
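One possible explanation (an assumption, since the body of the iterator is not shown above): async.eachLimit hands the iterator a completion callback for each item, and it only starts the next item once that callback has been called. If the callback is never invoked, nothing beyond the first `limit` items is started, and Node exits once no work is pending. The sketch below illustrates this with `limitedEach`, a dependency-free stand-in written for this example (it is not the async library itself); the URL strings are placeholders, and in real code `done` would be called from the scraper's completion callback.

```javascript
// Stand-in for async.eachLimit(items, limit, iteratee, callback).
// Illustration only; the real library handles more edge cases.
function limitedEach(items, limit, iteratee, callback) {
  var index = 0;        // index of the next item to start
  var running = 0;      // iteratee calls still in flight
  var finished = false; // guards against calling `callback` twice

  function next() {
    if (finished) return;
    if (index >= items.length && running === 0) {
      finished = true;
      return callback(null);
    }
    while (running < limit && index < items.length) {
      running++;
      iteratee(items[index++], function done(err) {
        running--;
        if (finished) return;
        if (err) {
          finished = true;
          return callback(err);
        }
        next(); // a slot is free again: start the next item, or finish
      });
    }
  }
  next();
}

var urls = ['u1', 'u2', 'u3', 'u4', 'u5']; // placeholder URLs

// Broken: `done` is never called, so only the first 2 items are started.
var startedBroken = [];
limitedEach(urls, 2, function (url, done) {
  startedBroken.push(url);
}, function () { /* never reached */ });

// Fixed: `done` is called for every item, so all 5 are processed.
var startedFixed = [];
var finishedFixed = false;
limitedEach(urls, 2, function (url, done) {
  startedFixed.push(url);
  done(); // in real code, call this once the scrape of `url` has completed
}, function (err) {
  finishedFixed = !err;
});
```

If this is the cause, passing the completion callback through to the routing call (or calling it when the scrape promise settles) should let the remaining URLs run.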

ruipgil commented 9 years ago

This looks like an error unrelated to scraperjs; it seems to be a Node error. Check your Node version. I'll be updating to Node 0.12.x; in the meantime, try switching to Node 0.11.16.

vdraceil commented 9 years ago

I did check with Node 0.11.16 on Windows 7 - the same issue persists :(

vdraceil commented 9 years ago

The ECONNRESET error abruptly ends the Node process just because it isn't caught/handled.

So I figured out a quick workaround for this: I registered an error callback for the scraper (StaticScraper, in my case), and when the error is ECONNRESET, I re-route the current URL (because it hasn't been processed/scraped yet). This is not a fix, but it gets things working for me (at least for the time being).

var searchPageScraper = scraperjs.StaticScraper
  .create()
  .onError(function(err, utils) {
    if (err && err.code === 'ECONNRESET') {
      console.info('Re-routing URL:', utils.params.url);
      router.route(utils.params.url);
    }
  })
  .scrape(function($) {
    ...
  });

The one thing missing from this workaround is that, in the current code (the _fire method), 'utils' is not being passed to the error callback. 'utils' is needed to look up the current URL and possibly re-route it.
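As written, the workaround re-routes a URL every time ECONNRESET occurs, so a URL that keeps failing would be retried forever. A minimal sketch of capping the retries follows; `createRetryPolicy` and `shouldRetry` are hypothetical helper names made up for this example and are not part of scraperjs.

```javascript
// Hypothetical helper: decides whether a URL may be re-routed after an error.
function createRetryPolicy(maxRetries) {
  var attempts = Object.create(null); // url -> retries granted so far

  return function shouldRetry(err, url) {
    // Only connection resets are retried; other errors are left alone.
    if (!err || err.code !== 'ECONNRESET') return false;
    var used = attempts[url] || 0;
    if (used >= maxRetries) return false; // give up on this URL
    attempts[url] = used + 1;
    return true;
  };
}

var shouldRetry = createRetryPolicy(3);

// Inside the error callback from the workaround above (this assumes 'utils'
// is actually passed to the callback, which is the missing piece noted there):
//
//   if (shouldRetry(err, utils.params.url)) {
//     router.route(utils.params.url);
//   }
```

The cap of 3 is arbitrary; the point is only that a permanently failing URL eventually stops being re-routed.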