johntitus / node-horseman

Run PhantomJS from Node
MIT License
1.45k stars 124 forks source link

Horseman exiting when PhantomJS has an error #87

Open framerate opened 9 years ago

framerate commented 9 years ago

We're running a fairly complicated set of tests and the last few days for some reason they've been failing. Most of the time it prints the following error:

{ [HeadlessError: Error parsing JSON from phantom: SyntaxError: Unexpected token E
Data from phantom was: Error 403: Directory Listing Denied
Directory listing denied]
  name: 'HeadlessError',
  message: 'Error parsing JSON from phantom: SyntaxError: Unexpected token E\nData from phantom was: Error 403: Directory Listing Denied\nDirectory listing denied' }

... and then exits. But sometimes it just plain exits with no logging (odd).

I'm wondering if there is a specific reason for this behavior, or maybe an undocumented option that can cause horseman to behave differently when something goes wrong in phantom?

Our ideal here is to have our series of tests continue even if one has an issue (and we'd like to log the issue we're having with as much info as possible). But the major hangup is the script exiting.

(note: I'm aware this could be user error, but I'm not 100% familiar with the script or horseman yet so any help is appreciated).

awlayton commented 9 years ago

When you say horseman exits, do you mean the node process dies or that the horseman instance stops working but node is still running?

framerate commented 9 years ago

Entire process dies. Sometimes it's the error above and sometimes it's 100% silent. Just trying to figure out my best course of action to debug/try+catch/something. Any direction is appreciated!

awlayton commented 9 years ago

Have you tried running it with the environment variables DEBUG=horseman and BLUEBIRD_DEBUG=1? I would see what that gives you.

johntitus commented 9 years ago

Is there a specific page you're hitting that causes this? I wonder if the error is in horseman or node-phantom-simple.

framerate commented 9 years ago

Well, it's more complicated, I fear, @johntitus.

1) We're running on AWS, so it would seem many sites we scrape block AWS IPs by default.
2) If they don't block AWS by default, they surely block it as DDoS if we run it too many times an hour ;)

The script SEEMS to be exiting when scraping a URL similar to: https://www.airbnb.com/users/show/148984

I'll try and reproduce, and @ThePatShea might be able to chime in as well.

MaestroJurko commented 8 years ago

I am also getting this issue, is there a work around for it?

I think this is more of a node-phantom-simple issue. The PhantomJS instance is actually killed when this error happens.

MaestroJurko commented 8 years ago

https://github.com/baudehlo/node-phantom-simple/issues/109

awlayton commented 8 years ago

Any progress on this @framerate? I did some things in v3 that might help with this, but I'm not sure since I cannot reproduce this.

ohenepee commented 8 years ago

Testing v3... will give feedback soon

ohenepee commented 8 years ago

Okay... So I couldn't use the WIP v3 due to a few bugs, but I found a way to reproduce the bug.

Run the code for 20 mins and you should see it come up

const Horseman = require("node-horseman");

var browserIsReady = true;

function reachPage() {

  var horseman = new Horseman();
  console.log("Starting...");

  horseman
  .viewport(800, 600)
  .open("https://www.facebook.com/bbcnews")
  .status()
  .then(function (status) {
    if (Number(status) != 200) {
      console.log("Couldn't load page, trying again...");
      return horseman.close();
    }
  })
  .wait(1e4)
  .log("Stepping out...") // prints out the durl
  .wait(3e3)
  .finally(function() {
    horseman.close();
    setTimeout(reachPage, 3e3);
    return
  });

}

reachPage();


awlayton commented 8 years ago

Thanks for the script @ohenepee, I should have time to try it myself this weekend.

Also, if you want to make issues for what you found wrong with v3 that'd be appreciated (though not necessary).

awlayton commented 8 years ago

I've been running it for about a day now and it has yet to stop working for me...

ohenepee commented 8 years ago

Wow!... On Ubuntu or Windows?

awlayton commented 8 years ago

Ubuntu

awlayton commented 8 years ago

I just accidentally restarted the machine running it, but it hadn't had an error.

zhaorenjie commented 8 years ago

I met this kind of error too, randomly. I will set debug flag and capture to see what happens.

I think it is acceptable that phantomjs or node-phantom-simple sometimes hit an error. What concerns me is that there should be a way to catch it, skip the failing URL and go on. As it stands, we can't catch it: the whole node process dies, leaving me no chance to retry or keep track of where I've been.
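
The "catch it, skip the failing URL and go on" behavior described above can be sketched with plain promises. This is a minimal sketch, not horseman's API: `scrapeOne` and `scrapeAll` are hypothetical names, with `scrapeOne` standing in for whatever per-URL horseman chain you run. It only helps when the failure surfaces as a rejection of the promise you return; it cannot rescue a node process that is killed outright.

```javascript
// Hypothetical stand-in for a per-URL scrape; a real version would build
// a horseman chain and return it. Here it rejects for one URL on purpose.
function scrapeOne(url) {
  if (url.indexOf('bad') !== -1) {
    return Promise.reject(new Error('Failed to load ' + url));
  }
  return Promise.resolve('content of ' + url);
}

// Visit URLs one at a time. A failure is recorded and the loop moves on
// instead of letting the rejection go unhandled.
function scrapeAll(urls, scrape) {
  var results = [];
  return urls.reduce(function (previous, url) {
    return previous
      .then(function () { return scrape(url); })
      .then(function (data) {
        results.push({ url: url, ok: true, data: data });
      })
      .catch(function (err) {
        results.push({ url: url, ok: false, error: err.message });
      });
  }, Promise.resolve()).then(function () { return results; });
}

scrapeAll(['http://a.example', 'http://bad.example', 'http://c.example'], scrapeOne)
  .then(function (results) {
    results.forEach(function (r) {
      console.log(r.url, r.ok ? 'ok' : 'failed: ' + r.error);
    });
  });
```

Attaching the `.catch` per URL, rather than once at the very end, is what lets later URLs run after an earlier one fails.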

MaestroJurko commented 8 years ago

A fix is required in the request_queue function of node-phantom-simple.js: add 403 to the error codes.

zhaorenjie commented 8 years ago

@mato75, you mean this?

433 var req = http.request(http_opts, function (res) {
434   var err = res.statusCode === 500 ? true : false;
435   var data = '';

Should I add a line after line 434?

err = res.statusCode === 403 ? true : false;

MaestroJurko commented 8 years ago

var errorsCodes = [403, 500];
      var req = http.request(http_opts, function (res) {
        var err = errorsCodes.indexOf(res.statusCode) !== -1;
        var data = '';

        res.setEncoding('utf8');

        res.on('data', function (chunk) {
          data += chunk;
        });

        res.on('end', function () {
          phantom.POSTING = false;

          if (!data) {
            // If method is exit - response may be empty, because server could be stopped while sending
            if (method === 'exit') {
              next();
              callback();
              return;
            }

            next();
            callback(new HeadlessError('No response body for page.' + method + '()'));
            return;
          }

          var results;

          try {
            results = JSON.parse(data).data;
          } catch (error) {
            // If method is exit - response may be broken, because server could be stopped while sending
            if (method === 'exit') {
              next();
              callback();
              return;
            }

            next();
            callback(new HeadlessError('JSON parse on req with data: ' + data + ', error: ' + error + ', method: ' + method));
            return;
          }

          if (err) {
            next();
            callback(results);
            return;
          }

          if (method === 'createPage') {
            var id = results.page_id;
            var page = setup_new_page(id);

            next();
            callback(null, page);
            return;
          }

          // Not createPage - just run the callback
          next();
          callback(null, results);
        });
      });
zhaorenjie commented 8 years ago

Thanks @mato75, I will try it.

@awlayton I met this problem again: the node process exits silently without any ... notice.

The last log I got:

Start : http://www.amazon.com/gp/product/B00DQYX9ZK amazon.com 1863
horseman .setup() creating phantom instance on 13155 +7ms
horseman setting viewport() to width 1280 height 2048 +2ms
horseman phantom created. +110ms
horseman page created +8ms
horseman .open http://www.amazon.com/gp/product/B00DQYX9ZK +4ms

I found that if a page is too slow to download, horseman will exit after 30 seconds, which seems to be a timeout failure. I can reproduce this error every time.

horseman .setup() creating phantom instance on 12406 +0ms
horseman setting viewport() to width 1280 height 2048 +11ms
horseman phantom created. +114ms
horseman page created +24ms
horseman .open http://item.jd.hk/1951236442.html +9ms
horseman .open: http://item.jd.hk/1951236442.html - status: undefined +30s
Unhandled rejection HeadlessError: Error parsing JSON from phantom: SyntaxError: Unexpected token E
Data from phantom was: Error 403: Directory Listing Denied
Directory listing denied
    at IncomingMessage.<anonymous> (/home/work/node_modules/node-horseman/node_modules/node-phantom-simple/node-phantom-simple.js:614:14)
    at emitNone (events.js:72:20)
    at IncomingMessage.emit (events.js:166:7)
    at endReadableNT (_stream_readable.js:893:12)
    at doNTCallback2 (node.js:430:9)
    at process._tickCallback (node.js:344:17)

awlayton commented 8 years ago

As for the 403, I have created an express server that responds with it. However I am still unable to reproduce the error @mato75 and @zhaorenjie (node does not exit and nothing is printed about an unhandled rejection).

Also @zhaorenjie, are you describing two different errors with the two logs you posted? I am not entirely clear on what you are trying to report.

zhaorenjie commented 8 years ago

@awlayton They are two different errors, I believe, both connected to this topic, "Horseman exiting when PhantomJS has an error". One is that PhantomJS always seems to exit silently after 30 seconds of downloading; then horseman exits and node exits with no further error or debug info reported. The other is that, after applying @mato75's mod to node-phantom-simple, I still get 403 errors and exits, but less often.

awlayton commented 8 years ago

Could you post the output of npm ls @zhaorenjie? I am still unable to reproduce the 403 error, and I am trying to figure out what is different. Thanks.

MaestroJurko commented 8 years ago

The 30s timeout is the watchdog_clear timeout in bridge.js (node-phantom-simple). You need to comment out process.exit(0);

It is used so that when a page doesn't get any response for 30s, the phantomjs instance is closed. This is not OK when you are running multiple pages in one phantomjs instance.

awlayton commented 8 years ago

Perhaps make a request to node-phantom-simple to allow changing the watchdog timeout @mato75? That is not something to be fixed from horseman unless such an option exists.

MaestroJurko commented 8 years ago

https://github.com/baudehlo/node-phantom-simple/issues/124

awlayton commented 8 years ago

Can anyone verify if the original issue is still present in v3, and if so please post code to reproduce it? I have yet to be able to produce this bug myself.

stephen304 commented 8 years ago

I'm having a similar issue: I am scraping URLs and I am trying to make it gracefully handle URLs that time out. Even after wrapping the code with a try/catch and adding an on error handler, the code still grinds to a halt fatally:

var Horseman = require('node-horseman');
var horseman = new Horseman();

try {
    horseman.on('error', function() {
        console.log('error event triggered');
    });

    horseman
        .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko/20100101 Firefox/27.0")
        .open('http://1.1.1.1:1234')
        .then(function() {
            console.log('URL Loaded');
        })
        .finally(function() {
            horseman.close();
        });
} catch (e) {
    console.log("Caught error");
}

Unhandled rejection Error: Failed to GET url: http://1.1.1.1:1234
    at checkStatus (/home/stephen/Downloads/communistcast/node_modules/node-horseman/lib/actions.js:78:16)
    at PassThroughHandlerContext.finallyHandler (/home/stephen/Downloads/communistcast/node_modules/bluebird/js/release/finally.js:56:23)
    at PassThroughHandlerContext.tryCatcher (/home/stephen/Downloads/communistcast/node_modules/bluebird/js/release/util.js:16:23)
    at Promise._settlePromiseFromHandler (/home/stephen/Downloads/communistcast/node_modules/bluebird/js/release/promise.js:502:31)
    at Promise._settlePromise (/home/stephen/Downloads/communistcast/node_modules/bluebird/js/release/promise.js:559:18)
    at Promise._settlePromise0 (/home/stephen/Downloads/communistcast/node_modules/bluebird/js/release/promise.js:604:10)
    at Promise._settlePromises (/home/stephen/Downloads/communistcast/node_modules/bluebird/js/release/promise.js:683:18)
    at Promise._fulfill (/home/stephen/Downloads/communistcast/node_modules/bluebird/js/release/promise.js:628:18)
    at /home/stephen/Downloads/communistcast/node_modules/bluebird/js/release/nodeback.js:42:21
    at /home/stephen/Downloads/communistcast/node_modules/node-phantom-simple/node-phantom-simple.js:60:18
    at IncomingMessage.<anonymous> (/home/stephen/Downloads/communistcast/node_modules/node-phantom-simple/node-phantom-simple.js:645:9)
    at emitNone (events.js:72:20)
    at IncomingMessage.emit (events.js:166:7)
    at endReadableNT (_stream_readable.js:905:12)
    at nextTickCallbackWith2Args (node.js:441:9)
    at process._tickCallback (node.js:355:17)

Unhandled rejection Error: Failed to load url
    at checkStatus (/home/stephen/Downloads/communistcast/node_modules/node-horseman/lib/index.js:276:16)
    at tryCatcher (/home/stephen/Downloads/communistcast/node_modules/bluebird/js/release/util.js:16:23)
    at Function.Promise.attempt.Promise.try (/home/stephen/Downloads/communistcast/node_modules/bluebird/js/release/method.js:39:29)
    at Object.loadFinishedSetup [as onLoadFinished] (/home/stephen/Downloads/communistcast/node_modules/node-horseman/lib/index.js:274:43)
    at /home/stephen/Downloads/communistcast/node_modules/node-phantom-simple/node-phantom-simple.js:636:30
    at Array.forEach (native)
    at IncomingMessage.<anonymous> (/home/stephen/Downloads/communistcast/node_modules/node-phantom-simple/node-phantom-simple.js:617:17)
    at emitNone (events.js:72:20)
    at IncomingMessage.emit (events.js:166:7)
    at endReadableNT (_stream_readable.js:905:12)
    at nextTickCallbackWith2Args (node.js:441:9)
    at process._tickCallback (node.js:355:17)
awlayton commented 8 years ago

A try/catch won't handle rejections @stephen304; it only handles things being thrown. Did you try .catch on the Promise chain? That is how you handle rejections.
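
To illustrate the difference with plain promises (a sketch only; `failingOpen` is a hypothetical stand-in for a horseman action that rejects, not part of horseman):

```javascript
// Hypothetical stand-in for an action that fails asynchronously.
function failingOpen(url) {
  return new Promise(function (resolve, reject) {
    setTimeout(function () {
      reject(new Error('Failed to GET url: ' + url));
    }, 10);
  });
}

var caughtBy = [];

try {
  // The call returns a promise immediately; nothing is thrown here,
  // so the surrounding catch block never runs for the rejection.
  var chain = failingOpen('http://1.1.1.1:1234')
    .then(function () {
      console.log('URL Loaded');
    })
    .catch(function (err) {
      // The rejection arrives here, after the try block has already exited.
      caughtBy.push('.catch');
      console.log('Handled rejection:', err.message);
    });
} catch (e) {
  caughtBy.push('try/catch');
}
```

Only the `.catch` handler records the failure; the `try/catch` has finished long before the promise rejects.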

stephen304 commented 8 years ago

.catch can handle the first error, but it still exits with the second error, maybe I'm missing an extra place I need to .catch?

awlayton commented 8 years ago

It's possible that a promise inside horseman somewhere is having an unhandled rejection. I can't think of one, but it's possible. If that is the case you can't .catch it as a user of horseman.

Also, your stack traces are not too useful. You need to have BLUEBIRD_DEBUG=1 for them to be very helpful @stephen304.
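
If a rejection really does go unhandled somewhere you cannot reach, one option for at least logging it is Node's process-level `unhandledRejection` event (Bluebird documents emitting the same event under Node). A minimal sketch; `describeRejection` is a hypothetical helper name, and whether the process keeps running afterwards depends on your Node and Bluebird versions:

```javascript
// Hypothetical helper: turn any rejection reason into a loggable string.
function describeRejection(reason) {
  if (reason instanceof Error) {
    return reason.stack || (reason.name + ': ' + reason.message);
  }
  return String(reason);
}

// Last-resort logger for rejections nobody attached a .catch to.
process.on('unhandledRejection', function (reason) {
  console.error('Unhandled rejection:', describeRejection(reason));
});
```

This only reports the problem; it does not replace a `.catch` on the chain where you can actually recover.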

ohenepee commented 8 years ago

@awlayton What are the specs of your test/development machine? This question might be too invasive, but I realized that the PhantomJS executable actually crashes as a result of the CPU maxing out (100%) upon a random launch or resource-intensive processing. I can confidently say from experience that no one should be getting this error on a quad-core with 4GB+ RAM. @awlayton you can create a virtual instance to test this: try a single core with 1-2GB RAM, then gradually increase the number of cores and the RAM size after some minutes of tests using the script I posted some time ago.

awlayton commented 8 years ago

I was running it on the machine at my desk which I recently rebuilt @ohenepee , so it's pretty powerful:

6 core i7 CPU
64 GB ram

So if the issue happens when running out of processor power, I guess that won't happen on that computer. I have some deadlines coming up at work, but at some point in the future I should be able to test in a VM or something.

gazaret commented 8 years ago

@stephen304 just place .catch in node-horseman/lib/index.js:273

awlayton commented 8 years ago

To what version are you referring @Gazaret? In the current version, line 273 is this:

        function loadFinishedSetup(status) {

gazaret commented 8 years ago

@awlayton In this function, there is a reject on line 280. A .catch needs to be placed after that reject.

awlayton commented 8 years ago

That would defeat the purpose of the .reject() @Gazaret. Inside that if is an error condition. A .catch at the end of the chain of horseman actions should catch that reject.
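
As a sketch of that last point using plain promises (the names here are hypothetical stand-ins, not horseman internals): a rejection raised in an early step skips the remaining `.then` handlers and lands in the `.catch` at the end of the chain, and `.finally` still runs, so cleanup such as `horseman.close()` can go there.

```javascript
var steps = [];

// Hypothetical stand-in for a load step that rejects on a bad status.
function loadFinished(status) {
  if (status !== 'success') {
    return Promise.reject(new Error('Failed to load url'));
  }
  return Promise.resolve(status);
}

var run = loadFinished('fail')
  .then(function () {
    steps.push('then'); // skipped: the chain is already rejected
  })
  .catch(function (err) {
    steps.push('catch: ' + err.message);
  })
  .finally(function () {
    steps.push('finally'); // runs either way; a place for cleanup
  });
```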