Open framerate opened 8 years ago
When you say horseman exits, do you mean the node process dies or that the horseman instance stops working but node is still running?
Entire process dies. Sometimes it's the error above and sometimes it's 100% silent. Just trying to figure out my best course of action to debug/try+catch/something. Any direction is appreciated!
Have you tried running it with the environment variables DEBUG=horseman and BLUEBIRD_DEBUG=1? I would see what that gives you.
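If the scraper is launched from another Node script, the two variables can be passed to the child process. This is only a minimal sketch: the helper names withDebugEnv and runWithDebug, and the file name scraper.js, are assumptions for illustration. Only the two variable names come from the comment above.

```javascript
const { spawn } = require('child_process');

// Hypothetical helper: copy an environment and switch on both debug flags.
function withDebugEnv(baseEnv) {
  return Object.assign({}, baseEnv, {
    DEBUG: 'horseman',   // namespace seen in horseman's debug output
    BLUEBIRD_DEBUG: '1'  // asks Bluebird for long stack traces and warnings
  });
}

// Hypothetical launcher: runs a script with the current Node binary and
// forwards its output to this terminal.
function runWithDebug(scriptPath) {
  return spawn(process.execPath, [scriptPath], {
    env: withDebugEnv(process.env),
    stdio: 'inherit'
  });
}
```

Calling runWithDebug('scraper.js') would then start the script with both flags set, which should be equivalent to prefixing the command line with the two variables.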
Is there a specific page you're hitting that causes this? I wonder if the error is in horseman or node-phantom-simple.
Well, it's more complicated, I fear, @johntitus.
1) We're running on AWS, and many of the sites we scrape seem to block AWS IPs by default.
2) If they don't block AWS by default, they surely block us as a DDoS if we hit them too many times an hour ;)
The script SEEMS to be exiting when scraping a URL similar to: https://www.airbnb.com/users/show/148984
I'll try and reproduce, and @ThePatShea might be able to chime in as well.
I am also getting this issue, is there a work around for it?
I think this is more of a node-phantom-simple issue. The PhantomJS instance is actually killed when this error happens.
Any progress on this @framerate? I did some things in v3 that might help with this, but I'm not sure since I cannot reproduce this.
Testing v3... will give feedback soon
Okay... So I couldn't use the WIP v3 due to a few bugs, but I found a way to reproduce the bug.
Run the code for 20 mins and you should see it come up
const Horseman = require("node-horseman");
var browserIsReady = true;

function reachPage() {
  var horseman = new Horseman();
  console.log("Starting...");
  horseman
    .viewport(800, 600)
    .open("https://www.facebook.com/bbcnews")
    .status()
    .then(function (status) {
      if (Number(status) != 200) {
        console.log("Couldn't load page, trying again...");
        return horseman.close();
      }
    })
    .wait(1e4)
    .log("Stepping out...") // prints the message to the console
    .wait(3e3)
    .finally(function() {
      horseman.close();
      setTimeout(reachPage, 3e3);
      return
    });
}
reachPage();
Thanks for the script @ohenepee, I should have time to try it myself this weekend.
Also, if you want to make issues for what you found wrong with v3 that'd be appreciated (though not necessary).
I've been running it for about a day now and it has yet to stop working for me...
Wow!... On Ubuntu or Windows?
Ubuntu
I just accidentally restarted the machine running it, but it hadn't had an error.
I met this kind of error too, randomly. I will set debug flag and capture to see what happens.
I think it is acceptable that phantomjs or node-phantom-simple sometimes hits an error. My concern is that there should be a way to catch it, skip the failing URL and go on. As things stand, we can't catch it: the whole node process dies, leaving me no chance to retry or to keep track of where I've been.
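The "catch it, skip the failing URL and go on" behaviour could look roughly like the sketch below. It uses plain promises, and scrape is a hypothetical stand-in for a horseman-based function (it is not part of horseman). It only helps with failures that surface as promise rejections. It cannot keep node alive if the process is terminated from elsewhere.

```javascript
// Hypothetical stand-in for a horseman-based scrape. It rejects for any URL
// containing "bad" so that the failure path can be exercised.
function scrape(url) {
  return url.indexOf('bad') !== -1
    ? Promise.reject(new Error('Failed to load ' + url))
    : Promise.resolve('content of ' + url);
}

// Visit the URLs one at a time. A rejection is recorded and the chain moves
// on, so the caller always knows which URLs succeeded and which failed.
function scrapeAll(urls, scrapeFn) {
  var results = [];
  return urls.reduce(function (chain, url) {
    return chain
      .then(function () { return scrapeFn(url); })
      .then(function (data) {
        results.push({ url: url, ok: true, data: data });
      })
      .catch(function (err) {
        results.push({ url: url, ok: false, error: err.message });
      });
  }, Promise.resolve()).then(function () { return results; });
}
```

Because every URL gets its own .catch, one failed page does not reject the whole chain, and the results array doubles as a record of progress for retries.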
A fix is required in node-phantom-simple.js, in the request_queue function: add 403 to the error codes.
@mato75, you mean this?
433 var req = http.request(http_opts, function (res) {
434 var err = res.statusCode === 500 ? true : false;
435 var data = '';
Should I add a line after line 434? err = res.statusCode === 403 ? true : false;
var errorsCodes = [403, 500];
var req = http.request(http_opts, function (res) {
  var err = errorsCodes.indexOf(res.statusCode) !== -1;
  var data = '';
  res.setEncoding('utf8');
  res.on('data', function (chunk) {
    data += chunk;
  });
  res.on('end', function () {
    phantom.POSTING = false;
    if (!data) {
      // If method is exit - response may be empty, because server could be stopped while sending
      if (method === 'exit') {
        next();
        callback();
        return;
      }
      next();
      callback(new HeadlessError('No response body for page.' + method + '()'));
      return;
    }
    var results;
    try {
      results = JSON.parse(data).data;
    } catch (error) {
      // If method is exit - response may be broken, because server could be stopped while sending
      if (method === 'exit') {
        next();
        callback();
        return;
      }
      next();
      callback(new HeadlessError('JSON parse on req with data: ' + data + ', error: ' + error + ', method: ' + method));
      return;
    }
    if (err) {
      next();
      callback(results);
      return;
    }
    if (method === 'createPage') {
      var id = results.page_id;
      var page = setup_new_page(id);
      next();
      callback(null, page);
      return;
    }
    // Not createPage - just run the callback
    next();
    callback(null, results);
  });
});
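The snippet above uses a list of codes rather than a second assignment after line 434, and a small standalone comparison shows why that matters. The function names here are hypothetical and exist only for this illustration. They are not part of node-phantom-simple.

```javascript
// Status codes treated as errors, as in the snippet above.
var errorsCodes = [403, 500];

// Membership test: true for any listed code.
function isErrorStatus(statusCode) {
  return errorsCodes.indexOf(statusCode) !== -1;
}

// The two-assignment variant: the second line overwrites the first,
// so a 500 is no longer flagged as an error.
function isErrorStatusOverwritten(statusCode) {
  var err = statusCode === 500 ? true : false;
  err = statusCode === 403 ? true : false;
  return err;
}
```

With the list, both 403 and 500 count as errors. With the extra assignment, only 403 does, because the result for 500 is discarded.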
Thanks @mato75, I will try it. @awlayton I hit this problem again: the node process exits silently without any ... notice.
The last log I got:
Start : http://www.amazon.com/gp/product/B00DQYX9ZK amazon.com 1863
horseman .setup() creating phantom instance on 13155 +7ms
horseman setting viewport() to width 1280 height 2048 +2ms
horseman phantom created. +110ms
horseman page created +8ms
horseman .open http://www.amazon.com/gp/product/B00DQYX9ZK +4ms
I found that if a page is too slow to download, horseman exits after 30 seconds, which looks like a timeout failure. I can reproduce this error every time.
horseman .setup() creating phantom instance on 12406 +0ms
horseman setting viewport() to width 1280 height 2048 +11ms
horseman phantom created. +114ms
horseman page created +24ms
horseman .open http://item.jd.hk/1951236442.html +9ms
horseman .open: http://item.jd.hk/1951236442.html - status: undefined +30s
Unhandled rejection HeadlessError: Error parsing JSON from phantom: SyntaxError: Unexpected token E
Data from phantom was: Error 403: Directory Listing Denied Directory listing denied
at IncomingMessage.<anonymous> (/home/work/node_modules/node-horseman/node_modules/node-phantom-simple/node-phantom-simple.js:614:14)
at emitNone (events.js:72:20)
at IncomingMessage.emit (events.js:166:7)
at endReadableNT (_stream_readable.js:893:12)
at doNTCallback2 (node.js:430:9)
at process._tickCallback (node.js:344:17)
As for the 403, I have created an express server that responds with it. However I am still unable to reproduce the error @mato75 and @zhaorenjie (node does not exit and nothing is printed about an unhandled rejection).
Also @zhaorenjie, are you describing two different errors with the two logs you posted? I am not entirely clear on what you are trying to report.
@awlayton They are two different errors, I believe, both connected to this topic "HM exit when PhantomJS error". One is that PhantomJS seems to always exit silently after 30 seconds of downloading; then HM exits and node exits with no further error or debug info reported. The other is that, after applying @mato75's mod to node-phantom-simple, I still get 403 errors and exits, but less often.
Could you post the output of npm ls @zhaorenjie? I am still unable to reproduce the 403 error, and I am trying to figure out what is different. Thanks.
The 30s timeout is a watchdog_clear timeout in bridge.js (node-phantom-simple). You need to comment out process.exit(0);
It is used so that when a page doesn't get any responses for 30s, the phantomjs instance is closed. This is not OK when you are running multiple pages in one phantomjs instance.
Perhaps make a request to node-phantom-simple to allow changing the watchdog timeout @mato75? That is not something that can be fixed from horseman unless such an option exists.
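For readers unfamiliar with the pattern, a watchdog of this kind can be sketched as follows. This is not the code in bridge.js: createWatchdog and its parameters are hypothetical, and the sketch only illustrates how a configurable timeout and an expiry callback could replace a hard-coded exit.

```javascript
// Hypothetical watchdog: calls onExpire if reset() is not called again
// within timeoutMs. Taking a callback, rather than exiting directly,
// leaves the decision about what to do on expiry to the caller.
function createWatchdog(timeoutMs, onExpire) {
  var timer = null;

  // Restart the countdown; call this whenever activity is seen.
  function reset() {
    clearTimeout(timer);
    timer = setTimeout(onExpire, timeoutMs);
  }

  // Cancel the countdown entirely.
  function stop() {
    clearTimeout(timer);
    timer = null;
  }

  reset();
  return { reset: reset, stop: stop };
}
```

A caller would invoke reset() on every response and pick a timeoutMs that suits slow pages, instead of being bound to a fixed 30 seconds.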
Can anyone verify if the original issue is still present in v3, and if so please post code to reproduce it? I have yet to be able to produce this bug myself.
I'm having a similar issue: I am scraping URLs and trying to gracefully handle URLs that time out. Even after wrapping the code in a try/catch and adding an 'error' event handler, the code still grinds to a halt fatally:
var Horseman = require('node-horseman');
var horseman = new Horseman();

try {
  horseman.on('error', function() {
    console.log('error event triggered');
  })
  horseman
    .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko/20100101 Firefox/27.0")
    .open('http://1.1.1.1:1234')
    .then(function() {
      console.log('URL Loaded');
    })
    .finally(function(){
      horseman.close();
    });
} catch (e) {
  console.log("Caught error");
}
Unhandled rejection Error: Failed to GET url: http://1.1.1.1:1234
at checkStatus (/home/stephen/Downloads/communistcast/node_modules/node-horseman/lib/actions.js:78:16)
at PassThroughHandlerContext.finallyHandler (/home/stephen/Downloads/communistcast/node_modules/bluebird/js/release/finally.js:56:23)
at PassThroughHandlerContext.tryCatcher (/home/stephen/Downloads/communistcast/node_modules/bluebird/js/release/util.js:16:23)
at Promise._settlePromiseFromHandler (/home/stephen/Downloads/communistcast/node_modules/bluebird/js/release/promise.js:502:31)
at Promise._settlePromise (/home/stephen/Downloads/communistcast/node_modules/bluebird/js/release/promise.js:559:18)
at Promise._settlePromise0 (/home/stephen/Downloads/communistcast/node_modules/bluebird/js/release/promise.js:604:10)
at Promise._settlePromises (/home/stephen/Downloads/communistcast/node_modules/bluebird/js/release/promise.js:683:18)
at Promise._fulfill (/home/stephen/Downloads/communistcast/node_modules/bluebird/js/release/promise.js:628:18)
at /home/stephen/Downloads/communistcast/node_modules/bluebird/js/release/nodeback.js:42:21
at /home/stephen/Downloads/communistcast/node_modules/node-phantom-simple/node-phantom-simple.js:60:18
at IncomingMessage.<anonymous> (/home/stephen/Downloads/communistcast/node_modules/node-phantom-simple/node-phantom-simple.js:645:9)
at emitNone (events.js:72:20)
at IncomingMessage.emit (events.js:166:7)
at endReadableNT (_stream_readable.js:905:12)
at nextTickCallbackWith2Args (node.js:441:9)
at process._tickCallback (node.js:355:17)
Unhandled rejection Error: Failed to load url
at checkStatus (/home/stephen/Downloads/communistcast/node_modules/node-horseman/lib/index.js:276:16)
at tryCatcher (/home/stephen/Downloads/communistcast/node_modules/bluebird/js/release/util.js:16:23)
at Function.Promise.attempt.Promise.try (/home/stephen/Downloads/communistcast/node_modules/bluebird/js/release/method.js:39:29)
at Object.loadFinishedSetup [as onLoadFinished] (/home/stephen/Downloads/communistcast/node_modules/node-horseman/lib/index.js:274:43)
at /home/stephen/Downloads/communistcast/node_modules/node-phantom-simple/node-phantom-simple.js:636:30
at Array.forEach (native)
at IncomingMessage.<anonymous> (/home/stephen/Downloads/communistcast/node_modules/node-phantom-simple/node-phantom-simple.js:617:17)
at emitNone (events.js:72:20)
at IncomingMessage.emit (events.js:166:7)
at endReadableNT (_stream_readable.js:905:12)
at nextTickCallbackWith2Args (node.js:441:9)
at process._tickCallback (node.js:355:17)
A try/catch won't handle rejections @stephen304; it only handles things being thrown. Did you try .catch on the Promise chain? That is how you handle rejections.
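The difference can be shown with plain promises and no horseman at all. In this sketch, failLater is a hypothetical stand-in for any asynchronous step that fails after the surrounding code has already returned.

```javascript
// Hypothetical stand-in for an asynchronous step that fails later,
// such as a page load that times out.
function failLater() {
  return new Promise(function (resolve, reject) {
    setTimeout(function () {
      reject(new Error('Failed to load url'));
    }, 10);
  });
}

var caughtBy = [];

try {
  failLater().catch(function (err) {
    // The rejection arrives here, after the try block has already finished.
    caughtBy.push('.catch: ' + err.message);
  });
} catch (e) {
  // Never reached: nothing is thrown synchronously inside the try block.
  caughtBy.push('try/catch');
}
```

Only the .catch handler runs. By the time the promise rejects, the try block has long since exited, so its catch clause has nothing to intercept.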
.catch can handle the first error, but it still exits with the second error. Maybe I'm missing another place where I need to .catch?
It's possible that a promise somewhere inside horseman is having an unhandled rejection. I can't think of one, but it's possible. If that is the case, you can't .catch it as a user of horseman.
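One way to at least observe such a rejection from the outside is Node's process-level 'unhandledRejection' event, which native promises emit and which Bluebird documents firing as well. The sketch below only logs. The names unhandled and onUnhandledRejection are illustrative, and this does nothing to keep PhantomJS itself alive.

```javascript
// Collect and log rejections that no handler picked up.
var unhandled = [];

function onUnhandledRejection(reason) {
  unhandled.push(reason);
  // Prefer the stack trace when the reason is an Error.
  console.error('Unhandled rejection:', reason && reason.stack ? reason.stack : reason);
}

process.on('unhandledRejection', onUnhandledRejection);
```

With this in place, a rejection that escapes every .catch is recorded with its stack before anything else happens, which may help narrow down where it originates.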
Also, your stack traces are not too useful. You need to have BLUEBIRD_DEBUG=1 for them to be very helpful @stephen304.
@awlayton What are the specs of your test/development machine? This question might be too invasive, but I realized that the PhantomJS executable actually crashes as a result of the CPU maxing out (100%) upon a random launch or resource-intensive processing. I can confidently say from experience that no one should be getting this error on a quad-core with 4GB+ RAM. @awlayton you can create a virtual instance to test this: try a single core with 1-2GB RAM, then gradually increase the number of cores and the RAM size after some minutes of tests using the script I posted some time ago.
I was running it on the machine at my desk which I recently rebuilt @ohenepee , so it's pretty powerful:
6 core i7 CPU
64 GB ram
So if the issue happens when running out of processor power, I guess it won't happen on that computer. I have some deadlines coming up at work, but at some point in the future I should be able to test in a VM or something.
@stephen304 just place .catch in node-horseman/lib/index.js:273
To what version are you referring @Gazaret? In the current version, line 273 is this:
function loadFinishedSetup(status) {
@awlayton In this function, there is a reject on line 280. A .catch needs to be placed after that reject.
That would defeat the purpose of the .reject() @Gazaret. Inside that if is an error condition. A .catch at the end of the chain of horseman actions should catch that reject.
We're running a fairly complicated set of tests and the last few days for some reason they've been failing. Most of the time it prints the following error:
... and then exits. But sometimes it just plain exits with no logging (odd).
I'm wondering if there is a specific reason for this behavior, or maybe an undocumented option that can cause horseman to behave differently when something goes wrong in phantom?
Our ideal here is to have our series of tests continue even if one has an issue (and we'd like to log the issue we're having with as much info as possible). But the major hangup is the script exiting.
(note: I'm aware this could be user error, but I'm not 100% familiar with the script or horseman yet so any help is appreciated).