Closed newbreach closed 6 years ago
This is interesting. I have an app with some code looking like this:
function doSomething() {
promiseReturningFunction().then((data) => {
//do some
}).catch((error) => {
//log error
}).then(() => {
setTimeout(doSomething, 5000);
});
}
This is just meant to run these code blocks forever. Then suddenly a few installations stopped working recently. The app still responds to api requests and new timers that are set during for example an api call is executed.
However, the doSomething functions are no longer being exectuted. It is as if the event queue at some point just was emptied or something.
My installations that are having this problem are running node 10.4.1 or 10.6.0 depending on when they were upgraded.
A bug in the core nodejs is not my first thought but I cant find a logical explanation for my app to suddenly drop 5-6 timer constructs like this that has worked for a few years before....
@nodejs/timers
For what it's worth you are not supposed to rely on timers staying running for a month. That said, I haven't run into this sort of thing before - even on Node servers with an uptime of more than a month.
@tkarls
The app still responds to api requests and new timers that are set during for example an api call is executed.
It is very likely that promiseReturningFunction
never fulfilled or that you are returning a promise that was pending forever from either the catch
or the then
below it - or the then
below is never executed.
Also note that is not a great way to write Node.js code (and this is core's fault IMO). Are you using async_hooks and some form of instrumentation? Can you check the logs for unhandledRejection
s?
@benjamingr yea that would of course be the obvious conclusion and what I initially thought until I realized it wall all timers at once. I just can't figure out how that problem suddenly occurred on 6 different places in the code.
The "background tasks" are not depending on each other and hasn't been changed for a loong time. Had it been one timer that stopped that's one thing but all at the same time? This was also on 7 different servers running the same app in different locations and with a similar up-time. Like 3-4 weeks or so until the issue occurred.
Perhaps there are a better construct to keep some "background tasks" executing but I thought this was simple enough to do its job.
Not familiar with the async_hooks and logs are not saved (which is a flaw...) :(
@tkarls to be clear, I am not accusing you of writing code problematically - I am saying Node should guide users to safer ways to write code and make them "obvious".
If there is a way to get the "unhandledRejection"
warnings then there is a very high chance you'll see it there. I've seen this before several times:
I'm getting "off" timer land and into "promise debugging land" here - just saying this is entirely on our radar and something we're thinking and talking about improving. Also it's possible your problem was entirely different and we do have a bug in setInterval
- so take everything I say (guess) with a grain of salt.
Perhaps there are a better construct to keep some "background tasks" executing but I thought this was simple enough to do its job.
Typically, not on the server and using something that does scheduling - either what your cloud provides, cron
s if you're on a VM, k8s/cron jobs etc. I also warmly recommend turning on warnings (in production) for safety and using an APM (application performance monitoring) tool and setting up alerts for uptime and certain tasks.
Again, I strongly believe this is an issue with Node's user experience (not your or most people's code) and I feel strongly that Node (well, us) should fix it.
As tkarls says, I have a similar situation; @benjamingr thank you for your advice on coding, but I can't use cron because I use doSomething to trigger about 400-500 milliseconds.
@newbreach in a similar situation - what sort of code do you have in the setTimeout?
@benjamingr no problem, I didn't feel particularly 'accused' because it it was an easy fix I could do to my code I want to know about it so I can fix the problem :D
Thanks for your advises. Using external schedule trigging had not even occurred to me! Today the node process is kept alive using a systemd service that limits memory and tasks and restarts the node process if needed while ensuring always up. I'm sure there are more advanced monitors.
But back to the issue at hand. I hear what you are saying about unhandled rejections but I do have that catch before the then that sets the timer. So any error or exception should be caught I think (I have double checked that all promises are returned and not just created locally too).
One of my "workers" query an api over the network. Another checks a local database and talks to a local service. A third one is querying the NTP status on the server etc. Most of the functions are rather small and easy to overview. So I find it improbable (but I suppose not impossible) that all these functions have a bug in them that manifest at the same time. This happened on several servers during the summer (running 10.4.1-10.6.0) but not on any of our 60+ servers running <= 9.8.0. Granted our sw is also newer on the newer installations but a few of the "workers" hasn't been changed in well over a year.
The only thing that doesn't indicate a bug/change in setTimeout/setInterval to me is probably that there isn't 500 people reporting it already due to node being so popular. Which I can agree is a pretty strong indication too...
I do appreciate you insights and guesses though. That is all I'm doing right now too :)
@benjamingr
My code logic is probably like this
accept the HTTP request and send the message to the TCP client.
the request callback method is added to the queue.
receive the message of TCP client. If a reply is received, the callback method is removed from the queue and the callback is executed. The content of the callback method is "response.write" ({"success": true}).
timer checks whether the request message is more than 40 seconds. If it is over, remove the callback method and execute it.
The part of the code is as follows:
setInterval(function () {
let cbs = [];
for (let key of callbackQueue.keys()) {
let item = callbackQueue.get(key)
if (item && Date.now() - item.time >= 40000) {
callbackQueue.delete(key);
cbs.push(item);
}
}
for (let i in cbs) {
cbs[i].cb.call(cbs[i].thisArg, new util.BusinessError('network fail', 4001));
}
}, 200);
thanks!
Edit by @fishrock123 - code formatting
Click the mistake, I'm sorry
I'm trying to reproduce this (with NODE_DEBUG=*
, and the code setInterval(() => console.log('ping', Date.now()), 500);
) - I'll see if I can hit it after a day or two. Otherwise I'm not sure how I can do "time dilation" and run it faster in a VM - I'll look into that.
Edit: there appears to be a way to test it in virtualbox if anyone is interested in taking a stab sooner: https://superuser.com/a/1022317/248212
This just became a problem recently? Like, in Node 10 ... or something else?
I'm not seeing anything in the 10.x code that jumps out at me but will continue investigating and try to reproduce.
@Fishrock123 Well for me I have not experienced anything like this on node <= 9.8 The next version I have used after that is 10.4.1 and there I got this problem after a few weeks (3-4) of run-time on several servers.
Today I will try and attach a debugger to one of my failing servers. I said before that new timers set by setTimeout() in api calls were executed correctly. This may have been a premature conclusion. Because it looks like this is not the case in two other API calls. I don't seem to get an error thrown though since the API call ends with success. Just the scheduled task does not run.
I will add any info I get from the debugging here
Code is running with all the debug stuff for 2 days now here - not sure how to debug it otherwise - if anyone has any experience "speeding time" on an emulated system please chime in.
Ok, so I have successfully been able to connect the debugger over a ssh tunnel. I have 3 timers I can access via the api. 2 is not working in the sense that I can step the code and see that setTimeout() is indeed called with sensible values and does not throw. But the callback is never called. One callback is working as expected which is a bit strange.
I called process._getActiveHandles(); before the setTimeout of the working timer and got 21 handles. After the call to setTimeout() I got 22 handles and the last handle in the list is my new timer which I can see from the msecs value.
Doing the same thing on a setTimeout that is not working results is: before setTimeout I have 23 active handles. after setTimeout I have 23 active handles and the last one in the list is NOT my timer. Probably why it is never called.
I do have several timers in the list already with timer values that I do recognize and the only thing I can see is really different from the timer that is still working and the one that is not is the timeout value. The working timer uses a unique timeout value only used for that timer. The timers that isn't working is using the same timeout values as several of my "background timers" is using. Not sure if this is a coincidence or not... but I can also see in the debugger that one and exactly one background timer is still working. That one also has a unique timeout value that no other times has in the code. All other timers uses "common values" such as 3000, 5000, 10000 and 30000 and more than one timer uses each value.
sounds pretty strange to me...
That sounds like a rescheduling/reuse/re-enroll bug. (tl;dr some path causes pooled timers to not execute or schedule)
@apapirovski You're probably most familiar with this code now, any ideas?
Nothing off the top of my head. Will investigate.
@tkarls can you elaborate on "background timers"? Are these using unref
or something? Any chance either me or @Fishrock123 could get access to the codebase? It would make it easier to know what I should be looking for in our code.
@apapirovski With background timer I mean that It's basically a function that does some task and then reschedule itself using setTimeout(). Not using unref(), only setTimeout and sometimes clearTimeout
Like this:
function doSomething() {
promiseReturningFunction().then((data) => {
//do some
}).catch((error) => {
//log error
}).then(() => {
setTimeout(doSomething, 5000);
});
}
I call the function once when the app is lifted and then they are supposed to keep running without being poked from the outside. Hence "backgrund task" but I never use anything else than setTimeout to do the scheduling.
The api timer for instance is set in response to an api call. For instance I have a call that enables the firewall that sets an timer to auto disable it unless the user confirms the action within 30 seconds. This timer is not working. The relevant code is as follows;
enableFirewall: (options) => {
return getFwStatus().then((before) => {
return execp('sudo ufw --force enable').then(()=>{
setDisableTimer(disableTime);
}).then(()=>{
return getFwStatus().then((after) => {
options.before = before;
options.after = after;
return addAuditLogEntry(options);
});
});
}).then(()=>{
return disableTime;
}).catch( (error) => {
options.name = 'firewall-error';
return disableFirewall(options).catch(() => {
return execp('sudo ufw disable')
}).then(() => {
return Promise.reject(error);
});
});
}
function setDisableTimer(timeout) {
cancelDisableTimer();
disableFirewallTimer = setTimeout(() => {
disableFirewall({ user: '--disable timer--'});
}, timeout);
}
function cancelDisableTimer() {
if (disableFirewallTimer != null) {
clearTimeout(disableFirewallTimer);
disableFirewallTimer = null;
}
}
function disableFirewall(options) {
var options = options || { user: '--unknown user--' }
return getFwStatus().then((before) => {
return execp('sudo ufw disable').then(() => {
return getFwStatus().then((after) => {
options.before = before;
options.after = after;
return addAuditLogEntry(options);
});
});
});
}
The api call will call enableFirewall and I do see when stepping the code that it reaches and executes without error
disableFirewallTimer = setTimeout(() => {
disableFirewall({ user: '--disable timer--'});
}, timeout);
and that the timeout parameter indeed is set to 30000 as I expected it to be. But a breakpoint in the anonymous function never get hit (and the firewall doesn't get called either).
One of the "background tasks" that are not working any more
function keepNtpStatusFresh() {
getNtpStatusOutputInternal().then((data) => {
lastNtpCheck.data = data;
lastNtpCheck.when = new Date();
}).catch((error) => {
lastNtpCheck.data = null;
sails.log.error(error);
}).then(() => {
setTimeout(keepNtpStatusFresh, 5000);
});
}
function getNtpStatusOutputInternal() {
return execp(sails.config.paneda.ntpStatus, { timeout: 30000 }).then((result) => {
var re = new RegExp(/^(.)([^\s=]+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)/gm);
var match = re.exec(result.stdout);
var servers = [];
while (match != null) {
servers.push({
type: match[1],
remote: match[2],
refid: match[3],
st: Number(match[4]),
t: match[5],
when: Number(match[6]),
poll: Number(match[7]),
reach: Number(match[8]),
delay: Number(match[9]),
offset: Number(match[10]),
jitter: Number(match[11]),
});
match = re.exec(result.stdout);
}
return { output: result.stdout, serverList: servers };
});
}
ping 1533808667121
TIMER 61156: 500 list wait because diff is -1
TIMER 61156: timeout callback 500
TIMER 61156: now: 116841691
My test based on the above is still running for several days now by the way.
Yes, also since our servers were restarted they are working fine for little over 10 days now...
I have tried to check when the servers were upgraded and when the error showed up on each server. It looks like they all failed after about 25 days of run-time.
If this was a totally random problem or some sort of race condition I would have expected one server to fail after a few days, one to fail after a few weeks and perhaps 1 should still be running. 100% of the upgraded servers did fail. They are also running in completely different environments.
My brain can't help noticing that 25 days is almost exactly the number of milliseconds that can fit into a signed 32 bit integer (2 160 000 000 vs 2 147 483 647 ). Not sure if this is a pure coincidence (especially since Javascript doesn't have signed 32 bit integers AFAIK) but I'm mentioning it anyway :)
@tkarls as far as I know livuv uses 64bit ints for timers from what I can tell and Node just calls into libuv for checking.
Maybe @cjihrig or @saghul will know? (since libuv timers)
(We do cap the timeout itself to 32bit integers but not the number of repeats).
I wonder if this correlates with a libuv update - I'd try and bisect but for a bug that takes ~25 days to reproduce I don't think it'll help.
@apapirovski @benjamingr @Fishrock123
I was just able to reproduce this with a smaller example 😁
I used libfaketime to speed things up by a factor of 1000 on a linux server and executed a test program containing functions scheduling themselves in the callback and printing the current date. The simulated start date is today (2018-08-09) and the last print is in september 03 wich happened to be 25 days!
Here is the script for reference:
function timer1 () {
console.log('timer1' + new Date().toString());
setTimeout(timer1, 500);
}
function timer2 () {
console.log('timer2' + new Date().toString());
setTimeout(timer2, 500);
}
function timer3 () {
console.log('timer3' + new Date().toString());
setTimeout(timer3, 500);
}
function timer4 () {
console.log('timer4' + new Date().toString());
setTimeout(timer4, 500);
}
function timer5 () {
console.log('timer5' + new Date().toString());
setTimeout(timer5, 1500);
}
function timer6 () {
console.log('timer6' + new Date().toString());
setTimeout(timer6, 500);
}
function timer7 () {
console.log('timer7' + new Date().toString());
setTimeout(timer7, 500);
}
function timer8 () {
console.log('timer8' + new Date().toString());
setTimeout(timer8, 500);
}
function timer9 () {
console.log('timer9' + new Date().toString());
setTimeout(timer9, 500);
}
function timer10 () {
console.log('timer10' + new Date().toString());
setTimeout(timer10, 500);
}
timer1();
timer2();
timer3();
timer4();
timer5();
timer6();
timer7();
timer8();
timer9();
timer10();
In the middle of september 3 the output stopped. No error other output in the console.
Here is how I started it:
faketime -f "+10 x1000" node timer_test
@benjamingr So, libuv... I'm not at all familiar with that lib. But I suppose the bug could be there instead...
I think I know the bug... confirming right now. Ugh.
@tkarls excellent work isolating it. libuv is the underlying i/o library Node uses (it was written for Node but is used by others now) - it sounds like Anatoli has some idea about where the bug is though :)
Great that the cause is identified now 👍
I wonder why all these strange errors always happen to me 😂
I'm very happy. Don't worry about the boss getting angry! 😁Thx👍 @tkarls @benjamingr @apapirovski
The fix landed and should be out at the next release. Thanks for the report!
For what it's worth you are not supposed to rely on timers staying running for a month. That said, I haven't run into this sort of thing before
So much for having a gameserver be up for over a month! The absolute horror!
So much for having a gameserver be up for over a month! The absolute horror!
Note that what I meant by that isn't that your services aren't expected to be up for a month. It's that redundancy is expected (2+ instances behind a load balancer) so servers are resilient to failure.
There are plenty of cases where you need to restart your server (sceptre/meltdown style attacks) regardless of Node. I was just pointing out that relying on it being up for a month isn't something I'd personally recommend doing.
The fix landed and should be out at the next release
@benjamingr What version(s) contain(s) (or will contain) this fix?
Node v10.9.0
Just use setTimeout
instead and use recursion to call it again.
@halis Did you read the thread? 😂
Do we know what versions of Node this is affecting? We're currently on 9.x and was going to start implementing some polling using setTimeout
/setInterval
and i'm worried we might hit this issue.
Also do we know if it's only when using setInterval
for a long period of time? In my case we will use setInterval
along with clearInterval
and i'm curious if that prevents this issue from occurring.
Hi, We have some similar issue but with node 8. Does this version is impacted by this bug and will it get a fix in a minor version ? Thanks ❤️
Hi, We have some similar issue but with node 8. Does this version is impacted by this bug and will it get a fix in a minor version ? Thanks ❤️
升级到10.9.0以上吧,就是18年8月15日以后的版本都可以
This is fixed in Node 10.9.0
This is fixed in Node 10.9.0
@peernohell I guess we're not getting the answer to our questions 😆
I’ve got the same issue with node.js 12.x. Either setInterval or setTimeout unexpectedly stop without any error after some few weeks.
@tkarls Excellent job isolating the problem! @benjamingr Thanks for the fix!
We know that Node 10.9 fixes the issue, what about Node 12 and Node 14, can you please comment on those?
I am having the same issue with timeouts in Node 12.18.3 After a couple of days, it just stops working without any error. Doesn't seem to be 25 days though
@paynebc Any chance that you could reproduce this with libfaketime
as tkarls
did above?
https://github.com/nodejs/node/issues/22149#issuecomment-411781909
I will give this a try
FYI:
I ran my script above with libfaketime again and it does not detect any problem on node 12.18.3 Still running and scheduling stuff for "3 years" now.
Interesting, it must be a different problem then. I'm thinking about renting a VPS and running like 5-10 major versions of NodeJS in parallel with libfaketime
. Would that bring a value to anyone?
I tried the script above now, too. It does indeed run totally fine, so whatever it is I have must be of different origin. But it does show exactly the same symptoms: every couple days, my main loop just stops working. I am using the following function and calling it with await in my main loop instead of passing my mainloop function directly to setTimeout
exports.sleep = function(millis) { return new Promise(resolve => setTimeout(resolve, millis)); };
I am now running my application using NODE_DEBUG=* in the hopes of finding anything useful.
It's interesting because you are using setTimeout
as opposed to setInterval
in this topic. Do I understand correctly that this is what your code looks like?
while (true) {
await sleep(500);
// …
}
Are you absolutely sure that it is the sleep
function where your code hangs? (Such as by putting a console.log before and after the sleep function?) Perhaps could it be something else in your loop?
yes this is how it works. I now added some more logging to find out if this is indeed where it hangs. I will report back once I have an idea what it might be
There is a setInterval method in the express project. When the project has been running for a month or so, there is no error, the project does not stop, but the callback of the setInterval is no longer executed.
like this : setInterval(function(){ //do samething console.log(Date.now()); },500); A month later, no output is available, but the application process is robust.