nodejs / node

Node.js JavaScript runtime ✨🐢🚀✨
https://nodejs.org
Other
107.51k stars 29.56k forks source link

setInterval callback function unexpected halt #22149

Closed newbreach closed 6 years ago

newbreach commented 6 years ago

There is a setInterval method in the express project. When the project has been running for a month or so, there is no error, the project does not stop, but the callback of the setInterval is no longer executed.

like this : setInterval(function(){ //do samething console.log(Date.now()); },500); A month later, no output is available, but the application process is robust.

tkarls commented 6 years ago

This is interesting. I have an app with some code looking like this:

function doSomething() {
    promiseReturningFunction().then((data) => {
        //do some
    }).catch((error) => {
        //log error
    }).then(() => {
        setTimeout(doSomething, 5000);
    });
}

This is just meant to run these code blocks forever. Then suddenly a few installations stopped working recently. The app still responds to api requests and new timers that are set during for example an api call is executed.

However, the doSomething functions are no longer being exectuted. It is as if the event queue at some point just was emptied or something.

My installations that are having this problem are running node 10.4.1 or 10.6.0 depending on when they were upgraded.

A bug in the core nodejs is not my first thought but I cant find a logical explanation for my app to suddenly drop 5-6 timer constructs like this that has worked for a few years before....

targos commented 6 years ago

@nodejs/timers

benjamingr commented 6 years ago

For what it's worth you are not supposed to rely on timers staying running for a month. That said, I haven't run into this sort of thing before - even on Node servers with an uptime of more than a month.

@tkarls

The app still responds to api requests and new timers that are set during for example an api call is executed.

It is very likely that promiseReturningFunction never fulfilled or that you are returning a promise that was pending forever from either the catch or the then below it - or the then below is never executed.

Also note that is not a great way to write Node.js code (and this is core's fault IMO). Are you using async_hooks and some form of instrumentation? Can you check the logs for unhandledRejections?

tkarls commented 6 years ago

@benjamingr yea that would of course be the obvious conclusion and what I initially thought until I realized it wall all timers at once. I just can't figure out how that problem suddenly occurred on 6 different places in the code.

The "background tasks" are not depending on each other and hasn't been changed for a loong time. Had it been one timer that stopped that's one thing but all at the same time? This was also on 7 different servers running the same app in different locations and with a similar up-time. Like 3-4 weeks or so until the issue occurred.

Perhaps there are a better construct to keep some "background tasks" executing but I thought this was simple enough to do its job.

Not familiar with the async_hooks and logs are not saved (which is a flaw...) :(

benjamingr commented 6 years ago

@tkarls to be clear, I am not accusing you of writing code problematically - I am saying Node should guide users to safer ways to write code and make them "obvious".

If there is a way to get the "unhandledRejection" warnings then there is a very high chance you'll see it there. I've seen this before several times:

I'm getting "off" timer land and into "promise debugging land" here - just saying this is entirely on our radar and something we're thinking and talking about improving. Also it's possible your problem was entirely different and we do have a bug in setInterval - so take everything I say (guess) with a grain of salt.

Perhaps there are a better construct to keep some "background tasks" executing but I thought this was simple enough to do its job.

Typically, not on the server and using something that does scheduling - either what your cloud provides, crons if you're on a VM, k8s/cron jobs etc. I also warmly recommend turning on warnings (in production) for safety and using an APM (application performance monitoring) tool and setting up alerts for uptime and certain tasks.

Again, I strongly believe this is an issue with Node's user experience (not your or most people's code) and I feel strongly that Node (well, us) should fix it.

newbreach commented 6 years ago

As tkarls says, I have a similar situation; @benjamingr thank you for your advice on coding, but I can't use cron because I use doSomething to trigger about 400-500 milliseconds.

benjamingr commented 6 years ago

@newbreach in a similar situation - what sort of code do you have in the setTimeout?

tkarls commented 6 years ago

@benjamingr no problem, I didn't feel particularly 'accused' because it it was an easy fix I could do to my code I want to know about it so I can fix the problem :D

Thanks for your advises. Using external schedule trigging had not even occurred to me! Today the node process is kept alive using a systemd service that limits memory and tasks and restarts the node process if needed while ensuring always up. I'm sure there are more advanced monitors.

But back to the issue at hand. I hear what you are saying about unhandled rejections but I do have that catch before the then that sets the timer. So any error or exception should be caught I think (I have double checked that all promises are returned and not just created locally too).

One of my "workers" query an api over the network. Another checks a local database and talks to a local service. A third one is querying the NTP status on the server etc. Most of the functions are rather small and easy to overview. So I find it improbable (but I suppose not impossible) that all these functions have a bug in them that manifest at the same time. This happened on several servers during the summer (running 10.4.1-10.6.0) but not on any of our 60+ servers running <= 9.8.0. Granted our sw is also newer on the newer installations but a few of the "workers" hasn't been changed in well over a year.

The only thing that doesn't indicate a bug/change in setTimeout/setInterval to me is probably that there isn't 500 people reporting it already due to node being so popular. Which I can agree is a pretty strong indication too...

I do appreciate you insights and guesses though. That is all I'm doing right now too :)

newbreach commented 6 years ago

@benjamingr

My code logic is probably like this

  1. accept the HTTP request and send the message to the TCP client.

  2. the request callback method is added to the queue.

  3. receive the message of TCP client. If a reply is received, the callback method is removed from the queue and the callback is executed. The content of the callback method is "response.write" ({"success": true}).

  4. timer checks whether the request message is more than 40 seconds. If it is over, remove the callback method and execute it.

The part of the code is as follows:

setInterval(function () {
  let cbs = [];
  for (let key of callbackQueue.keys()) {
    let item = callbackQueue.get(key)
    if (item && Date.now() - item.time >= 40000) {
      callbackQueue.delete(key);
      cbs.push(item);
    }
  }
  for (let i in cbs) {
    cbs[i].cb.call(cbs[i].thisArg, new util.BusinessError('network fail', 4001));
  }
}, 200);

thanks!

Edit by @fishrock123 - code formatting

newbreach commented 6 years ago

Click the mistake, I'm sorry

benjamingr commented 6 years ago

I'm trying to reproduce this (with NODE_DEBUG=*, and the code setInterval(() => console.log('ping', Date.now()), 500);) - I'll see if I can hit it after a day or two. Otherwise I'm not sure how I can do "time dilation" and run it faster in a VM - I'll look into that.

Edit: there appears to be a way to test it in virtualbox if anyone is interested in taking a stab sooner: https://superuser.com/a/1022317/248212

Fishrock123 commented 6 years ago

This just became a problem recently? Like, in Node 10 ... or something else?

apapirovski commented 6 years ago

I'm not seeing anything in the 10.x code that jumps out at me but will continue investigating and try to reproduce.

tkarls commented 6 years ago

@Fishrock123 Well for me I have not experienced anything like this on node <= 9.8 The next version I have used after that is 10.4.1 and there I got this problem after a few weeks (3-4) of run-time on several servers.

Today I will try and attach a debugger to one of my failing servers. I said before that new timers set by setTimeout() in api calls were executed correctly. This may have been a premature conclusion. Because it looks like this is not the case in two other API calls. I don't seem to get an error thrown though since the API call ends with success. Just the scheduled task does not run.

I will add any info I get from the debugging here

benjamingr commented 6 years ago

Code is running with all the debug stuff for 2 days now here - not sure how to debug it otherwise - if anyone has any experience "speeding time" on an emulated system please chime in.

tkarls commented 6 years ago

Ok, so I have successfully been able to connect the debugger over a ssh tunnel. I have 3 timers I can access via the api. 2 is not working in the sense that I can step the code and see that setTimeout() is indeed called with sensible values and does not throw. But the callback is never called. One callback is working as expected which is a bit strange.

I called process._getActiveHandles(); before the setTimeout of the working timer and got 21 handles. After the call to setTimeout() I got 22 handles and the last handle in the list is my new timer which I can see from the msecs value.

Doing the same thing on a setTimeout that is not working results is: before setTimeout I have 23 active handles. after setTimeout I have 23 active handles and the last one in the list is NOT my timer. Probably why it is never called.

I do have several timers in the list already with timer values that I do recognize and the only thing I can see is really different from the timer that is still working and the one that is not is the timeout value. The working timer uses a unique timeout value only used for that timer. The timers that isn't working is using the same timeout values as several of my "background timers" is using. Not sure if this is a coincidence or not... but I can also see in the debugger that one and exactly one background timer is still working. That one also has a unique timeout value that no other times has in the code. All other timers uses "common values" such as 3000, 5000, 10000 and 30000 and more than one timer uses each value.

sounds pretty strange to me...

Fishrock123 commented 6 years ago

That sounds like a rescheduling/reuse/re-enroll bug. (tl;dr some path causes pooled timers to not execute or schedule)

@apapirovski You're probably most familiar with this code now, any ideas?

apapirovski commented 6 years ago

Nothing off the top of my head. Will investigate.

apapirovski commented 6 years ago

@tkarls can you elaborate on "background timers"? Are these using unref or something? Any chance either me or @Fishrock123 could get access to the codebase? It would make it easier to know what I should be looking for in our code.

tkarls commented 6 years ago

@apapirovski With background timer I mean that It's basically a function that does some task and then reschedule itself using setTimeout(). Not using unref(), only setTimeout and sometimes clearTimeout

Like this:

function doSomething() {
    promiseReturningFunction().then((data) => {
        //do some
    }).catch((error) => {
        //log error
    }).then(() => {
        setTimeout(doSomething, 5000);
    });
}

I call the function once when the app is lifted and then they are supposed to keep running without being poked from the outside. Hence "backgrund task" but I never use anything else than setTimeout to do the scheduling.

The api timer for instance is set in response to an api call. For instance I have a call that enables the firewall that sets an timer to auto disable it unless the user confirms the action within 30 seconds. This timer is not working. The relevant code is as follows;

enableFirewall: (options) => {
        return getFwStatus().then((before) => {
            return execp('sudo ufw --force enable').then(()=>{
                setDisableTimer(disableTime);
            }).then(()=>{
                return getFwStatus().then((after) => {
                    options.before = before;
                    options.after = after;
                    return addAuditLogEntry(options);
                }); 
            });
        }).then(()=>{
            return disableTime;
        }).catch( (error) => {
            options.name = 'firewall-error';
            return disableFirewall(options).catch(() => {
                return execp('sudo ufw disable')
            }).then(() => {
                return Promise.reject(error);
            });
        });
    }

function setDisableTimer(timeout) {
    cancelDisableTimer();
    disableFirewallTimer = setTimeout(() => {
        disableFirewall({ user: '--disable timer--'});
    }, timeout);
}

function cancelDisableTimer() {
    if (disableFirewallTimer != null) {
        clearTimeout(disableFirewallTimer);
        disableFirewallTimer = null;
    }
}

function disableFirewall(options) {
    var options = options || { user: '--unknown user--' }
    return getFwStatus().then((before) => {
        return execp('sudo ufw disable').then(() => {
            return getFwStatus().then((after) => {
                options.before = before;
                options.after = after;
                return addAuditLogEntry(options);
            });
        });
    });
}

The api call will call enableFirewall and I do see when stepping the code that it reaches and executes without error

disableFirewallTimer = setTimeout(() => {
        disableFirewall({ user: '--disable timer--'});
    }, timeout);

and that the timeout parameter indeed is set to 30000 as I expected it to be. But a breakpoint in the anonymous function never get hit (and the firewall doesn't get called either).

One of the "background tasks" that are not working any more

function keepNtpStatusFresh() {
    getNtpStatusOutputInternal().then((data) => {
        lastNtpCheck.data = data;
        lastNtpCheck.when = new Date();
    }).catch((error) => {
        lastNtpCheck.data = null;
        sails.log.error(error);
    }).then(() => {
        setTimeout(keepNtpStatusFresh, 5000);
    });
}

function getNtpStatusOutputInternal() {
    return execp(sails.config.paneda.ntpStatus, { timeout: 30000 }).then((result) => {
        var re = new RegExp(/^(.)([^\s=]+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)/gm);
        var match = re.exec(result.stdout);

        var servers = [];
        while (match != null) {
            servers.push({
                type: match[1],
                remote: match[2],
                refid: match[3],
                st: Number(match[4]),
                t: match[5],
                when: Number(match[6]),
                poll: Number(match[7]),
                reach: Number(match[8]),
                delay: Number(match[9]),
                offset: Number(match[10]),
                jitter: Number(match[11]),
            });
            match = re.exec(result.stdout);
        }
        return { output: result.stdout, serverList: servers };
    });
} 
benjamingr commented 6 years ago
ping 1533808667121
TIMER 61156: 500 list wait because diff is -1
TIMER 61156: timeout callback 500
TIMER 61156: now: 116841691

My test based on the above is still running for several days now by the way.

tkarls commented 6 years ago

Yes, also since our servers were restarted they are working fine for little over 10 days now...

I have tried to check when the servers were upgraded and when the error showed up on each server. It looks like they all failed after about 25 days of run-time.

If this was a totally random problem or some sort of race condition I would have expected one server to fail after a few days, one to fail after a few weeks and perhaps 1 should still be running. 100% of the upgraded servers did fail. They are also running in completely different environments.

My brain can't help noticing that 25 days is almost exactly the number of milliseconds that can fit into a signed 32 bit integer (2 160 000 000 vs 2 147 483 647 ). Not sure if this is a pure coincidence (especially since Javascript doesn't have signed 32 bit integers AFAIK) but I'm mentioning it anyway :)

benjamingr commented 6 years ago

@tkarls as far as I know livuv uses 64bit ints for timers from what I can tell and Node just calls into libuv for checking.

Maybe @cjihrig or @saghul will know? (since libuv timers)

(We do cap the timeout itself to 32bit integers but not the number of repeats).

I wonder if this correlates with a libuv update - I'd try and bisect but for a bug that takes ~25 days to reproduce I don't think it'll help.

tkarls commented 6 years ago

@apapirovski @benjamingr @Fishrock123

I was just able to reproduce this with a smaller example 😁

I used libfaketime to speed things up by a factor of 1000 on a linux server and executed a test program containing functions scheduling themselves in the callback and printing the current date. The simulated start date is today (2018-08-09) and the last print is in september 03 wich happened to be 25 days!

Here is the script for reference:

function timer1 () {
    console.log('timer1' + new Date().toString());
    setTimeout(timer1, 500);
}

function timer2 () {
    console.log('timer2' + new Date().toString());
    setTimeout(timer2, 500);
}

function timer3 () {
    console.log('timer3' + new Date().toString());
    setTimeout(timer3, 500);
}

function timer4 () {
    console.log('timer4' + new Date().toString());
    setTimeout(timer4, 500);
}

function timer5 () {
    console.log('timer5' + new Date().toString());
    setTimeout(timer5, 1500);
}

function timer6 () {
    console.log('timer6' + new Date().toString());
    setTimeout(timer6, 500);
}

function timer7 () {
    console.log('timer7' + new Date().toString());
    setTimeout(timer7, 500);
}

function timer8 () {
    console.log('timer8' + new Date().toString());
    setTimeout(timer8, 500);
}

function timer9 () {
    console.log('timer9' + new Date().toString());
    setTimeout(timer9, 500);
}
function timer10 () {
    console.log('timer10' + new Date().toString());
    setTimeout(timer10, 500);
}

timer1();
timer2();
timer3();
timer4();
timer5();
timer6();
timer7();
timer8();
timer9();
timer10();

In the middle of september 3 the output stopped. No error other output in the console.

Here is how I started it: faketime -f "+10 x1000" node timer_test

@benjamingr So, libuv... I'm not at all familiar with that lib. But I suppose the bug could be there instead...

apapirovski commented 6 years ago

I think I know the bug... confirming right now. Ugh.

benjamingr commented 6 years ago

@tkarls excellent work isolating it. libuv is the underlying i/o library Node uses (it was written for Node but is used by others now) - it sounds like Anatoli has some idea about where the bug is though :)

tkarls commented 6 years ago

Great that the cause is identified now 👍

I wonder why all these strange errors always happen to me 😂

newbreach commented 6 years ago

I'm very happy. Don't worry about the boss getting angry! 😁Thx👍 @tkarls @benjamingr @apapirovski

benjamingr commented 6 years ago

The fix landed and should be out at the next release. Thanks for the report!

ghost commented 6 years ago

For what it's worth you are not supposed to rely on timers staying running for a month. That said, I haven't run into this sort of thing before

So much for having a gameserver be up for over a month! The absolute horror!

benjamingr commented 6 years ago

So much for having a gameserver be up for over a month! The absolute horror!

Note that what I meant by that isn't that your services aren't expected to be up for a month. It's that redundancy is expected (2+ instances behind a load balancer) so servers are resilient to failure.

There are plenty of cases where you need to restart your server (sceptre/meltdown style attacks) regardless of Node. I was just pointing out that relying on it being up for a month isn't something I'd personally recommend doing.

murrayju commented 6 years ago

The fix landed and should be out at the next release

@benjamingr What version(s) contain(s) (or will contain) this fix?

benjamingr commented 6 years ago

Node v10.9.0

halis commented 6 years ago

Just use setTimeout instead and use recursion to call it again.

tkarls commented 6 years ago

@halis Did you read the thread? 😂

toymachiner62 commented 5 years ago

Do we know what versions of Node this is affecting? We're currently on 9.x and was going to start implementing some polling using setTimeout/setInterval and i'm worried we might hit this issue.

Also do we know if it's only when using setInterval for a long period of time? In my case we will use setInterval along with clearInterval and i'm curious if that prevents this issue from occurring.

peernohell commented 5 years ago

Hi, We have some similar issue but with node 8. Does this version is impacted by this bug and will it get a fix in a minor version ? Thanks ❤️

newbreach commented 5 years ago

Hi, We have some similar issue but with node 8. Does this version is impacted by this bug and will it get a fix in a minor version ? Thanks ❤️

升级到10.9.0以上吧,就是18年8月15日以后的版本都可以

Fishrock123 commented 5 years ago

This is fixed in Node 10.9.0

toymachiner62 commented 5 years ago

This is fixed in Node 10.9.0

@peernohell I guess we're not getting the answer to our questions 😆

nhathongly commented 4 years ago

I’ve got the same issue with node.js 12.x. Either setInterval or setTimeout unexpectedly stop without any error after some few weeks.

adams-family commented 4 years ago

@tkarls Excellent job isolating the problem! @benjamingr Thanks for the fix!

We know that Node 10.9 fixes the issue, what about Node 12 and Node 14, can you please comment on those?

paynebc commented 4 years ago

I am having the same issue with timeouts in Node 12.18.3 After a couple of days, it just stops working without any error. Doesn't seem to be 25 days though

adams-family commented 4 years ago

@paynebc Any chance that you could reproduce this with libfaketime as tkarls did above?

https://github.com/nodejs/node/issues/22149#issuecomment-411781909

paynebc commented 4 years ago

I will give this a try

tkarls commented 4 years ago

FYI:

I ran my script above with libfaketime again and it does not detect any problem on node 12.18.3 Still running and scheduling stuff for "3 years" now.

adams-family commented 4 years ago

Interesting, it must be a different problem then. I'm thinking about renting a VPS and running like 5-10 major versions of NodeJS in parallel with libfaketime. Would that bring a value to anyone?

paynebc commented 4 years ago

I tried the script above now, too. It does indeed run totally fine, so whatever it is I have must be of different origin. But it does show exactly the same symptoms: every couple days, my main loop just stops working. I am using the following function and calling it with await in my main loop instead of passing my mainloop function directly to setTimeout

exports.sleep = function(millis) { return new Promise(resolve => setTimeout(resolve, millis)); };

I am now running my application using NODE_DEBUG=* in the hopes of finding anything useful.

adams-family commented 4 years ago

It's interesting because you are using setTimeout as opposed to setInterval in this topic. Do I understand correctly that this is what your code looks like?

while (true) {
  await sleep(500);
  // …
}

Are you absolutely sure that it is the sleep function where your code hangs? (Such as by putting a console.log before and after the sleep function?) Perhaps could it be something else in your loop?

paynebc commented 4 years ago

yes this is how it works. I now added some more logging to find out if this is indeed where it hangs. I will report back once I have an idea what it might be