dropthemic opened this issue 6 years ago
Thank you for the bug report. I think there are two separate issues going on here. The first one sounds like filesystem corruption, which can happen if Cronicle crashes, is shut down uncleanly, or the server is rebooted while data is being written to disk. Unfortunately, the only real fix here is to delete your historical job data. If you don't want to delete your individual job logs and metadata, you can just delete and recreate the main completed list itself:
/opt/cronicle/bin/control.sh stop
/opt/cronicle/bin/storage-cli.js list_delete logs/completed
/opt/cronicle/bin/storage-cli.js list_create logs/completed
/opt/cronicle/bin/control.sh start
The second issue you reported is quite concerning (the ticker stopping), as I have just now run into this same thing on 4 different servers, all CentOS and all Node.js v10. I am actively investigating it, but it is very difficult to track down, because at least in my case it takes 20+ days of continuous running to reproduce it.
I'll let you know if/when I resolve this, and I am going to add a large, loud disclaimer to the top of the main repository README, alerting people to this one. This is really bad.
Here is more information about the bug:
Okay, so I am seeing more and more evidence that the root cause is in fact the master setInterval() timer, which actually lives in pixl-server, not even in Cronicle. Cronicle just uses it for a variety of things.
(1) The detached job queue system in Cronicle, which polls the /opt/cronicle/queue directory every minute, looking for detached job update JSON files, was clearly not working on bidi or dev05. This system works off the master pixl-server 1s tick timer.
(2) Cronicle's daily log rotation system, which runs on every server and also uses the pixl-server 1s tick timer (specifically the "day" event emitted at midnight), was not working on dev05 or bidi. Both still had old logs from August 21 (meaning the "day" tick did not fire at midnight on Aug 22). This is more evidence pointing directly at the master pixl-server 1s tick timer.
(3) When I logged into Cronicle on Stage, I noticed that the clock in the browser UI was not ticking up. The animated clock logo was not moving, and the time in the top-right was not changing. These are all driven by server-sent events that work off the master pixl-server 1s tick timer.
These all point towards the master pixl-server 1s tick timer magically stopping for some reason. This is a global setInterval() timer that is set up at startup, and lives forever, until shutdown. I cannot fathom anything in code ever canceling this interval.
https://github.com/jhuckaby/pixl-server/blob/master/server.js#L272
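For context, that timer pattern looks roughly like this (a simplified sketch, not the actual pixl-server code; the class name and details are just for illustration, see the link above for the real implementation):

// Simplified sketch of a pixl-server style master tick timer (illustrative only).
const EventEmitter = require('events');

class TickServer extends EventEmitter {
  start() {
    const now = new Date();
    this.lastMinute = now.getMinutes();
    this.lastDay = now.getDate();
    // One global 1-second interval, created at startup and never cancelled
    // until shutdown. Everything else hangs off the events it emits.
    this.tickTimer = setInterval(() => this.tick(), 1000);
  }

  tick() {
    const now = new Date();
    this.emit('tick', now); // UI clock, scheduler, etc.

    if (now.getMinutes() !== this.lastMinute) {
      this.lastMinute = now.getMinutes();
      this.emit('minute', now); // e.g. detached job queue polling
    }

    if (now.getDate() !== this.lastDay) {
      this.lastDay = now.getDate();
      this.emit('day', now); // e.g. daily log rotation at midnight
    }
  }

  stop() {
    clearInterval(this.tickTimer); // only ever called at shutdown
  }
}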
Okay, I am seeing strong evidence that all the Node.js timers were failing. Check this out. In Stage, I see this logged:
[1534858789.948][2018-08-21 06:39:49][devrelease.xxx][Cronicle][debug][5][Marking slave as disabled: mgr01.xxx][]
[1534858789.955][2018-08-21 06:39:49][devrelease.xxx][Cronicle][debug][5][Marking slave as disabled: mgr02.xxx][]
This is right when the master server lost connectivity with mgr01 and mgr02. Now, the Cronicle code automatically tries to reconnect to lost slaves right after this happens, and it retries indefinitely. But that never happened. It never even tried once.
Check out the code: https://github.com/jhuckaby/Cronicle/blob/master/lib/comm.js#L231-L237
That bit of code never ran in Stage (i.e. it never logged "Reconnecting to slave:"), and this runs under its own separate setTimeout() call (i.e. not the pixl-server 1s interval timer). Meaning, this is evidence that the entire Node.js timer system got borked.
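The reconnect pattern is essentially a self-rescheduling setTimeout loop, roughly like this (a simplified sketch of the pattern, not the actual comm.js code; connectToSlave() is a hypothetical placeholder):

// Simplified sketch of the "reconnect to lost slave" pattern (illustrative only).
function scheduleReconnect(slave, delayMs) {
  // Note: this uses its own setTimeout, completely independent of the
  // pixl-server 1-second interval timer.
  setTimeout(function () {
    console.log("Reconnecting to slave: " + slave.hostname);
    connectToSlave(slave, function (err) {
      // On failure, keep retrying indefinitely.
      if (err) scheduleReconnect(slave, delayMs);
    });
  }, delayMs || 5000);
}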
Furthermore, the other question is WHY it lost connectivity with mgr01/mgr02 at the same time as everything else. Well, I believe the answer is again timers:
[1534858789.948][2018-08-21 06:39:49][devrelease.xxx][Error][error][server][Slave disconnected unexpectedly: mgr01.xxx][]
[1534858789.954][2018-08-21 06:39:49][devrelease.xxx][Error][error][server][Slave disconnected unexpectedly: mgr02.xxx][]
As you can see here, devrelease (the Cronicle master) says that mgr01 and mgr02 disconnected unexpectedly. But look at what mgr01 and mgr02 have to say about this:
[1534858789.946][2018-08-21 06:39:49][mgr02.xxx][Cronicle][debug][5][Socket.io client disconnected: _IXAzsPZvXRS6aEpAAAA (IP: ::ffff:100.xx)][]
[1534858789.946][2018-08-21 06:39:49][mgr01.xxx][Cronicle][debug][5][Socket.io client disconnected: oyMlpxiFAbZp4uSzAAAA (IP: ::ffff:100.xx)][]
They say that the MASTER disconnected from THEM! So both are pointing the finger at the other.
Here is what I believe actually happened. Socket.io fired a "disconnect" event for both servers on the master, but it never actually lost connectivity. It disconnected because of a loss of heartbeat. Socket.io has its own heartbeat "echo packet" system that keeps TCP connections alive, and it runs on its own setInterval. This also failed at the same time as everything else, resulting in the sockets hitting idle timeout and dying.
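For reference, socket.io's heartbeat is configured via ping options on the server object, roughly like this (example values and an assumed httpServer variable, not Cronicle's actual settings):

// Sketch of socket.io's heartbeat configuration. The ping/pong exchange is
// itself timer-driven.
const io = require('socket.io')(httpServer, {
  pingInterval: 25000, // send a ping every 25 seconds
  pingTimeout: 60000   // consider the connection dead if no pong arrives in time
});

io.on('connection', function (socket) {
  socket.on('disconnect', function (reason) {
    // If the ping timers stop firing, the peer eventually times out and
    // reports a disconnect, even though the TCP link itself was never broken.
    console.log('client disconnected:', reason);
  });
});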
I still don't know WHY yet, but it looks like the entire Node.js timer subsystem failed. One reason could be that the event loop was totally CPU-bound, but I see no other evidence of that. The CPUs were asleep on all affected servers (dev05, bidi, and devrelease). If the main process was CPU-pinned (which would indeed cause timers to not fire), then we would have seen this in Performa. But no, the load averages were all very low all morning long.
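For what it's worth, a blocked event loop is easy to rule in or out with a tiny timer-drift probe like this sketch (my own illustration, not something Cronicle ships):

// Minimal event-loop lag probe: if the loop were CPU-pinned, the measured
// delay between ticks would grow far beyond the requested 1000ms. (If the
// timer subsystem itself dies, this stops logging entirely, which is its own
// signal.)
let last = Date.now();
setInterval(function () {
  const now = Date.now();
  const lag = now - last - 1000; // how late this tick fired
  if (lag > 500) console.warn('event loop lag: ' + lag + 'ms');
  last = now;
}, 1000);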
Investigation ongoing, but I don't have much else to go on at the moment. Thank you for your report, however, because that allows me to rule out AWS / S3 as a culprit.
I'm trying to delete the logs and I can't get past this error:
$ /opt/cronicle/bin/storage-cli.js list_delete logs/completed
[1535483122.294][ERROR][file] Failed to delete file: logs/completed/-1336: data/logs/f9/df/dd/f9dfdd4bb0628b185571c368b5be9cd0.json: File not found
/opt/cronicle/bin/storage-cli.js:396
if (err) throw err;
Ah crap, well that confirms my suspicion that the data got corrupted. The list header doesn't have the correct number of pages. This is awful, and I am so sorry this happened. For now, just to get you back up and running, you can simply delete the list header (i.e. leave all the pages) and I think it'll be smart enough to start over and ignore the mess left behind:
/opt/cronicle/bin/control.sh stop
/opt/cronicle/bin/storage-cli.js delete logs/completed
/opt/cronicle/bin/storage-cli.js list_create logs/completed
/opt/cronicle/bin/control.sh start
So instead of list_delete, just do a regular delete.
I am working on a new "transaction" system that should provide atomic list writes and automatic rollback on crash, so as to prevent things like this in the future. It's currently slated for v2.0, but in light of this I think I should bump up its priority.
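The usual building block for that kind of atomicity is write-to-temp-then-rename, along these lines (a sketch of the general technique, not the actual v2.0 design):

// Sketch of an atomic file write: write to a temp file, then rename it over
// the original. rename() is atomic on the same filesystem, so readers never
// see a half-written list page, and a crash mid-write leaves the original intact.
const fs = require('fs');

function writeFileAtomic(path, data, callback) {
  const tmp = path + '.tmp.' + process.pid;
  fs.writeFile(tmp, data, function (err) {
    if (err) return callback(err);
    fs.rename(tmp, path, callback);
  });
}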
Doing the "delete" worked perfectly thanks!
Regarding the ticker stopping: on my system the problem occurs every 30 days. For now, I have a cron task to recycle the Cronicle service.
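The cron entry itself is just a scheduled stop/start, something like this (the schedule is only an example; adjust the path if Cronicle is installed elsewhere):

# Recycle Cronicle once a week, well inside the ~30 day window where the ticker stops
0 4 * * 0 /opt/cronicle/bin/control.sh stop && /opt/cronicle/bin/control.sh start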
Looking forward to v2.0.
Thank you, @dropthemic, that timeline really helps me. So far I've seen this on 4 of my servers, but all of them had a very high uptime, like 27 days or something.
Do you recall if you ever saw this on Node versions prior to 10? I am trying to nail down exactly when this started, and I just recently upgraded to 10, so that is currently a suspect.
Thanks again for your time.
Looking at my yum history, I started with v0.10.48. Sorry, never tried prior to version 10.
Got it, thank you!
Hey @dropthemic, just FYI this looks like an actual bug in Node.js core! See #108 for details.
It's now fixed and landed, but won't be released until Node v10.10 (presumably). PR for fix: https://github.com/nodejs/node/pull/22214
Wow this is good news! Thanks for all your help and openness on changes.
Hey @dropthemic, just letting you know, it looks like the Node.js core timer bug is actually fixed in Node v10.9.0. I verified this myself using libfaketime and simulating 25 days of runtime in a few minutes. I was able to reproduce the bug in 10.5, but not on 10.9.
However, you reported Node version 10.9 in this issue report, so I cannot explain how that is possible. But, there are two separate issues reported here, the completed events paging issue (data corruption), and the Node timer bug. Is it possible that you actually haven't experienced the timer bug (Cronicle suddenly stops running jobs) on Node v10.9? Or do I have this all wrong?
Summary
In the "All Complete Jobs" section of UI, the jobs listed are stale and paging through the records stops after the "3rd" next page.
However, if I go to an individual event and view its job history, I can see the history. It could be an issue with my filesystem storage, but I'm not sure how to resolve it.
I also have an issue where Cronicle stops running tasks with no segfault or errors of any kind. Looking over my log files, I can see the scheduler tick never advances:
Another example of the scheduler not advancing:
Steps to reproduce the problem
N/A
Your Setup
Vanilla Install
Operating system and version?
CentOS 6.10
Node.js version?
v10.9.0
Cronicle software version?
Version 0.8.24
Are you using a multi-server setup, or just a single server?
Single
Are you using the filesystem as back-end storage, or S3/Couchbase?
FS
Can you reproduce the crash consistently?
Yes
Log Excerpts