sot / jobwatch

Watch files, database tables, and log files to ensure valid cron processing

Higher-frequency checking with alerts #9

Closed jeanconn closed 9 years ago

jeanconn commented 9 years ago

Add changes to support use as an hourly checker for replan central.

jeanconn commented 9 years ago

I think this is runnable in this state and I'll run it on the side now. If you don't like the H5 checks, they can each be changed to a FileWatch. If you don't like the iFOT checks, they can obviously be cut.

taldcroft commented 9 years ago

Looks great! The only thing I'm really not happy with is having skawatch classes be in days and the rest be in hours. I think the original unit of days is generally more appropriate. So for instance in arcwatch you could just keep everything in days but define a constant HOUR and specify things in multiples of that.
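A minimal sketch of the suggestion above: keep days as the base unit, define an `HOUR` constant, and express short thresholds as multiples of it. Only the `HOUR` idea comes from the comment; the names `MAX_AGES`, `is_stale`, and `format_age`, and the threshold values, are hypothetical and not taken from jobwatch.

```python
# Hypothetical illustration: days stay the fundamental unit, and HOUR is a
# convenience constant for writing sub-day thresholds.
HOUR = 1.0 / 24.0  # one hour, expressed in days

# Made-up thresholds, all stored in days.
MAX_AGES = {
    "hourly_product": 2 * HOUR,
    "six_hour_product": 6 * HOUR,
    "daily_product": 1.0,
}


def is_stale(age_days, max_age_days):
    """Return True if something of age `age_days` is older than allowed."""
    return age_days > max_age_days


def format_age(age_days):
    """Render an age given in days, switching to hours below one day."""
    if age_days < 1.0:
        return "{:.1f} hours".format(age_days / HOUR)
    return "{:.1f} days".format(age_days)
```

With this layout the existing day-based classes need no unit conversion, and an hourly checker reads naturally as `2 * HOUR` rather than `0.0833`.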

Also I'd be happier if arcwatch.py was named something more generic since there isn't anything in the code or design that is specific to arc files.

jeanconn commented 9 years ago

Sure. And setting the fundamental time unit to be the smaller one made more sense to me, but I suppose numeric stability is not an issue here.

taldcroft commented 9 years ago

We really should be using astropy quantities and Time anyway. Then any unit is fine. When 1.0 comes out we need to upgrade. astropy.time is a lot better than Chandra.Time in every way.

jeanconn commented 9 years ago

I couldn't think of other stuff we'd add to the checking to figure out another common denominator wrt a name, though I went for hourly_template.html, so I suppose it could be hourly_watch. That seems worse.

Whatever we call it, should it end up with its own task_schedule and such? What makes the most sense to you for how to run it?

taldcroft commented 9 years ago

I was thinking about hourly_watch as well. That actually conveys a lot of meaning that arcwatch would not, especially for anyone else. In all likelihood it will really be an hourly watch, but even if it drifts in frequency that wouldn't bother me too much. It's like how "weekly schedule" isn't really weekly, but it conveys a certain meaning.

taldcroft commented 9 years ago

One way to think about it is that, because you insisted on doing things correctly (:smile:), we are really checking the upstream products as well as arc (at least for web products with correct header info). So the scope is beyond just arc because we might report problems to MTA or SWPC.

taldcroft commented 9 years ago

Yes, this will probably need its own task_schedule and a separate task directory for the logs. I'm trying to remember if just setting the task in the config file to something else is sufficient. It might make sense to set this up as a persistent running job like arc.

jeanconn commented 9 years ago

Though as far as I can tell the only real benefit to a persistent job is that the task log rotation works better. We might not care about that for this task.

taldcroft commented 9 years ago

I think the other benefit is that you can have the checker cron job running on different machines for reliability, with only one instance getting the chance to actually run.

jeanconn commented 9 years ago

I think that could also be accomplished by setting up the cron start times to work reasonably with the heartbeat timeout, but maybe I'm missing a subtlety.

taldcroft commented 9 years ago

That would be an OK workaround if we didn't already have the capability built in to task schedule...

jeanconn commented 9 years ago

Isn't that how we use the capability in task_schedule even for the persistent case (offset cron job start times, reasonable heartbeat value)?

taldcroft commented 9 years ago

In the persistent case there's no fine tuning with the timeout because task schedule touches the heartbeat file every minute.
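For readers unfamiliar with the heartbeat idea discussed here, the following is a rough sketch of a heartbeat-file guard that lets only one of several redundant cron instances run. It is an assumption-laden illustration and not task_schedule's actual implementation: the function names, the timeout handling, and the use of file modification time are all hypothetical, and the check-then-touch step is not atomic.

```python
import os
import time


def heartbeat_is_fresh(path, timeout_sec, now=None):
    """Return True if the heartbeat file was touched within `timeout_sec`."""
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        # No heartbeat file yet, so no other instance is known to be alive.
        return False
    return (now - mtime) < timeout_sec


def touch(path, when=None):
    """Create the heartbeat file if needed and set its modification time."""
    with open(path, "a"):
        pass
    os.utime(path, None if when is None else (when, when))


def try_to_run(path, timeout_sec, job, now=None):
    """Run `job` unless another instance holds a fresh heartbeat."""
    if heartbeat_is_fresh(path, timeout_sec, now):
        return False  # another instance is (or recently was) active
    touch(path, now)
    job()
    return True
```

In a cron-only setup like this sketch, the timeout has to be chosen against the offsets between cron start times on the different machines. In the persistent case described above, a long-running process would refresh the heartbeat every minute, so the timeout only needs to exceed that refresh interval.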

jeanconn commented 9 years ago

Ah. Thanks. I'd forgotten that.

jeanconn commented 9 years ago

I think the hourly_watch portion of this is safe and effective. Do you have any concerns about the changes in jobwatch.py designed largely to make it easier to make a single status page? Otherwise, I think this PR can go.

taldcroft commented 9 years ago

:+1: