The other thing that's a bit odd with the current design is that the job classes all reference `CronJob` as the base class to pick up the `Name`, `Group`, and `Type` properties as well as the `AddToSchedulerAsync()` and `DeleteFromSchedulerAsync()` methods. The current design has us create job instances that are never actually executed but instead are used for managing job schedules.
This seems like a bit of an anti-pattern and, in fact, I added `TerminatedPodGcDelayMilliseconds` and `TerminatedPodGcThresholdMinutes` properties to the `TerminatedPodGcJob` and set those on the global job, thinking that instance was going to be executed. After playing with Quartz a bit, I see that Quartz constructs a new job instance every time a job is executed, so `TerminatedPodGcDelayMilliseconds`/`TerminatedPodGcThresholdMinutes` will never be set the way I coded things.
- [x] I'm going to refactor this by removing the `CronJob` base class, relocating the scheduling code, and passing the `TerminatedPodGcDelayMilliseconds`/`TerminatedPodGcThresholdMinutes` properties in the parameter dictionary (see the sketch below).
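Roughly what I have in mind (a minimal sketch, assuming Quartz.NET's `JobDataMap` as the parameter dictionary; the key, identity, and method names here are illustrative, not the final code):

```csharp
using System.Threading.Tasks;
using Quartz;

// Sketch: pass the GC settings through the Quartz job data map instead of
// setting properties on a "global" job instance.  Quartz constructs a fresh
// TerminatedPodGcJob for every execution, so the executing instance reads its
// settings from the merged data map rather than from instance properties.
public class TerminatedPodGcJob : IJob
{
    public async Task Execute(IJobExecutionContext context)
    {
        var delayMilliseconds = context.MergedJobDataMap.GetLong("TerminatedPodGcDelayMilliseconds");
        var thresholdMinutes  = context.MergedJobDataMap.GetInt("TerminatedPodGcThresholdMinutes");

        // ...perform the terminated pod GC using these values...

        await Task.CompletedTask;
    }
}

// The scheduling code relocated out of the old CronJob base class:
public static class TerminatedPodGcScheduling
{
    public static async Task AddToSchedulerAsync(IScheduler scheduler, string cronSchedule, long delayMilliseconds, int thresholdMinutes)
    {
        var job = JobBuilder.Create<TerminatedPodGcJob>()
            .WithIdentity("terminated-pod-gc")
            .UsingJobData("TerminatedPodGcDelayMilliseconds", delayMilliseconds)
            .UsingJobData("TerminatedPodGcThresholdMinutes", thresholdMinutes)
            .Build();

        var trigger = TriggerBuilder.Create()
            .WithIdentity("terminated-pod-gc-trigger")
            .WithCronSchedule(cronSchedule)
            .Build();

        await scheduler.ScheduleJob(job, trigger);
    }
}
```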
- [x] All of the jobs run immediately first and then start running on the CRON schedule. This doesn't really make sense. For example, we don't want to run the terminated pod GC job just because neon-cluster-operator was rescheduled; we have the CRON schedules for a reason. Other jobs like certificate renewal run once a week, so it really doesn't make sense to execute them just because the operator restarted.
Hopefully, cluster setup isn't depending on the cluster operator promptly executing jobs because that's pretty fragile and this change will probably break things. We'll address any problems like this by having cluster setup do any configuration explicitly.
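For the explicit case, a sketch of what that might look like (assuming Quartz.NET and the hypothetical "terminated-pod-gc" job identity from the sketch above):

```csharp
using System.Threading.Tasks;
using Quartz;

public static class ClusterSetupHooks
{
    // Instead of every job running once at operator startup, anything that
    // genuinely needs a job to run right away (e.g. cluster setup) would
    // request it explicitly; the CRON trigger remains the only scheduled one.
    public static Task RunTerminatedPodGcNowAsync(IScheduler scheduler)
    {
        return scheduler.TriggerJob(new JobKey("terminated-pod-gc"));
    }
}
```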
- [x] Random CRON schedule field multiple executions: we use our custom "R" field characters for some CRON fields to randomize when jobs execute to avoid the potential of having a large number of clusters doing things like transmitting telemetry pings at exactly the same time. This works OK, but it's possible for jobs to be re-run when leadership changes. Here's the scenario: each operator instance resolves the random "R" fields independently when it schedules the jobs, so when leadership changes, the new leader computes a different randomized schedule and a job that already ran for the current period can fire again at the new time.
I think the way to address this is by only resolving these random fields once, storing the resolved CRON schedule in `V1NeonClusterJobs.status`, and using these status CRON values when subsequent operator instances schedule jobs.
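A rough sketch of the "resolve once" idea (this handles only the bare "R" form our schedules use, and the `job.Status.ResolvedSchedule`/`job.Spec.Schedule` names in the usage note are hypothetical):

```csharp
using System;

// Resolve our custom "R" (random) fields in a Quartz-style CRON expression
// ("sec min hour day-of-month month day-of-week") exactly once, so every
// operator instance (and every new leader) schedules the job at the same
// randomized time.  The resolved string would then be persisted to
// V1NeonClusterJobs.status and reused instead of being re-randomized.
public static class CronRandomizer
{
    // Inclusive (min, max) ranges for the six Quartz CRON fields:
    // seconds, minutes, hours, day-of-month, month, day-of-week (1 = Sunday).
    private static readonly (int Min, int Max)[] fieldRanges =
    {
        (0, 59),
        (0, 59),
        (0, 23),
        (1, 28),    // capped at 28 so the day exists in every month
        (1, 12),
        (1, 7)
    };

    public static string ResolveRandomFields(string schedule, Random random = null)
    {
        random ??= new Random();

        var fields = schedule.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        for (int i = 0; i < fields.Length && i < fieldRanges.Length; i++)
        {
            if (fields[i] == "R")
            {
                var (min, max) = fieldRanges[i];

                fields[i] = random.Next(min, max + 1).ToString();
            }
        }

        return string.Join(" ", fields);
    }
}

// Usage: resolve once and persist, then always schedule from the persisted value
// (property names here are hypothetical):
//
//      var resolved = job.Status.ResolvedSchedule
//          ?? CronRandomizer.ResolveRandomFields(job.Spec.Schedule);   // write this back to status
```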
- [x] Some jobs are not updating their status in `V1NeonClusterJobs`:
- [x] neon-cluster-operator needs RBAC access to namespaces and pods for the `TerminatedPodGcJob`.
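Roughly the ClusterRole rules I expect the job to need (a sketch only; the actual role name and verb list in the operator manifests may end up different, and `delete` on pods is assumed because the job garbage-collects terminated pods):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: neon-cluster-operator-pod-gc
rules:
  - apiGroups: [""]
    resources: ["namespaces"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch", "delete"]
```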
- [x] Implement the `LinuxSecurityPatch` job.
DONE
I've verified that cluster stabilization works and I've coded the neon-cluster-operator job but couldn't test that due to: https://github.com/nforgeio/neonKUBE/issues/1900