The other thing that's a bit odd with the current design is that the job classes all reference `CronJob` as the base class to pick up the `Name`, `Group`, and `Type` properties as well as the `AddToSchedulerAsync()` and `DeleteFromSchedulerAsync()` methods. The current design has us create job instances that are never actually executed but instead are used for managing job schedules.
This seems like a bit of an anti-pattern and, in fact, I added `TerminatedPodGcDelayMilliseconds` and `TerminatedPodGcThresholdMinutes` properties to the `TerminatedPodGcJob` and set those on the global job, thinking that instance was going to be executed. After playing with Quartz a bit, I see that Quartz constructs a new job instance every time a job is executed, so `TerminatedPodGcDelayMilliseconds`/`TerminatedPodGcThresholdMinutes` will never be set the way I coded things.
- [x] I'm going to refactor this by removing the `CronJob` base class, relocating the scheduling code, and passing the `TerminatedPodGcDelayMilliseconds`/`TerminatedPodGcThresholdMinutes` properties in the parameter dictionary (see the sketch below).
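Roughly what I have in mind (a minimal sketch, assuming Quartz.NET's `JobDataMap` as the parameter dictionary; the key, identity, and method names here are illustrative, not the final code):

```csharp
using System.Threading.Tasks;
using Quartz;

// Sketch: pass the GC settings through the Quartz job data map instead of
// setting properties on a "global" job instance.  Quartz constructs a fresh
// TerminatedPodGcJob for every execution, so the executing instance reads its
// settings from the merged data map rather than from instance properties.
public class TerminatedPodGcJob : IJob
{
    public async Task Execute(IJobExecutionContext context)
    {
        var delayMilliseconds = context.MergedJobDataMap.GetLong("TerminatedPodGcDelayMilliseconds");
        var thresholdMinutes  = context.MergedJobDataMap.GetInt("TerminatedPodGcThresholdMinutes");

        // ...perform the terminated pod GC using these values...

        await Task.CompletedTask;
    }
}

// The scheduling code relocated out of the old CronJob base class:
public static class TerminatedPodGcScheduling
{
    public static async Task AddToSchedulerAsync(IScheduler scheduler, string cronSchedule, long delayMilliseconds, int thresholdMinutes)
    {
        var job = JobBuilder.Create<TerminatedPodGcJob>()
            .WithIdentity("terminated-pod-gc")
            .UsingJobData("TerminatedPodGcDelayMilliseconds", delayMilliseconds)
            .UsingJobData("TerminatedPodGcThresholdMinutes", thresholdMinutes)
            .Build();

        var trigger = TriggerBuilder.Create()
            .WithIdentity("terminated-pod-gc-trigger")
            .WithCronSchedule(cronSchedule)
            .Build();

        await scheduler.ScheduleJob(job, trigger);
    }
}
```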
- [x] All of the jobs run immediately first and then start running on the CRON schedule. This doesn't really make sense. For example, we don't want to run the terminated pod GC job just because neon-cluster-operator was rescheduled; we have the CRON schedules for a reason. Other jobs like certificate renewal run once a week, so it really doesn't make sense to execute them just because the operator restarted.
Hopefully, cluster setup isn't depending on the cluster operator promptly executing jobs because that's pretty fragile and this change will probably break things. We'll address any problems like this by having cluster setup do any configuration explicitly.
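For the explicit case, a sketch of what that might look like (assuming Quartz.NET and the hypothetical "terminated-pod-gc" job identity from the sketch above):

```csharp
using System.Threading.Tasks;
using Quartz;

public static class ClusterSetupHooks
{
    // Instead of every job running once at operator startup, anything that
    // genuinely needs a job to run right away (e.g. cluster setup) would
    // request it explicitly; the CRON trigger remains the only scheduled one.
    public static Task RunTerminatedPodGcNowAsync(IScheduler scheduler)
    {
        return scheduler.TriggerJob(new JobKey("terminated-pod-gc"));
    }
}
```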
- [x] Random CRON schedule field multiple executions: we use our custom "R" field characters for some CRON fields to randomize when jobs execute to avoid the potential of having a large number of clusters doing things like transmitting telemetry pings at exactly the same time. This works OK, but it's possible for jobs to be re-run when leadership changes. Here's the scenario: each operator instance resolves the random "R" fields independently when it schedules the jobs, so when leadership changes, the new leader computes a different randomized schedule and a job that already ran for the current period can fire again at the new time.
I think the way to address this is by only resolving these random fields once, storing the resolved CRON schedule in `V1NeonClusterJobs.status`, and using these status CRON values when subsequent operator instances schedule jobs.
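A rough sketch of the "resolve once" idea (this handles only the bare "R" form our schedules use, and the `job.Status.ResolvedSchedule`/`job.Spec.Schedule` names in the usage note are hypothetical):

```csharp
using System;

// Resolve our custom "R" (random) fields in a Quartz-style CRON expression
// ("sec min hour day-of-month month day-of-week") exactly once, so every
// operator instance (and every new leader) schedules the job at the same
// randomized time.  The resolved string would then be persisted to
// V1NeonClusterJobs.status and reused instead of being re-randomized.
public static class CronRandomizer
{
    // Inclusive (min, max) ranges for the six Quartz CRON fields:
    // seconds, minutes, hours, day-of-month, month, day-of-week (1 = Sunday).
    private static readonly (int Min, int Max)[] fieldRanges =
    {
        (0, 59),
        (0, 59),
        (0, 23),
        (1, 28),    // capped at 28 so the day exists in every month
        (1, 12),
        (1, 7)
    };

    public static string ResolveRandomFields(string schedule, Random random = null)
    {
        random ??= new Random();

        var fields = schedule.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        for (int i = 0; i < fields.Length && i < fieldRanges.Length; i++)
        {
            if (fields[i] == "R")
            {
                var (min, max) = fieldRanges[i];

                fields[i] = random.Next(min, max + 1).ToString();
            }
        }

        return string.Join(" ", fields);
    }
}

// Usage: resolve once and persist, then always schedule from the persisted value
// (property names here are hypothetical):
//
//      var resolved = job.Status.ResolvedSchedule
//          ?? CronRandomizer.ResolveRandomFields(job.Spec.Schedule);   // write this back to status
```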
- [x] Some jobs are not updating their status in `V1NeonClusterJobs`:
- [x] neon-cluster-operator needs RBAC access to namespaces and pods for the `TerminatedPodGcJob`.
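Roughly the ClusterRole rules I expect the job to need (a sketch only; the actual role name and verb list in the operator manifests may end up different, and `delete` on pods is assumed because the job garbage-collects terminated pods):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: neon-cluster-operator-pod-gc
rules:
  - apiGroups: [""]
    resources: ["namespaces"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch", "delete"]
```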
- [x] Implement the `LinuxSecurityPatch` job.
DONE
I've verified that cluster stabilization works and I've coded the neon-cluster-operator job but couldn't test that due to: https://github.com/nforgeio/neonKUBE/issues/1900