Closed spilchen closed 7 months ago
Is this related to the crash from yesterday?
Yes, we were running reconcile iterations back to back. We were allocating and freeing memory so quickly that garbage collection couldn't keep up, and eventually the OOM killer stepped in and killed the pod.
In ROSA (Red Hat OpenShift Service on AWS), we noticed that when setting up a network LoadBalancer, an annotation is automatically added to the Service object. This caused the operator to start a reconcile iteration. It would remove the annotation, only to have OpenShift add it back. So, the operator was stuck in a continuous reconcile loop.
To fix this, we now allow manual annotations to be added to Service objects. The operator only ensures that the annotations it generates have the correct values. It ignores any additional annotation that was added outside of the VerticaDB.
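A minimal sketch of that merge rule, for illustration only. It is not the operator's actual code: the function name `reconcileAnnotations` and the plain `map[string]string` inputs are assumptions made for this example. Annotations the operator generates are forced to their desired values, and everything else already on the Service is left alone, so an annotation added by the platform no longer triggers an update.

```go
package main

// reconcileAnnotations is a hypothetical helper, not taken from the operator.
// existing holds the annotations currently on the Service object.
// desired holds only the annotations the operator itself generates.
// It returns the annotation set to keep on the Service and reports whether
// an update is needed at all.
func reconcileAnnotations(existing, desired map[string]string) (map[string]string, bool) {
	merged := make(map[string]string, len(existing)+len(desired))

	// Keep every annotation that is already present, including ones that
	// were added outside of the operator (for example by the platform).
	for k, v := range existing {
		merged[k] = v
	}

	// Only operator-generated annotations are compared and corrected.
	changed := false
	for k, v := range desired {
		if cur, ok := existing[k]; !ok || cur != v {
			merged[k] = v
			changed = true
		}
	}
	return merged, changed
}
```

With this rule, a second pass over an already-correct Service reports no change, so an externally added annotation cannot cause the remove-and-re-add cycle described above.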
I am also cleaning up the PVC expansion events. These can generate quite a lot of noise about skipping expansion if we are continuously reconciling. The skip events have been changed to log entries instead.