openfga / helm-charts

Official Helm charts for the OpenFGA project.
https://openfga.dev
Apache License 2.0

Upgrading openfga helm chart ends in failure due to migration job #69

Closed · jliedy closed this issue 10 months ago

jliedy commented 1 year ago

After installing openfga, running helm upgrade fails because the chart template tries to patch the existing migration Job.

Error: UPGRADE FAILED: cannot patch "openfga-migrate" with kind Job: Job.batch "openfga-migrate" is invalid: [spec.template.metadata.labels[controller-uid]: Required value: must be 'redacted', spec.template.metadata.labels[job-name]: Required value: must be 'openfga-migrate', spec.template.metadata.labels: Invalid value: map[string]string(nil): `selector` does not match template `labels`, spec.template: Invalid value: core.PodTemplateSpec{ObjectMeta:v1.ObjectMeta{Name:"", GenerateName:"", Namespace:"", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Spec:core.PodSpec{Volumes:[]core.Volume(nil), InitContainers:[]core.Container(nil), Containers:[]core.Container{core.Container{Name:"migrate-database", Image:"openfga/openfga:v1.3.1", Command:[]string(nil), Args:[]string{"migrate"}, WorkingDir:"", Ports:[]core.ContainerPort(nil), EnvFrom:[]core.EnvFromSource(nil), Env:[]core.EnvVar{core.EnvVar{Name:"OPENFGA_DATASTORE_ENGINE", Value:"postgres", ValueFrom:(*core.EnvVarSource)(nil)}, core.EnvVar{Name:"OPENFGA_DATASTORE_URI", Value:"redacted", ValueFrom:(*core.EnvVarSource)(nil)}}, Resources:core.ResourceRequirements{Limits:core.ResourceList(nil), Requests:core.ResourceList(nil)}, VolumeMounts:[]core.VolumeMount(nil), VolumeDevices:[]core.VolumeDevice(nil), LivenessProbe:(*core.Probe)(nil), ReadinessProbe:(*core.Probe)(nil), StartupProbe:(*core.Probe)(nil), Lifecycle:(*core.Lifecycle)(nil), TerminationMessagePath:"/dev/termination-log", TerminationMessagePolicy:"File", ImagePullPolicy:"IfNotPresent", SecurityContext:(*core.SecurityContext)(0xc0b2d8d9e0), Stdin:false, StdinOnce:false, TTY:false}}, EphemeralContainers:[]core.EphemeralContainer(nil), RestartPolicy:"Never", TerminationGracePeriodSeconds:(*int64)(0xc06db168c8), ActiveDeadlineSeconds:(*int64)(nil), DNSPolicy:"ClusterFirst", NodeSelector:map[string]string(nil), ServiceAccountName:"openfga", AutomountServiceAccountToken:(*bool)(nil), NodeName:"", SecurityContext:(*core.PodSecurityContext)(0xc08afaddd0), ImagePullSecrets:[]core.LocalObjectReference(nil), Hostname:"", Subdomain:"", SetHostnameAsFQDN:(*bool)(nil), Affinity:(*core.Affinity)(nil), SchedulerName:"default-scheduler", Tolerations:[]core.Toleration(nil), HostAliases:[]core.HostAlias(nil), PriorityClassName:"", Priority:(*int32)(nil), PreemptionPolicy:(*core.PreemptionPolicy)(nil), DNSConfig:(*core.PodDNSConfig)(nil), ReadinessGates:[]core.PodReadinessGate(nil), RuntimeClassName:(*string)(nil), Overhead:core.ResourceList(nil), EnableServiceLinks:(*bool)(nil), TopologySpreadConstraints:[]core.TopologySpreadConstraint(nil), OS:(*core.PodOS)(nil)}}: field is immutable]

As far as I'm aware, a Job's pod template is immutable once the Job has been created, which is why the patch fails. You can use Helm hooks to have the job delete itself after a successful run: https://helm.sh/docs/topics/charts_hooks/

I'll have a pull request completed momentarily to have this added.
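For illustration, a minimal sketch of what a hook-based migrate job could look like (the hook weight and delete policy here are assumptions, not necessarily what the PR ends up using):

```yaml
# Sketch of templates/job-migrate.yaml with Helm hook annotations.
# "before-hook-creation" deletes any leftover job before the hook runs, and
# "hook-succeeded" removes it after a successful migration, so there is never
# an existing Job for `helm upgrade` to patch.
apiVersion: batch/v1
kind: Job
metadata:
  name: openfga-migrate
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-weight": "0"
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate-database
          image: openfga/openfga:v1.3.1
          args: ["migrate"]
```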

michael-lavacca-rw commented 10 months ago

We had the same issue and found that these jobs can be cleaned up automatically by using the .spec.ttlSecondsAfterFinished property in the job definition.

Reference from k8s: https://kubernetes.io/docs/concepts/workloads/controllers/job/#clean-up-finished-jobs-automatically
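For illustration, a minimal sketch of the TTL approach, assuming a hypothetical optional value such as .Values.migrate.ttlSecondsAfterFinished (the actual value name and wiring in the PR may differ):

```yaml
# Sketch of the migrate Job with an optional TTL. Once the job finishes, the
# TTL controller deletes it after the given number of seconds, so a later
# `helm upgrade` creates a fresh Job instead of patching the immutable one.
apiVersion: batch/v1
kind: Job
metadata:
  name: openfga-migrate
spec:
  {{- with .Values.migrate.ttlSecondsAfterFinished }}
  ttlSecondsAfterFinished: {{ . }}
  {{- end }}
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate-database
          image: openfga/openfga:v1.3.1
          args: ["migrate"]
```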

Made a quick PR for it here, just waiting for review: https://github.com/openfga/helm-charts/pull/96

We found this to be a lower-risk, backwards-compatible approach (it's an optional parameter) that we've tested and have working, and it keeps the jobs from conflicting the way they do in the error you're seeing.

I'm sure both methods achieve the desired behavior, though I would think having both as an option is better than one or none. :)

jliedy commented 10 months ago

That's a viable option if you're still using v1.6 of the k8s-wait-for task, since it returns OK either when the job doesn't exist or when the job has succeeded. That's usually fine, but you run the risk of your application starting incorrectly if the migration fails and the job is then deleted. On v2.0, the task only succeeds if it can verify that the named job exists and has run successfully. The risk there is that if you run an update/upgrade and the migration job fails, any pods that start (for example during an HPA scaling event) or restart will fail to come up until the migration job has run successfully.
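For context, a rough sketch of how an application pod can gate on the migration job with a k8s-wait-for initContainer; the image tag and argument form below are assumptions, not taken from the chart:

```yaml
# Fragment of a Deployment pod spec. The initContainer blocks application
# startup until k8s-wait-for reports the migration job as done. With v1.6
# semantics the wait also passes if the job is absent; with v2.0 semantics it
# only passes once the named job exists and has completed successfully.
spec:
  template:
    spec:
      initContainers:
        - name: wait-for-migration
          image: groundnuty/k8s-wait-for:v2.0  # or v1.6, per the discussion above
          args:
            - "job"
            - "openfga-migrate"
      containers:
        - name: openfga
          image: openfga/openfga:v1.3.1
          args: ["run"]
```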