Even with the workaround (putting get-gateway-hostname in a separate phase), I still occasionally experience the issue.
@jieyu do you have a KUDO manager log for me, from the moment the deploy plan starts until it ends in the above error?
Ok, I believe I got to the bottom of it. The problem has nothing to do with the pipe task as such but rather with the usage of Jobs, although putting a job into the same step as a pipe task certainly triggers the problem more often (because the next reconciliation is scheduled much sooner).
The exact same problem is easier to replicate with the newly introduced manual plan trigger feature, so I'll use it to demonstrate.
Take any long-running job:
apiVersion: batch/v1
kind: Job
metadata:
  name: busy
spec:
  template:
    spec:
      containers:
        - name: busy
          image: busybox
          command: ["/bin/sh", "-c"]
          args: ["sleep infinity"]
      restartPolicy: Never
  backoffLimit: 3
Make it part of the deploy plan and install a dummy operator:
apiVersion: kudo.dev/v1beta1
name: "dummy"
operatorVersion: "0.1.0"
kubernetesVersion: 1.15.0
maintainers:
  - name: zen-dog
    url: https://kudo.dev
tasks:
  - name: job
    kind: Apply
    spec:
      resources:
        - job.yaml
plans:
  deploy:
    strategy: serial
    phases:
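The phases section above is cut off in the original comment. A minimal completion that wires the job task into a single phase and step could look like this (the phase and step names are my own placeholders, not from the original operator):

      - name: main
        strategy: serial
        steps:
          - name: run-job
            tasks:
              - job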
While the deploy plan is still running, we trigger it again with k kudo plan trigger --name deploy --instance dummy-instance. Note that you need to run the KUDO manager with ENABLE_WEBHOOKS=true for this command to work. Observe the problem:
PlanExecution: A transient error when executing task deploy.deploy.dummy.job. Will retry.
failed to patch default/busy: failed to execute patch:
Job.batch "busy" is invalid: spec.template: Invalid value: core.PodTemplateSpec{...}: field is immutable
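For completeness, the whole reproduction is roughly the following sequence (the ./dummy-operator path is my placeholder and the install flags may differ slightly between KUDO versions; the trigger command is the one quoted above):

# install the dummy operator and create an instance named dummy-instance
kubectl kudo install ./dummy-operator --instance dummy-instance

# while the deploy plan is still running, trigger it a second time
kubectl kudo plan trigger --name deploy --instance dummy-instance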
TL;DR: the issue boils down to the fact that the plan is reconciled faster than the previous Status is saved. This raciness in the operator pattern is well described in #1116. This second reconciliation leads to us trying to patch an existing Job pod template, which is immutable.
This problem exists for any plan that has jobs.
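The immutability part can be verified without KUDO at all; as a sketch, re-applying the job.yaml above with any change under spec.template reproduces the same error:

kubectl apply -f job.yaml
# change something under spec.template, e.g. args: ["sleep 3600"], then:
kubectl apply -f job.yaml
# -> Job.batch "busy" is invalid: spec.template: Invalid value: ...: field is immutable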
Fixed by this commit
Using 0.11.0-rc1.
Observed this issue in the KUDO controller log:
The job spec is here:
And the pipe task pod:
operator.yaml
If I put get-gateway-hostname in a separate phase, the problem disappears, so I suspect some kind of race condition. It looks like the job is created for some reason, and then a subsequent reconcile tries to patch it; a job cannot be patched once it has already completed (or so it appears).
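For reference, the workaround described here amounts to a plan layout roughly like the following, with the pipe task in its own phase ahead of the job (phase, step, and task names other than get-gateway-hostname are my guesses, not from the original operator.yaml):

plans:
  deploy:
    strategy: serial
    phases:
      - name: gateway
        strategy: serial
        steps:
          - name: get-hostname
            tasks:
              - get-gateway-hostname
      - name: main
        strategy: serial
        steps:
          - name: job
            tasks:
              - job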