tdcox closed this issue 5 years ago.
@rawlingsj I have just observed this happen on a fresh cluster. It looks like the cluster auto-scaled down from three to two running nodes, triggering a restart of a number of Pods as they were flushed from the terminating node. After this, I ended up with one working deck pod and one in a crash loop.
The failed pod is repeating this error once per second, so it should probably have a circuit breaker too:
{"component":"deck","error":"invalid presubmit job promotion-build: agent must be one of jenkins, knative-build, knative-pipeline-run, kubernetes (found \"tekton\")","jobConfig":"","level":"error","msg":"Error loading config.","prowConfig":"/etc/config/config.yaml","time":"2019-03-18T16:13:05Z"}
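The log line suggests deck validates each job's agent field against a fixed allow-list at config load time and treats any other value as fatal. A minimal Python sketch of that kind of check, mirroring the behaviour the error message describes (this is an illustrative re-creation, not Prow's actual Go implementation):

```python
# Hypothetical re-creation of the validation the log line describes; the
# allow-list is copied verbatim from the error and notably excludes "tekton".
ALLOWED_AGENTS = ("jenkins", "knative-build", "knative-pipeline-run", "kubernetes")

def validate_presubmit(name, agent):
    """Raise ValueError for any agent outside the allow-list."""
    if agent not in ALLOWED_AGENTS:
        raise ValueError(
            'invalid presubmit job %s: agent must be one of %s (found "%s")'
            % (name, ", ".join(ALLOWED_AGENTS), agent)
        )

try:
    validate_presubmit("promotion-build", "tekton")
except ValueError as err:
    print(err)  # same shape as the deck error above
```

Because the check rejects the whole config, a single unsupported agent value is enough to keep deck from starting at all.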
Confirmed. If you scale the Deployment for deck down to zero and back up again, it fails to recover. Oops!
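The failing pod retries the config load once per second indefinitely. The circuit-breaker suggestion above essentially amounts to backing off between reload attempts; a hedged sketch of what that could look like (hypothetical helper, not deck's actual reload code):

```python
import time

def reload_with_backoff(load_config, base=1.0, max_delay=300.0):
    """Retry a failing config loader with exponential backoff instead of a
    fixed once-per-second loop. load_config is a hypothetical stand-in for
    deck's config reload; it is retried until it stops raising."""
    delay = base
    while True:
        try:
            return load_config()
        except Exception as err:
            print(f"Error loading config: {err}; retrying in {delay:.2f}s")
            time.sleep(delay)
            delay = min(delay * 2, max_delay)  # cap the growth at max_delay
```

This would not fix the invalid config, but it would stop the pod from spamming the log at a fixed one-second cadence while it waits for a corrected ConfigMap.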
I am having the same issue when installing a NextGen cluster with jx install --provider gke --ng (Tekton, Vault and no Tiller).
Anyway, I can see in the ConfigMap config that the specified agent is tekton, but it seems that it is not supported, or not recognized as a valid value:
$ kubectl get cm config -o yaml
apiVersion: v1
data:
  config.yaml: |
    [...]
    deck:
      spyglass: {}
    gerrit: {}
    owners_dir_blacklist:
      default: null
      repos: null
    plank: {}
    pod_namespace: jx
    postsubmits:
      dcanadillas-kube/environment-jx-nextgen-production:
      - agent: tekton
        branches:
        - master
        context: ""
        name: promotion
      dcanadillas-kube/environment-jx-nextgen-staging:
      - agent: tekton
        branches:
        - master
        context: ""
        name: promotion
      jenkins-x/dummy:
      - agent: tekton
        branches:
        - master
        context: ""
        name: release
    presubmits:
      dcanadillas-kube/environment-jx-nextgen-production:
      - agent: tekton
        always_run: true
        context: promotion-build
        name: promotion-build
        rerun_command: /test this
        trigger: (?m)^/test( all| this),?(\s+|$)
      dcanadillas-kube/environment-jx-nextgen-staging:
      - agent: tekton
        contexts:
        always_run: true
        context: promotion-build
        name: promotion-build
        rerun_command: /test this
        trigger: (?m)^/test( all| this),?(\s+|$)
      jenkins-x/dummy:
      - agent: tekton
        always_run: true
        context: serverless-jenkins
        name: serverless-jenkins
        rerun_command: /test this
        trigger: (?m)^/test( all| this),?(\s+|$)
    prowjob_namespace: jx
    push_gateway: {}
    sinker: {}
    tide:
    [...]
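Given a dump like the one above, the offending entries can be found mechanically. A small sketch that scans rendered config text for agent values outside the set named in deck's error (plain string matching is used here to avoid a YAML-parser dependency; the allow-list is copied from the log line earlier in this thread):

```python
import re

# Allow-list copied from the deck error message earlier in this thread.
ALLOWED_AGENTS = {"jenkins", "knative-build", "knative-pipeline-run", "kubernetes"}

def unsupported_agents(config_text):
    """Return agent values in a rendered config.yaml that deck would reject."""
    found = set(re.findall(r"agent:\s*(\S+)", config_text))
    return found - ALLOWED_AGENTS

sample = """\
presubmits:
  jenkins-x/dummy:
  - agent: tekton
    name: serverless-jenkins
"""
print(unsupported_agents(sample))  # {'tekton'}
```

Run against the ConfigMap above, every job would be flagged, since all of them declare agent: tekton.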
Could it be related to Prow not supporting Tekton Pipelines? (https://github.com/tektoncd/pipeline/issues/537)
I'm having the same issue when using jx install --prow=true --tekton=true --provider=eks
Has anyone found a way to resolve this?
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Provide feedback via https://jenkins-x.io/community.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Provide feedback via https://jenkins-x.io/community.
/lifecycle rotten
Summary
I am looking at a test cluster that is about 15 hours old. It was created and had a single golang-http quickstart executed against it shortly after setup.
I am seeing two deck pods, one of which has been in CrashLoopBackOff for a long time:
The failing pod reports:
And the container log is:
Steps to reproduce the behavior
Create cluster instance with:
Then run a single quickstart:
Expected behavior
Pod to restart after failure.
Actual behavior
Zombie Pod
Jx version
The output of jx version is:
Jenkins type