Deck pod in crash loop when using Tekton

tdcox commented 5 years ago

Summary

I am looking at a test cluster that is about 15 hours old. A single golang-http quickstart was run against it shortly after setup.

I am seeing two deck pods, one of which is in a long-term CrashLoopBackOff:

jx            deck-5fbbdc9478-kpxgc                                 1/1     Running            3          15h
jx            deck-5fbbdc9478-xt8k5                                 0/1     CrashLoopBackOff   177        15h

The failing pod reports:

    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Fri, 15 Mar 2019 09:09:32 +0000
      Finished:     Fri, 15 Mar 2019 09:09:32 +0000
    Ready:          False
    Restart Count:  177

And the container log is:

time="2019-03-15T08:59:17Z" level=info msg="Spyglass registered viewer buildlog with title Build Log."
time="2019-03-15T08:59:17Z" level=info msg="Spyglass registered viewer junit with title JUnit."
time="2019-03-15T08:59:17Z" level=info msg="Spyglass registered viewer metadata with title Metadata."
{"component":"deck","error":"invalid presubmit job promotion-build: agent must be one of jenkins, knative-build, knative-pipeline-run, kubernetes (found \"tekton\")","level":"fatal","msg":"Error starting config agent.","time":"2019-03-15T08:59:17Z"}
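
The fatal log line above suggests that deck rejects the whole Prow config as soon as one job names an agent it does not recognise. The following is a minimal sketch of that kind of check, not Prow's actual code (Prow is written in Go). The accepted agent names are copied from the error message; the function names `validate_agent` and `load_config` and the job dictionary layout are hypothetical.

```python
# Sketch only: this is NOT Prow's source code.
# The accepted agent names are copied from the error message in the log above.
SUPPORTED_AGENTS = ("jenkins", "knative-build", "knative-pipeline-run", "kubernetes")


def validate_agent(job_type, job_name, agent):
    """Raise ValueError if the job's agent is not in the accepted list."""
    if agent not in SUPPORTED_AGENTS:
        allowed = ", ".join(SUPPORTED_AGENTS)
        raise ValueError(
            f'invalid {job_type} job {job_name}: '
            f'agent must be one of {allowed} (found "{agent}")'
        )


def load_config(jobs):
    """Validate every job; a single bad entry rejects the whole config."""
    for job in jobs:
        validate_agent(job["type"], job["name"], job["agent"])


if __name__ == "__main__":
    # Hypothetical in-memory form of the presubmit named in the log.
    jobs = [{"type": "presubmit", "name": "promotion-build", "agent": "tekton"}]
    try:
        load_config(jobs)
    except ValueError as err:
        print(f"Error starting config agent: {err}")
```

Under this assumption, every restart of the pod reads the same config and fails the same way, which would match the restart count of 177.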

Steps to reproduce the behavior

Create cluster instance with:

jx create cluster gke \
--cluster-name='d23' \
--default-admin-password='xxxxxx' \
--environment-git-owner='tdcox' \
--enhanced-apis=true \
--enhanced-scopes=true \
--git-username='tdcox' \
--git-private=false \
--kaniko=true \
--labels='demo=true' \
--machine-type='n1-standard-4' \
--max-num-nodes='3' \
--min-num-nodes='2' \
--no-tiller=true \
--preemptible=true \
--project-id='jx-mar19' \
--prow=true \
--skip-login=true \
--tekton=true \
--zone='europe-west1-d'

Then run a single quickstart:

➜ jx create quickstart
Using Git provider GitHub at https://github.com
? Do you wish to use tdcox as the Git user name? Yes

About to create repository  on server https://github.com with user tdcox
? Which organisation do you want to use? tdcox
? Enter the new repository name:  test107

Creating repository tdcox/test107
? select the quickstart you wish to create golang-http
Generated quickstart at /Users/terry/Documents/code/jxtesting/test107
### NO charts folder /Users/terry/Documents/code/jxtesting/test107/charts/golang-http
Created project at /Users/terry/Documents/code/jxtesting/test107

The directory /Users/terry/Documents/code/jxtesting/test107 is not yet using git
? Would you like to initialise git now? Yes
? Commit message:  Initial import

Git repository created
performing pack detection in folder /Users/terry/Documents/code/jxtesting/test107
--> Draft detected Go (65.746753%)
selected pack: /Users/terry/.jx/draft/packs/github.com/jenkins-x-buildpacks/jenkins-x-kubernetes/packs/go
replacing placeholders in directory /Users/terry/Documents/code/jxtesting/test107
app name: test107, git server: github.com, org: tdcox, Docker registry org: tdcox
skipping directory "/Users/terry/Documents/code/jxtesting/test107/.git"
Pushed Git repository to https://github.com/tdcox/test107

Creating GitHub webhook for tdcox/test107 for url http://hook.jx.35.241.195.78.nip.io/hook

Watch pipeline activity via:    jx get activity -f test107 -w
Browse the pipeline log via:    jx get build logs tdcox/test107/master
Open the Jenkins console via    jx console
You can list the pipelines via: jx get pipelines
When the pipeline is complete:  jx get applications

For more help on available commands see: https://jenkins-x.io/developing/browsing/

Note that your first pipeline may take a few minutes to start while the necessary images get downloaded!

Expected behavior

The pod restarts successfully after a failure.

Actual behavior

Zombie pod: it stays in CrashLoopBackOff and never recovers.

Jx version

The output of jx version is:

NAME               VERSION
jx                 1.3.974
jenkins x platform 0.0.3535
Kubernetes cluster v1.11.7-gke.4
kubectl            v1.13.4
helm client        Client: v2.13.0+g79d0794
git                git version 2.21.0
Operating System   Mac OS X 10.13.6 build 17G4015

Jenkins type

[ ] Classic Jenkins
[x] Serverless Jenkins

tdcox commented 5 years ago

@rawlingsj I have just observed this happen on a fresh cluster. It looks like the cluster auto-scaled down from three to two running nodes, which triggered a restart of a number of pods as they were flushed from the terminating node. After this, I ended up with one working deck pod and one in a crash loop.

The failing pod repeats this error once per second, so it should probably have a circuit breaker too.

{"component":"deck","error":"invalid presubmit job promotion-build: agent must be one of jenkins, knative-build, knative-pipeline-run, kubernetes (found \"tekton\")","jobConfig":"","level":"error","msg":"Error loading config.","prowConfig":"/etc/config/config.yaml","time":"2019-03-18T16:13:05Z"}
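
To illustrate the circuit breaker idea mentioned above, here is a minimal sketch. It is not how deck is implemented. The class name `CircuitBreaker`, its parameters, and the assumption that the config reload runs in a retry loop are all hypothetical. After a set number of consecutive failures the breaker stops attempting the operation for a cooldown period, then allows a single trial call.

```python
import time


class CircuitBreaker:
    """Skip a failing operation for `cooldown` seconds after
    `max_failures` consecutive failures (hypothetical sketch)."""

    def __init__(self, max_failures=3, cooldown=60.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.clock = clock          # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at = None       # time the breaker tripped, None if closed

    def call(self, operation):
        """Run operation() unless the breaker is open.

        Returns (ran, result); on failure, result is the exception raised.
        """
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                return False, None  # still open: skip the attempt entirely
            # Cooldown elapsed: half-open, so one failed trial re-opens it.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = operation()
        except Exception as exc:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return True, exc
        self.failures = 0
        return True, result
```

With `max_failures=3` and `cooldown=60.0`, a config that never validates would be retried about once a minute after the first three attempts, instead of once per second.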

tdcox commented 5 years ago

Confirmed. If you scale the Deployment for deck down to zero and back up again, it fails to recover. Oops!

dcanadillas commented 5 years ago

I am having the same issue when installing a NextGen cluster with jx install --provider gke --ng (Tekton, Vault and no Tiller).

I can see in the "config" ConfigMap that the agent specified is tekton, but it seems that this value is not supported or not recognized as valid:

$ kubectl get cm config -o yaml
apiVersion: v1
data:
  config.yaml: |

 [...]

    deck:
      spyglass: {}
    gerrit: {}
    owners_dir_blacklist:
      default: null
      repos: null
    plank: {}
    pod_namespace: jx
    postsubmits:
      dcanadillas-kube/environment-jx-nextgen-production:
      - agent: tekton
        branches:
        - master
        context: ""
        name: promotion
      dcanadillas-kube/environment-jx-nextgen-staging:
      - agent: tekton
        branches:
        - master
        context: ""
        name: promotion
      jenkins-x/dummy:
      - agent: tekton
        branches:
        - master
        context: ""
        name: release
    presubmits:
      dcanadillas-kube/environment-jx-nextgen-production:
      - agent: tekton
        always_run: true
        context: promotion-build
        name: promotion-build
        rerun_command: /test this
        trigger: (?m)^/test( all| this),?(\s+|$)
      dcanadillas-kube/environment-jx-nextgen-staging:
      - agent: tekton
        always_run: true
        context: promotion-build
        name: promotion-build
        rerun_command: /test this
        trigger: (?m)^/test( all| this),?(\s+|$)
      jenkins-x/dummy:
      - agent: tekton
        always_run: true
        context: serverless-jenkins
        name: serverless-jenkins
        rerun_command: /test this
        trigger: (?m)^/test( all| this),?(\s+|$)
    prowjob_namespace: jx
    push_gateway: {}
    sinker: {}
    tide:

[...]

Could it be related to Prow not supporting Tekton Pipelines? (https://github.com/tektoncd/pipeline/issues/537)
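
To check a dumped config for the entries deck would reject, a rough scan like the one below may help. It is a sketch under stated assumptions: the accepted agent names are copied from deck's error message earlier in this thread, the function name `unsupported_agents` is hypothetical, and the scan is a line-based regex rather than a real YAML parser, so it only handles simple `agent: value` lines. The repository `example/other` in the sample is made up for illustration.

```python
import re

# Accepted agent names, copied from deck's error message in this thread.
SUPPORTED_AGENTS = {"jenkins", "knative-build", "knative-pipeline-run", "kubernetes"}

# Matches lines such as "- agent: tekton" or "  agent: kubernetes".
AGENT_LINE = re.compile(r"^\s*(?:-\s*)?agent:\s*(\S+)\s*$")


def unsupported_agents(config_text):
    """Return {agent: count} for agent values not in SUPPORTED_AGENTS."""
    counts = {}
    for line in config_text.splitlines():
        match = AGENT_LINE.match(line)
        if not match:
            continue
        agent = match.group(1).strip("'\"")  # tolerate quoted values
        if agent not in SUPPORTED_AGENTS:
            counts[agent] = counts.get(agent, 0) + 1
    return counts


if __name__ == "__main__":
    # Shortened sample; "example/other" is a hypothetical repository.
    sample = """\
presubmits:
  jenkins-x/dummy:
  - agent: tekton
    always_run: true
    name: serverless-jenkins
  example/other:
  - agent: kubernetes
    name: unit
"""
    print(unsupported_agents(sample))
```

The text to scan could come from saving the `config.yaml` key of the ConfigMap shown above to a file and reading it in.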

tsahiduek commented 5 years ago

I'm having the same issue when using jx install --prow=true --tekton=true --provider=eks

Has anyone found a way to resolve this?

jenkins-x-bot commented 5 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Provide feedback via https://jenkins-x.io/community. /lifecycle stale

jenkins-x-bot commented 5 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close. Provide feedback via https://jenkins-x.io/community. /lifecycle rotten
