alequint opened this issue 1 year ago
Below is a detailed investigation of how I got to this problem:
When subscribing to the Red Hat OpenShift Pipelines latest channel, currently pointing to openshift-pipelines-operator-rh.v1.11.0, the installation never completes.
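For context, this is roughly how we check what the subscription has resolved to. This is a minimal sketch; the subscription name and the openshift-operators namespace are assumptions based on a default cluster-wide install and may differ on other clusters:

# which CSV the subscription currently points at vs. what is actually installed
oc get subscription openshift-pipelines-operator-rh -n openshift-operators -o jsonpath='{.status.currentCSV}{"\n"}'
oc get subscription openshift-pipelines-operator-rh -n openshift-operators -o jsonpath='{.status.installedCSV}{"\n"}'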
CSV Conditions:
conditions:
  - lastTransitionTime: '2023-06-26T22:34:08Z'
    lastUpdateTime: '2023-06-26T22:34:08Z'
    message: requirements not yet checked
    phase: Pending
    reason: RequirementsUnknown
  - lastTransitionTime: '2023-06-26T22:34:08Z'
    lastUpdateTime: '2023-06-26T22:34:08Z'
    message: one or more requirements couldn't be found
    phase: Pending
    reason: RequirementsNotMet
  - lastTransitionTime: '2023-06-26T22:34:10Z'
    lastUpdateTime: '2023-06-26T22:34:10Z'
    message: 'all requirements found, attempting install'
    phase: InstallReady
    reason: AllRequirementsMet
  - lastTransitionTime: '2023-06-26T22:34:10Z'
    lastUpdateTime: '2023-06-26T22:34:10Z'
    message: waiting for install components to report healthy
    phase: Installing
    reason: InstallSucceeded
  - lastTransitionTime: '2023-06-26T22:34:10Z'
    lastUpdateTime: '2023-06-26T22:34:10Z'
    message: >-
      installing: waiting for deployment openshift-pipelines-operator to
      become ready: deployment "openshift-pipelines-operator" not available:
      Deployment does not have minimum availability.
    phase: Installing
    reason: InstallWaiting
  - lastTransitionTime: '2023-06-26T22:34:27Z'
    lastUpdateTime: '2023-06-26T22:34:27Z'
    message: install strategy completed with no errors
    phase: Succeeded
    reason: InstallSucceeded
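For completeness, the CSV phase and condition history above can be pulled directly; on our cluster the CSV lands in openshift-operators, but that namespace is an assumption worth double-checking on yours:

oc get csv openshift-pipelines-operator-rh.v1.11.0 -n openshift-operators -o jsonpath='{.status.phase}{"\n"}'
# full object, including the conditions shown above
oc get csv openshift-pipelines-operator-rh.v1.11.0 -n openshift-operators -o yaml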
Looking at the TektonConfig CR:
conditions:
  - lastTransitionTime: '2023-06-26T22:39:47Z'
    message: >-
      Components not in ready state: TektonTrigger: reconcile again and
      proceed
    reason: Error
    status: 'False'
    type: ComponentsReady
  - lastTransitionTime: '2023-06-26T22:34:50Z'
    status: Unknown
    type: PostInstall
  - lastTransitionTime: '2023-06-26T22:34:57Z'
    status: 'True'
    type: PreInstall
  - lastTransitionTime: '2023-06-26T22:39:47Z'
    message: >-
      Components not in ready state: TektonTrigger: reconcile again and
      proceed
    reason: Error
    status: 'False'
    type: Ready
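For anyone trying to reproduce this: TektonConfig is cluster-scoped and the operator-managed instance is named config (the same key that shows up in the operator logs below), so the conditions can be dumped with:

oc get tektonconfig config -o jsonpath='{.status.conditions}{"\n"}'
# or the full object
oc get tektonconfig config -o yaml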
Looking at the pods, we see the webhook and operator pods ready with all containers up… but some errors are logged. Example from tekton-operator-webhook:
{"level":"error","logger":"tekton-operator-webhook.ConfigMapWebhook","caller":"controller/controller.go:566","msg":"Reconcile error","commit":"c8ef1db","knative.dev/traceid":"300327b4-0334-4348-88a2-51e37174d622","knative.dev/key":"config.webhook.operator.tekton.dev","duration":0.000074601,"error":"secret \"tekton-operator-webhook-certs\" is missing \"ca-cert.pem\" key","stacktrace":"knative.dev/pkg/controller.(*Impl).handleErr\n\t/go/src/github.com/tektoncd/operator/vendor/knative.dev/pkg/controller/controller.go:566\nknative.dev/pkg/controller.(*Impl).processNextWorkItem\n\t/go/src/github.com/tektoncd/operator/vendor/knative.dev/pkg/controller/controller.go:543\nknative.dev/pkg/controller.(*Impl).RunContext.func3\n\t/go/src/github.com/tektoncd/operator/vendor/knative.dev/pkg/controller/controller.go:491"}
… the ca-cert.pem key is missing from the tekton-operator-webhook-certs secret, an item that seems related to the certificate authority, and this prevents the reconcile. The current state of this secret shows the key is actually there now, so this was probably just a temporary error while the secret was not yet populated (this is the last occurrence of an error being logged in the pod):
data:
  ca-cert.pem: >-
    xxxx
  server-cert.pem: >-
    xxxx
  server-key.pem: >-
    xxxx
type: Opaque
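To rule out a lingering problem with the webhook certs, the secret keys can be listed without dumping the certificate contents. The webhook pod runs alongside the operator in openshift-operators on our cluster, but treat that namespace as an assumption:

# describe shows only the keys and their sizes, not the values
oc describe secret tekton-operator-webhook-certs -n openshift-operators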
Now looking at the openshift-pipelines-operator pod logs:
{"level":"debug","logger":"tekton-operator-lifecycle","caller":"controller/controller.go:562","msg":"Requeuing key config (by request) after 10s (depth: 0)","commit":"1d48540","knative.dev/pod":"openshift-pipelines-operator-6fb78797c5-mhszq","knative.dev/controller":"github.com.tektoncd.operator.pkg.reconciler.shared.tektonconfig.Reconciler","knative.dev/kind":"operator.tekton.dev.TektonConfig","knative.dev/traceid":"136b2c0d-3315-48f9-a744-2f5189458185","knative.dev/key":"config"}
{"level":"debug","logger":"tekton-operator-lifecycle","caller":"controller/controller.go:513","msg":"Processing from queue config (depth: 0)","commit":"1d48540","knative.dev/pod":"openshift-pipelines-operator-6fb78797c5-mhszq","knative.dev/controller":"github.com.tektoncd.operator.pkg.reconciler.shared.tektonconfig.Reconciler","knative.dev/kind":"operator.tekton.dev.TektonConfig"}
{"level":"info","logger":"tekton-operator-lifecycle","caller":"tektonconfig/tektonconfig.go:101","msg":"Reconciling TektonConfig","commit":"1d48540","knative.dev/pod":"openshift-pipelines-operator-6fb78797c5-mhszq","knative.dev/controller":"github.com.tektoncd.operator.pkg.reconciler.shared.tektonconfig.Reconciler","knative.dev/kind":"operator.tekton.dev.TektonConfig","knative.dev/traceid":"bab2a6bf-4f50-42ce-880b-f55cb153b570","knative.dev/key":"config","status":{"conditions":[{"type":"ComponentsReady","status":"False","lastTransitionTime":"2023-06-26T22:39:47Z","reason":"Error","message":"Components not in ready state: TektonTrigger: reconcile again and proceed"},{"type":"PostInstall","status":"Unknown","lastTransitionTime":"2023-06-26T22:34:50Z"},{"type":"PreInstall","status":"True","lastTransitionTime":"2023-06-26T22:34:57Z"},{"type":"Ready","status":"False","lastTransitionTime":"2023-06-26T22:39:47Z","reason":"Error","message":"Components not in ready state: TektonTrigger: reconcile again and proceed"...
{"level":"debug","logger":"tekton-operator-lifecycle","caller":"common/targetnamespace.go:39","msg":"reconciling target namespace","commit":"1d48540","knative.dev/pod":"openshift-pipelines-operator-6fb78797c5-mhszq","knative.dev/controller":"github.com.tektoncd.operator.pkg.reconciler.shared.tektonconfig.Reconciler","knative.dev/kind":"operator.tekton.dev.TektonConfig","knative.dev/traceid":"bab2a6bf-4f50-42ce-880b-f55cb153b570","knative.dev/key":"config","targetNamespace":"openshift-pipelines"}
{"level":"debug","logger":"tekton-operator-lifecycle","caller":"controller/controller.go:562","msg":"Requeuing key config...
TektonTrigger is indeed reporting an error. Looking at the CR, we have:
status:
  conditions:
    - lastTransitionTime: '2023-06-26T22:39:47Z'
      status: 'True'
      type: DependenciesInstalled
    - lastTransitionTime: '2023-06-26T22:39:56Z'
      status: 'True'
      type: InstallerSetAvailable
    - lastTransitionTime: '2023-06-26T22:43:51Z'
      message: >-
        Installer set not ready: Main Reconcilation failed: TektonTrigger/main:
        installer set not ready, will retry: Deployment:
        tekton-triggers-core-interceptors deployment not ready
      reason: Error
      status: 'False'
      type: InstallerSetReady
    - lastTransitionTime: '2023-06-26T22:39:47Z'
      status: Unknown
      type: PostReconciler
    - lastTransitionTime: '2023-06-26T22:39:47Z'
      status: 'True'
      type: PreReconciler
    - lastTransitionTime: '2023-06-26T22:43:51Z'
      message: >-
        Installer set not ready: Main Reconcilation failed: TektonTrigger/main:
        installer set not ready, will retry: Deployment:
        tekton-triggers-core-interceptors deployment not ready
      reason: Error
      status: 'False'
      type: Ready
  version: v0.24.1
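The status above is from the cluster-scoped TektonTrigger CR, which can be inspected without knowing its exact name:

oc get tektontrigger
oc get tektontrigger -o jsonpath='{.items[0].status.conditions}{"\n"}'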
The tekton-triggers-core-interceptors Deployment expects 1 replica that never becomes ready. The tekton-triggers-core-interceptors pod in the openshift-pipelines namespace repeatedly shows a warning in the last lines of its log:
{"level":"warn","ts":1687830914.5738263,"caller":"server/server.go:302","msg":"server key missing"}
2023/06/27 01:55:14 http: TLS handshake error from 10.113.225.194:33949: server key missing
… but we have some actual errors in the first messages logged in the pod:
{"level":"error","ts":1687819234.5741854,"caller":"server/server.go:297","msg":"failed to fetch secret secret \"tekton-triggers-core-interceptors-certs\" not found","stacktrace":"github.com/tektoncd/triggers/pkg/interceptors/server.GetTLSData\n\t/go/src/github.com/tektoncd/triggers/pkg/interceptors/server/server.go:297\nmain.startServer.func2\n\t/go/src/github.com/tektoncd/triggers/cmd/interceptors/main.go:114\ncrypto/tls.(*Config).getCertificate\n\t/usr/lib/golang/src/crypto/tls/common.go:1073\ncrypto/tls.(*serverHandshakeStateTLS13).pickCertificate\n\t/usr/lib/golang/src/crypto/tls/handshake_server_tls13.go:368\ncrypto/tls.(*serverHandshakeStateTLS13).handshake\n\t/usr/lib/golang/src/crypto/tls/handshake_server_tls13.go:55\ncrypto/tls.(*Conn).serverHandshake\n\t/usr/lib/golang/src/crypto/tls/handshake_server.go:54\ncrypto/tls.(*Conn).handshakeContext\n\t/usr/lib/golang/src/crypto/tls/conn.go:1490\ncrypto/tls.(*Conn).HandshakeContext\n\t/usr/lib/golang/src/crypto/tls/conn.go:1433\nnet/http.(*conn).serve\n...
2023/06/26 22:40:34 http: TLS handshake error from 10.113.225.194:37373: secret "tekton-triggers-core-interceptors-certs" not found
{"level":"error","ts":1687819244.5725904,"caller":"server/server.go:297","msg":"failed to fetch secret secret \"tekton-triggers-core-interceptors-certs\" not found","stacktrace":"github.com/tektoncd/triggers/pkg/interceptors/server.GetTLSData\n\t/go/src/github.com/tektoncd/triggers/pkg/interceptors/server/server.go:297\nmain.startServer.func2\n\t/go/src/github.com/tektoncd/triggers/cmd/interceptors/main.go:114\ncrypto/tls.(*Config).getCertificate\n\t/usr/lib/golang/src/crypto/tls/common.go:1073\ncrypto/tls.(*serverHandshakeStateTLS13).pickCertificate\n\t/usr/lib/golang/src/crypto/tls/handshake_server_tls13.go:368\ncrypto/tls.(*serverHandshakeStateTLS13).handshake\n\t/usr/lib/golang/src/crypto/tls/handshake_server_tls13.go:55\ncrypto/tls.(*Conn).serverHandshake\n\t/usr/lib/golang/src/crypto/tls/handshake_server.go:54\ncrypto/tls.(*Conn).handshakeContext\n\t/usr/lib/golang/src/crypto/tls/conn.go:1490\ncrypto/tls.(*Conn).HandshakeContext\n\t/usr/lib/golang/src/crypto/tls/conn.go:1433\nnet/http.(*conn).serve\n...
2023/06/26 22:40:44 http: TLS handshake error from 10.113.225.194:15493: secret "tekton-triggers-core-interceptors-certs" not found
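The deployment status and the log snippets above can be reproduced with, for example:

oc get deployment tekton-triggers-core-interceptors -n openshift-pipelines
oc logs deployment/tekton-triggers-core-interceptors -n openshift-pipelines --tail=100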
The secret exists, but it is empty:
kind: Secret
apiVersion: v1
metadata:
  annotations:
    operator.tekton.dev/last-applied-hash: 690e03e6f63f2ea2c6aef2cb04bf95873bf0885667c29e58090578550868c439
  resourceVersion: '25482'
  name: tekton-triggers-core-interceptors-certs
  uid: 2c2b6f05-1fcf-4ff0-ac8a-375084ef4056
  creationTimestamp: '2023-06-26T22:42:35Z'
  namespace: openshift-pipelines
  ownerReferences:
    - apiVersion: operator.tekton.dev/v1alpha1
      kind: TektonInstallerSet
      name: trigger-main-static-94w48
      uid: 953cb70b-1e8b-428a-8a53-c88ee1d86b5d
      controller: true
      blockOwnerDeletion: true
  labels:
    app.kubernetes.io/component: interceptors
    app.kubernetes.io/instance: default
    app.kubernetes.io/name: core-interceptors
    app.kubernetes.io/part-of: tekton-triggers
    operator.tekton.dev/operand-name: tektoncd-triggers
    triggers.tekton.dev/release: v0.24.1
type: Opaque
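Note there is no data section at all. A quick way to confirm that (on the broken cluster this prints nothing for the first command, and the Data section of describe is empty):

oc get secret tekton-triggers-core-interceptors-certs -n openshift-pipelines -o jsonpath='{.data}{"\n"}'
oc describe secret tekton-triggers-core-interceptors-certs -n openshift-pipelines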
Expected Behavior

Have the OpenShift Pipelines operator installed successfully.

Actual Behavior

The OpenShift Pipelines operator never reaches a successful state.
Steps to Reproduce the Problem

This issue happens intermittently, unfortunately. The only step needed to reproduce it is to install the latest OpenShift Pipelines (the channel currently uses the openshift-pipelines-operator-rh.v1.11.0 bundle).

Additional Info
OpenShift 4.12

Output of oc version:

Tekton Pipeline version

Output of tkn version:

Output of kubectl get pods -n openshift-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}':
Initial analysis of the problem

The secret tekton-triggers-core-interceptors-certs, in the openshift-pipelines namespace, is created empty. Compare this with the secret created in an environment where the Pipelines operator was installed successfully.

The absence of data in this secret prevents TektonTrigger reconciliation, which in turn prevents TektonConfig from being installed successfully... so the pipelines are never installed, and the issue causes TLS handshake errors in the tekton-triggers-core-interceptors Deployment's pod.

Important: we are installing OpenShift Pipelines in an IBM Cloud Red Hat OpenShift Service cluster. We have been using this setup for a long time already, and this issue used to happen occasionally - once every two weeks perhaps - but recently these problems have become more frequent.
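For the comparison with a healthy environment mentioned above, something like the following makes the difference obvious; <healthy> and <broken> are placeholder kubeconfig context names for the two clusters:

oc --context <healthy> describe secret tekton-triggers-core-interceptors-certs -n openshift-pipelines
oc --context <broken> describe secret tekton-triggers-core-interceptors-certs -n openshift-pipelines
# on the healthy cluster the Data section lists the cert/key entries; on the broken one it is empty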
Any thoughts on why the secret is being created without any data?