operator-framework / operator-lifecycle-manager

A management framework for extending Kubernetes with Operators
https://olm.operatorframework.io
Apache License 2.0
1.7k stars 542 forks source link

Temporary outage in conversion webhooks while upgrading operator #2891

Open amisevsk opened 1 year ago

amisevsk commented 1 year ago

Bug Report

What did you do?

Upgrade OLM-installed operator that serves conversion webhooks. To reproduce:

  1. Create subscription for DevWorkspace Operator
    cat <<EOF | oc apply -f -                                                                                                                          
    apiVersion: operators.coreos.com/v1alpha1
    kind: Subscription
    metadata:
      name: devworkspace
      namespace: openshift-operators
    spec:
      channel: fast
      installPlanApproval: Manual
      name: devworkspace-operator
      source: redhat-operators
      sourceNamespace: openshift-marketplace
      startingCSV: devworkspace-operator.v0.16.0
    EOF
  2. Approve initial install onto cluster (version v0.16.0)
  3. Create DevWorkspace CR to trigger webhooks: oc apply -f https://github.com/devfile/devworkspace-operator/raw/main/samples/plain.yaml
  4. Open terminal and begin calling conversion webhooks in a loop: while true; do oc get devworkspaces.v1alpha1.workspace.devfile.io; sleep 0.5; done
  5. Approve update to Operator in OpenShift console (version v0.16.0-0.1666668361.p)

What did you expect to see?

Webhooks continue being served as operator deployment is rolled out to a new version.

What did you see instead? Under which circumstances?

Brief periods where conversion webhooks are unavailable during upgrade. While upgrade is in process, oc get command from reproducer logs

Error from server (InternalError): Internal error occurred: error resolving resource

and

Error from server: conversion webhook for workspace.devfile.io/v1alpha2, Kind=DevWorkspace failed: Post "https://devworkspace-controller-manager-service.openshift-operators.svc:443/convert?timeout=30s": x509: certificate signed by unknown authority (possibly because of "x509: ECDSA verification failure" while trying to verify candidate authority certificate "Red Hat, Inc.")

Operator CSV gets condition

{
    "lastTransitionTime": "2022-11-07T01:36:40Z",
    "lastUpdateTime": "2022-11-07T01:36:40Z",
    "message": "calculated deployment install is bad",
    "phase": "Pending",
    "reason": "NeedsReinstall"
},

Conversion webhooks breaking causes the cluster to register as unstable, which is a potential issue for monitoring.

Environment

Possible Solution Potentially an issue around certificates attached to conversion webhooks as CRDs are updated?

Additional context Add any other context about the problem here.

joelanford commented 1 year ago

@perdasilva potentially related to the service deletion bug you've been looking at?