openshift / openshift-velero-plugin

General Velero plugin for backup and restore of openshift workloads.
Apache License 2.0
47 stars 37 forks source link

OADP-3598 Dest NS Skip waiting for dockercfg secret if default SA do not need it. #228

Closed kaovilai closed 3 months ago

kaovilai commented 4 months ago

What: we were waiting for secret when it may never appears How: ocp 4.14+ have ability to turn off imageregistry which can affect secret from appearing Fix: check if secret is needed using same code used to determine if secret will be created

Test method: install oadp dpa with restic enabled for pod workload with ownerRefs (ie. Deployments) or lone pods for restic off. test workload ``` oc create ns test-new-ns oc run --image fedora --tty -n test-new-ns -i --attach=false --command sh ``` backup ``` velero backup create testnsbackup --include-namespaces=test-new-ns ``` restore ``` velero restore create --from-backup=testnsbackup --namespace-mappings=test-new-ns:test-new-ns-mapping ``` **With ImageRegistry capability disabled:**
Install 4.15 [Prepare aws account](https://docs.openshift.com/container-platform/4.14/installing/installing_aws/installing-aws-account.html#:~:text=If%20you%20are%20using%20a,an%20example%20high%20level%20procedure.) Get openshift-install cli from console.redhat.com ``` ❯ openshift-install version openshift-install 4.15.0 built from commit e8c70b7db9e64cafa7fe3182c365257c3e53fbb6 release image quay.io/openshift-release-dev/ocp-release@sha256:0da6316466d60a3a4535d5fed3589feb0391989982fba59d47d4c729912d6363 release architecture amd64 ``` Create install config with [customizations](https://docs.openshift.com/container-platform/4.15/installing/installing_aws/installing-aws-customizations.html#installation-initializing_installing-aws-customizations) ``` ~/ocp-installer ❯ openshift-install create install-config --dir customized-install-dir ``` Adding the following to customized-install-dir/install-config.yaml We're adding [capabilities](https://docs.openshift.com/container-platform/4.15/installing/cluster-capabilities.html#:~:text=v4.15-,Specify%20this%20option%20when%20you%20want%20to%20enable%20the%20default%20capabilities,CSISnapshot%2C%20NodeTuning%2C%20ImageRegistry%2C%20Build%2C%20and%20DeploymentConfig.,-None) all except ImageRegistry ```yaml capabilities: baselineCapabilitySet: None additionalEnabledCapabilities: - baremetal - MachineAPI - marketplace - OperatorLifecycleManager - openshift-samples - Console - Insights - Storage - CSISnapshot - CloudCredential # added from error, not documented on docs.openshift.com for 4.15 bug: https://issues.redhat.com/browse/OCPBUGS-30610 - NodeTuning # - ImageRegistry - Build - DeploymentConfig ``` ``` unset SSH_AUTH_SOCK # potentially needed for macos issue https://github.com/openshift/installer/issues/2349#issuecomment-534690793 openshift-install create cluster --dir customized-install-dir-4 --log-level=info ``` monitor install logs in another terminal ``` tail -f customized-install-dir-3/.openshift_install.log ```
Pre-requisites check Checking [enabled capabilities](https://docs.openshift.com/container-platform/4.15//installing/cluster-capabilities.html#:~:text=The%20default%20capabilities%20in%20OpenShift%20Container%20Platform%204.15,Insights%2C%20Storage%2C%20CSISnapshot%2C%20NodeTuning%2C%20ImageRegistry%2C%20Build%2C%20and%20DeploymentConfig.) ``` ~/ocp-installer ❯ oc get clusterversion version -o yaml | yq .status.capabilities.enabledCapabilities | grep ImageRegistry ``` Check sa for secret *NOT* referencing default-dockercfg- ``` ❯ oc get sa default -n default -oyaml apiVersion: v1 kind: ServiceAccount metadata: creationTimestamp: "2024-03-08T03:09:50Z" name: default namespace: default resourceVersion: "602" uid: 0843a945-ce0e-4e82-9cfb-7d0f2bcd1abc ```
**Restore log without fix (v1.3.0)** Restore stuck InProgress until PartiallyFailed with error ``` time="2024-03-08T20:23:05Z" level=error msg="Namespace test-new-ns-mapping, resource restore error: error preparing pods/test-new-ns-mapping/sh: rpc error: code = Unknown desc = Secret is not getting created" logSource="/remote-source/velero/app/pkg/controller/restore_controller.go:560" restore=openshift-adp/testnsbackup-20240308151805 ``` Installing fix Override plugin image used by changing oadp's CSV manager env to use `'ghcr.io/kaovilai/openshift-velero-plugin:[OADP-3598](https://issues.redhat.com//browse/OADP-3598)'` (unsupportedOverrides do not work in v1.3.0) ``` - name: RELATED_IMAGE_OPENSHIFT_VELERO_PLUGIN value: 'ghcr.io/kaovilai/openshift-velero-plugin:OADP-3598' ``` Double check velero pods now point to new ocp plugin container retry restore Expected: don't see error Actual: TBD Signed-off-by: Tiger Kaovilai
kaovilai commented 4 months ago

/cherry-pick oadp-1.3

openshift-cherrypick-robot commented 4 months ago

@kaovilai: once the present PR merges, I will cherry-pick it on top of oadp-1.3 in a new PR and assign it to you.

In response to [this](https://github.com/openshift/openshift-velero-plugin/pull/228#issuecomment-1981921245): >/cherry-pick oadp-1.3 Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
weshayutin commented 4 months ago

Thanks for jumping on this :)

kaovilai commented 4 months ago

/hold Testing with 4.15 cluster

shubham-pampattiwar commented 3 months ago

Tested the PR on OCP 4.15 with cluster capability ImageRegistry disabled:

shubham@fedora ~/oadp-operator -  (master) $ velero get restores
NAME                            BACKUP           STATUS      STARTED                         COMPLETED                       ERRORS   WARNINGS   CREATED                         SELECTOR
testnsbackup-1-20240312132828   testnsbackup-1   Completed   2024-03-12 13:28:29 -0700 PDT   2024-03-12 13:28:30 -0700 PDT   0        5          2024-03-12 13:28:29 -0700 PDT   <none>
shubham@fedora ~/oadp-operator -  (master) $ oc get po -n test-new-ns-mapping-1
NAME   READY   STATUS    RESTARTS   AGE
sh     1/1     Running   0          2m4s
shubham@fedora ~/oadp-operator -  (master) $ velero restore logs testnsbackup-1-20240312132828 | grep -i 'level=error'
shubham@fedora ~/oadp-operator -  (master) $ oc version
Client Version: 4.14.7
Kustomize Version: v5.0.1
Server Version: 4.15.0
Kubernetes Version: v1.28.6+6216ea1
shubham@fedora ~/oadp-operator -  (master) $ oc get deploy/velero -n openshift-adp | grep openshift-velero-plugin
shubham@fedora ~/oadp-operator -  (master) $ oc get deploy/velero -n openshift-adp -oyaml | grep openshift-velero-plugin
      - image: quay.io/spampatt/openshift-velero-plugins:fix-3598
        name: openshift-velero-plugin
shubham@fedora ~/oadp-operator -  (master) $
shubham@fedora ~/oadp-operator -  (master) $ oc get clusterversion version -oyaml | grep -A 15 "enabledCapabilities"
    enabledCapabilities:
    - Build
    - CSISnapshot
    - CloudCredential
    - Console
    - DeploymentConfig
    - Insights
    - MachineAPI
    - NodeTuning
    - OperatorLifecycleManager
    - Storage
    - baremetal
    - marketplace
    - openshift-samples
    knownCapabilities:
    - Build
sseago commented 3 months ago

Looks like there are several now-unused funcs from earlier iterations of this that should probably be removed (around image registry capability and SA secrets). Also, I don't think we're currently checking OCP version -- we probably want to check this and only skip the waiting if are both missing image registry replicas and we're on 4.15 or later, since I suspect that these secrets still exist (and may still be required) in earlier OCP versions even if the registry is down.

shubham-pampattiwar commented 3 months ago

@sseago added OCP version check and removed unused code, PTAL, Thanks !

openshift-ci[bot] commented 3 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kaovilai, sseago

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/openshift/openshift-velero-plugin/blob/master/OWNERS)~~ [kaovilai,sseago] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
sseago commented 3 months ago

/lgtm

openshift-ci[bot] commented 3 months ago

@kaovilai: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).
shubham-pampattiwar commented 3 months ago

/unhold

shubham-pampattiwar commented 3 months ago

/cherry-pick oadp-1.3

openshift-cherrypick-robot commented 3 months ago

@shubham-pampattiwar: once the present PR merges, I will cherry-pick it on top of oadp-1.3 in a new PR and assign it to you.

In response to [this](https://github.com/openshift/openshift-velero-plugin/pull/228#issuecomment-1998536933): >/cherry-pick oadp-1.3 Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
openshift-cherrypick-robot commented 3 months ago

@kaovilai: new pull request created: #235

In response to [this](https://github.com/openshift/openshift-velero-plugin/pull/228#issuecomment-1981921245): >/cherry-pick oadp-1.3 Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.