openshift / openshift-velero-plugin

General Velero plugin for backup and restore of openshift workloads.
Apache License 2.0

Backup hangs on [GetRegistryInfo] value from imagestream #113

Closed gorantornqvist closed 2 years ago

gorantornqvist commented 2 years ago

Hi, I'm using oadp-0.5.3 with a DPA configured with two backupLocations:

apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
  name: sf-dpa
  namespace: openshift-adp
spec:
  configuration:
    velero:
      defaultPlugins:
      - openshift
      - aws
      - csi
      featureFlags:
      - EnableCSI
    restic:
      enable: true
  backupImages: true
  backupLocations:
    - name: sf-dpa-1
      velero:
        provider: aws
        default: true
        credential:
          key: sf-ocp
          name: cloud-credentials
        objectStorage:
          bucket: oadp-backups-nimbus-ec-u12-dr01
          prefix: nimbus
        config:
          insecureSkipTLSVerify: "true"
          profile: default
          region: us-east-1
          s3ForcePathStyle: "true"
          s3Url: https://s3-netapp-storagegrid-fqdn
          signatureVersion: "4"
    - name: sf-dpa-2
      velero:
        provider: aws
        default: false
        credential:
          key: demo-customer
          name: cloud-credentials
        objectStorage:
          bucket: nimbus-demo-customer-ec-u12-dr01
          prefix: nimbus
        config:
          insecureSkipTLSVerify: "true"
          profile: default
          region: us-east-1
          s3ForcePathStyle: "true"
          s3Url: https://s3-netapp-storagegrid-fqdn
          signatureVersion: "4"
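Before running a backup against the secondary location, it can help to confirm that both BackupStorageLocations are reporting Available (namespace taken from the DPA above):

```
# Both locations should show PHASE=Available
oc get backupstoragelocations -n openshift-adp
```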

Created a backup:

apiVersion: velero.io/v1
kind: Backup
metadata:
  namespace: openshift-adp
  generateName: demo-customer-backup-
spec:
  defaultVolumesToRestic: true
  storageLocation: sf-dpa-2
  includedNamespaces:
    - demo-customer
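Because the Backup manifest uses generateName rather than a fixed name, it has to be submitted with oc create (oc apply would fail without a name), after which the generated backup can be watched (the filename backup.yaml is just an example):

```
oc create -f backup.yaml

# Watch the generated backup's phase
oc get backups -n openshift-adp -w
```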

I also enabled --log-level debug on the velero pod, and in the logs I found that the backup appears to hang in openshift-velero-plugin:

time="2021-12-30T07:52:52Z" level=info msg="[common-backup] Entering common backup plugin" backup=openshift-adp/demo-customer-backup-z9gcj cmd=/plugins/velero-plugins logSource="/opt/app-root/src/github.com/konveyor/openshift-velero-plugin/velero-plugins/common/backup.go:39" pluginName=velero-plugins
time="2021-12-30T07:52:52Z" level=info msg="[GetRegistryInfo] value from imagestream" backup=openshift-adp/demo-customer-backup-z9gcj cmd=/plugins/velero-plugins logSource="/opt/app-root/src/github.com/konveyor/openshift-velero-plugin/velero-plugins/common/shared.go:36" pluginName=velero-plugins
time="2021-12-30T07:52:52Z" level=debug msg="Skipping action because it does not apply to this resource" backup=openshift-adp/demo-customer-backup-z9gcj logSource="pkg/backup/item_backupper.go:308" name=demo-customer-backend-86b7cfbb9f-psbxm namespace=demo-customer resource=pods
time="2021-12-30T07:52:52Z" level=debug msg="Skipping action because it does not apply to this resource" backup=openshift-adp/demo-customer-backup-z9gcj logSource="pkg/backup/item_backupper.go:308" name=demo-customer-backend-86b7cfbb9f-psbxm namespace=demo-customer resource=pods
time="2021-12-30T07:52:52Z" level=debug msg="Skipping action because it does not apply to this resource" backup=openshift-adp/demo-customer-backup-z9gcj logSource="pkg/backup/item_backupper.go:308" name=demo-customer-backend-86b7cfbb9f-psbxm namespace=demo-customer resource=pods
time="2021-12-30T07:52:52Z" level=debug msg="Skipping action because it does not apply to this resource" backup=openshift-adp/demo-customer-backup-z9gcj logSource="pkg/backup/item_backupper.go:308" name=demo-customer-backend-86b7cfbb9f-psbxm namespace=demo-customer resource=pods
time="2021-12-30T07:52:52Z" level=debug msg="Skipping action because it does not apply to this resource" backup=openshift-adp/demo-customer-backup-z9gcj logSource="pkg/backup/item_backupper.go:308" name=demo-customer-backend-86b7cfbb9f-psbxm namespace=demo-customer resource=pods
time="2021-12-30T07:52:52Z" level=debug msg="Skipping action because it does not apply to this resource" backup=openshift-adp/demo-customer-backup-z9gcj logSource="pkg/backup/item_backupper.go:308" name=demo-customer-backend-86b7cfbb9f-psbxm namespace=demo-customer resource=pods
time="2021-12-30T07:52:52Z" level=debug msg="Acquiring lock" backupLocation=sf-dpa-2 logSource="pkg/restic/repository_ensurer.go:122" volumeNamespace=demo-customer
time="2021-12-30T07:52:52Z" level=debug msg="Acquired lock" backupLocation=sf-dpa-2 logSource="pkg/restic/repository_ensurer.go:131" volumeNamespace=demo-customer
time="2021-12-30T07:52:52Z" level=debug msg="Ready repository found" backupLocation=sf-dpa-2 logSource="pkg/restic/repository_ensurer.go:147" volumeNamespace=demo-customer
time="2021-12-30T07:52:52Z" level=debug msg="Released lock" backupLocation=sf-dpa-2 logSource="pkg/restic/repository_ensurer.go:128" volumeNamespace=demo-customer

Hanging ...
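Since defaultVolumesToRestic is true, the PodVolumeBackup is processed by the restic daemonset pod running on the same node as the application pod, so one sanity check is that a restic pod is Running on every worker node (the name=restic label matches the upstream Velero daemonset; it may differ in other OADP versions):

```
# Expect one Running restic pod per schedulable worker node
oc get pods -n openshift-adp -l name=restic -o wide
```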

I can also see that the PodVolumeBackup is stuck and doesn't progress:

$ oc describe PodVolumeBackup demo-customer-backup-z9gcj-n6bj8
Name:         demo-customer-backup-z9gcj-n6bj8
Namespace:    openshift-adp
Labels:       velero.io/backup-name=demo-customer-backup-z9gcj
              velero.io/backup-uid=57465526-2181-4e7d-8fcc-5b61af769c58
              velero.io/pvc-uid=875b84a8-84c6-4dec-94a8-6a1f2e286e18
Annotations:  velero.io/pvc-name: demo-customer-backend-pvc
API Version:  velero.io/v1
Kind:         PodVolumeBackup
Metadata:
  Creation Timestamp:  2021-12-30T07:52:52Z
  Generate Name:       demo-customer-backup-z9gcj-
  Generation:          1
    Manager:    velero-server
    Operation:  Update
    Time:       2021-12-30T07:52:52Z
  Owner References:
    API Version:     velero.io/v1
    Controller:      true
    Kind:            Backup
    Name:            demo-customer-backup-z9gcj
    UID:             57465526-2181-4e7d-8fcc-5b61af769c58
  Resource Version:  261201462
  UID:               fba2cc44-9e71-4709-b522-d44fa301b43d
Spec:
  Backup Storage Location:  sf-dpa-2
  Node:                     mycluster-x4rx9-worker-5hj47
  Pod:
    Kind:           Pod
    Name:           demo-customer-backend-86b7cfbb9f-psbxm
    Namespace:      demo-customer
    UID:            1917842f-5308-4180-b016-4e9ae0437daf
  Repo Identifier:  s3:https://s3-netapp-storagegrid-fqdn/nimbus-demo-customer-ec-u12-dr01/nimbus/restic/demo-customer
  Tags:
    Backup:        demo-customer-backup-z9gcj
    Backup - UID:  57465526-2181-4e7d-8fcc-5b61af769c58
    Ns:            demo-customer
    Pod:           demo-customer-backend-86b7cfbb9f-psbxm
    Pod - UID:     1917842f-5308-4180-b016-4e9ae0437daf
    Pvc - UID:     875b84a8-84c6-4dec-94a8-6a1f2e286e18
    Volume:        demo-customer-backend-storage
  Volume:          demo-customer-backend-storage
Status:
  Progress:
Events:  <none>
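The empty Progress and Events above suggest the restic controller never picked the PodVolumeBackup up at all; watching the phase column makes this easier to track than repeated describes (a healthy PVB moves from New through InProgress to Completed):

```
# Watch all PVBs belonging to this backup
oc get podvolumebackups -n openshift-adp \
  -l velero.io/backup-name=demo-customer-backup-z9gcj -w
```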

(The pod demo-customer-backend-86b7cfbb9f-psbxm has annotations "backup.velero.io/backup-volumes: demo-customer-backend-storage" set)

When checking the logs for oadp-sf-dpa-2-aws-registry, I can't see anything relevant ... According to the docs I should see the "is-backup" start, but it doesn't...

Any suggestions on how to troubleshoot this?

gorantornqvist commented 2 years ago

Sorry, there was a nodeSelector on the oadp-operator namespace that prevented the restic pods from being scheduled on the worker nodes, doh!
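For anyone hitting the same symptom: a project-level node selector shows up as the openshift.io/node-selector annotation on the namespace, and a mismatch leaves the restic daemonset with fewer ready pods than worker nodes. A quick way to check (the daemonset name restic matches this OADP version; adjust if yours differs):

```
# A non-empty value here restricts where pods in the namespace can schedule
oc get namespace openshift-adp \
  -o jsonpath='{.metadata.annotations.openshift\.io/node-selector}'

# DESIRED/READY should match the number of schedulable worker nodes
oc get daemonset restic -n openshift-adp
```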