migtools / mig-controller

OpenShift Migration Controller
Apache License 2.0

restore stuck on resticrestore #631

Open alaypatel07 opened 4 years ago

alaypatel07 commented 4 years ago

One example of a silent failure where CAM seems to be doing something but isn't moving forward for a significant amount of time.

$ oc get migmigration 557cf990-e0a9-11ea-8704-0789fb59f58b -oyaml
status:
  conditions:
  - category: Advisory
    lastTransitionTime: "2020-08-17T16:54:06Z"
    message: 'Step: 18/22'
    reason: StageRestoreCreated
    status: "True"
    type: Running
  - category: Required
    lastTransitionTime: "2020-08-17T16:47:42Z"
    message: The migration is ready.
    status: "True"
    type: Ready
  - category: Advisory
    durable: true
    lastTransitionTime: "2020-08-17T16:50:02Z"
    message: '[1] Stage pods created.'
    status: "True"
    type: StagePodsCreated
  itinerary: Stage
  observedDigest: 8c0598cd06d3d55f4378d927d5350ee82216fe2b68e52e3d125ef1b5f238d25a
  phase: StageRestoreCreated
  startTimestamp: "2020-08-17T16:47:27Z"

The MigMigration has been stuck in the StageRestoreCreated phase for ~30 minutes.
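
For reference, the stage Restore that the migration is waiting on can be located via the migration-stage-restore label visible in the metadata below (it points at the MigMigration UID), e.g.:

 $ oc -n openshift-migration get restore -l migration-stage-restore=dd4319de-1c1f-45c5-8a47-944f1403a23d -oyaml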

apiVersion: velero.io/v1
kind: Restore
metadata:
  annotations:
    openshift.io/migrate-copy-phase: stage
    openshift.io/migration-registry: 172.30.162.97:5000
    openshift.io/migration-registry-dir: /migstorage-sample-3-registry-d535fc93-18db-4b73-852e-d6a86409561b
  creationTimestamp: "2020-08-17T16:54:02Z"
  generateName: 557cf990-e0a9-11ea-8704-0789fb59f58b-
  generation: 2
  labels:
    app.kubernetes.io/part-of: openshift-migration
    migmigration: dd4319de-1c1f-45c5-8a47-944f1403a23d
    migration-stage-restore: dd4319de-1c1f-45c5-8a47-944f1403a23d
  name: 557cf990-e0a9-11ea-8704-0789fb59f58b-9hk7k
  namespace: openshift-migration
  resourceVersion: "13024034"
  selfLink: /apis/velero.io/v1/namespaces/openshift-migration/restores/557cf990-e0a9-11ea-8704-0789fb59f58b-9hk7k
  uid: fd1b8f61-9fce-47d6-860c-e4e0dfcd42cc
spec:
  backupName: 557cf990-e0a9-11ea-8704-0789fb59f58b-f9hbd
  excludedResources:
  - serviceinstances.servicecatalog.k8s.io
  - nodes
  - events
  - events.events.k8s.io
  - backups.velero.io
  - restores.velero.io
  - resticrepositories.velero.io
  restorePVs: true
status:
  phase: InProgress

The restore is in progress. The velero logs suggest that it is waiting for restic:

$ oc logs -f  velero-74579bf545-5nk2q | grep 557cf990-e0a9-11ea-8704-0789fb59f58b-9hk7k
...
...
...
time="2020-08-17T16:54:10Z" level=info msg="[common-restore] Entering common migration restore plugin" cmd=/plugins/velero-plugins logSource="/go/src/github.com/konveyor/openshift-migration-plugin/velero-plugins/migcommon/restore.go:24" pluginName=velero-plugins restore=openshift-migration/557cf990-e0a9-11ea-8704-0789fb59f58b-9hk7k
time="2020-08-17T16:54:10Z" level=info msg="[common-restore] common migration restore plugin for musicstore:latest" cmd=/plugins/velero-plugins logSource="/go/src/github.com/konveyor/openshift-migration-plugin/velero-plugins/migcommon/restore.go:35" pluginName=velero-plugins restore=openshift-migration/557cf990-e0a9-11ea-8704-0789fb59f58b-9hk7k
time="2020-08-17T16:54:10Z" level=info msg="Executing item action for imagestreamtags.image.openshift.io" logSource="pkg/restore/restore.go:1030" restore=openshift-migration/557cf990-e0a9-11ea-8704-0789fb59f58b-9hk7k
time="2020-08-17T16:54:10Z" level=info msg="[istag-restore] Entering ImageStreamTag restore plugin" cmd=/plugins/velero-plugins logSource="/go/src/github.com/konveyor/openshift-migration-plugin/velero-plugins/migimagestreamtag/restore.go:31" pluginName=velero-plugins restore=openshift-migration/557cf990-e0a9-11ea-8704-0789fb59f58b-9hk7k
time="2020-08-17T16:54:10Z" level=info msg="[istag-restore] Restoring imagestreamtag musicstore:latest" cmd=/plugins/velero-plugins logSource="/go/src/github.com/konveyor/openshift-migration-plugin/velero-plugins/migimagestreamtag/restore.go:41" pluginName=velero-plugins restore=openshift-migration/557cf990-e0a9-11ea-8704-0789fb59f58b-9hk7k
time="2020-08-17T16:54:10Z" level=info msg="[istag-restore] backup internal registry: \"docker-registry.default.svc:5000\"" cmd=/plugins/velero-plugins logSource="/go/src/github.com/konveyor/openshift-migration-plugin/velero-plugins/migimagestreamtag/restore.go:44" pluginName=velero-plugins restore=openshift-migration/557cf990-e0a9-11ea-8704-0789fb59f58b-9hk7k
time="2020-08-17T16:54:10Z" level=info msg="[istag-restore] Local image: docker-registry.default.svc:5000/image-playground/musicstore@sha256:c398db4e73212f80e72bc23f9357850b60d835f4314c6086ef08da68f013d065" cmd=/plugins/velero-plugins logSource="/go/src/github.com/konveyor/openshift-migration-plugin/velero-plugins/migimagestreamtag/restore.go:48" pluginName=velero-plugins restore=openshift-migration/557cf990-e0a9-11ea-8704-0789fb59f58b-9hk7k
time="2020-08-17T16:54:10Z" level=info msg="[istag-restore] Not restoring local imagestreamtag" cmd=/plugins/velero-plugins logSource="/go/src/github.com/konveyor/openshift-migration-plugin/velero-plugins/migimagestreamtag/restore.go:137" pluginName=velero-plugins restore=openshift-migration/557cf990-e0a9-11ea-8704-0789fb59f58b-9hk7k
time="2020-08-17T16:54:10Z" level=info msg="Skipping restore of ImageStreamTag: musicstore:latest because a registered plugin discarded it" logSource="pkg/restore/restore.go:1043" restore=openshift-migration/557cf990-e0a9-11ea-8704-0789fb59f58b-9hk7k
time="2020-08-17T16:54:17Z" level=info msg="Not including resource" groupResource=nodes logSource="pkg/restore/restore.go:138" restore=openshift-migration/557cf990-e0a9-11ea-8704-0789fb59f58b-9hk7k
time="2020-08-17T16:54:17Z" level=info msg="Not including resource" groupResource=events logSource="pkg/restore/restore.go:138" restore=openshift-migration/557cf990-e0a9-11ea-8704-0789fb59f58b-9hk7k
time="2020-08-17T16:54:17Z" level=info msg="Not including resource" groupResource=events.events.k8s.io logSource="pkg/restore/restore.go:138" restore=openshift-migration/557cf990-e0a9-11ea-8704-0789fb59f58b-9hk7k
time="2020-08-17T16:54:17Z" level=info msg="Not including resource" groupResource=backups.velero.io logSource="pkg/restore/restore.go:138" restore=openshift-migration/557cf990-e0a9-11ea-8704-0789fb59f58b-9hk7k
time="2020-08-17T16:54:17Z" level=info msg="Not including resource" groupResource=resticrepositories.velero.io logSource="pkg/restore/restore.go:138" restore=openshift-migration/557cf990-e0a9-11ea-8704-0789fb59f58b-9hk7k
time="2020-08-17T16:54:17Z" level=info msg="Not including resource" groupResource=restores.velero.io logSource="pkg/restore/restore.go:138" restore=openshift-migration/557cf990-e0a9-11ea-8704-0789fb59f58b-9hk7k
time="2020-08-17T16:54:17Z" level=info msg="Waiting for all restic restores to complete" logSource="pkg/restore/restore.go:545" restore=openshift-migration/557cf990-e0a9-11ea-8704-0789fb59f58b-9hk7k

All three restic pods show the following logs:

E0817 17:32:17.661201       1 reflector.go:156] github.com/vmware-tanzu/velero/pkg/cmd/cli/restic/server.go:196: Failed to list *v1.Pod: Unauthorized
E0817 17:32:17.784321       1 reflector.go:156] github.com/vmware-tanzu/velero/pkg/generated/informers/externalversions/factory.go:117: Failed to list *v1.PodVolumeBackup: Unauthorized
E0817 17:32:18.041672       1 reflector.go:156] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.PersistentVolumeClaim: Unauthorized
E0817 17:32:18.079014       1 reflector.go:156] github.com/vmware-tanzu/velero/pkg/generated/informers/externalversions/factory.go:117: Failed to list *v1.PodVolumeRestore: Unauthorized
E0817 17:32:18.297697       1 reflector.go:156] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.PersistentVolume: Unauthorized
E0817 17:32:18.336272       1 reflector.go:156] github.com/vmware-tanzu/velero/pkg/cmd/cli/restic/server.go:197: Failed to list *v1.Secret: Unauthorized
E0817 17:32:18.354345       1 reflector.go:156] github.com/vmware-tanzu/velero/pkg/generated/informers/externalversions/factory.go:117: Failed to list *v1.BackupStorageLocation: Unauthorized
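
For reference, the same Unauthorized errors can be pulled from every restic pod with a loop like this (assuming the restic pods live in openshift-migration):

 $ for i in $(oc -n openshift-migration get pods | grep restic | awk '{ print $1 }'); do echo "== $i"; oc -n openshift-migration logs $i | grep -i unauthorized | tail -n 5; done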

Logging this as an issue so we can think about how to bubble this error up into CAM, so that it is evident to the user and they can ask for next steps.

alaypatel07 commented 4 years ago

Update: I ran the following command, which unblocked the migration:

 $ for i in $(oc get pods | grep restic | awk '{ print $1 }'); do oc delete pod $i; done
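
Equivalently (assuming the restic daemonset pods carry Velero's name=restic label), the same restart can be done with a single label-selector delete; the daemonset recreates the pods, and the new pods mount a fresh SA token:

 $ oc -n openshift-migration delete pods -l name=restic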

Migmigration status:

  resourceVersion: "13038345"
  selfLink: /apis/migration.openshift.io/v1alpha1/namespaces/openshift-migration/migmigrations/557cf990-e0a9-11ea-8704-0789fb59f58b
  uid: dd4319de-1c1f-45c5-8a47-944f1403a23d
spec:
  migPlanRef:
    name: migplan-sample-b-3
    namespace: openshift-migration
  stage: true
status:
  conditions:
  - category: Advisory
    durable: true
    lastTransitionTime: "2020-08-17T17:37:52Z"
    message: The migration has completed successfully.
    reason: Completed
    status: "True"
    type: Succeeded
  itinerary: Stage
  observedDigest: 8c0598cd06d3d55f4378d927d5350ee82216fe2b68e52e3d125ef1b5f238d25a
  phase: Completed
  startTimestamp: "2020-08-17T16:47:27Z"

djwhatle commented 4 years ago

@alaypatel07 if Restic is unauthorized to list Pods, PVCs, and PVs, does this mean that Restic's SA token isn't properly authorized? Perhaps this SA token is getting rotated and Restic has a stale SA token? Can we reproduce this issue reliably, and if so, what are the steps?

It looks like Restic is using a token called velero-token-2lshb on my cluster with mig-operator latest.

    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: velero-token-2lshb

It appears the SA token velero-token-2lshb comes from either OLM or mig-operator via the velero SA, which is designated on the Velero pod:

      serviceAccountName: velero
      securityContext:
        runAsUser: 0
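
One quick way to check the stale-token theory (the secret name above and Velero's standard name=restic pod label are assumptions) would be to compare the creation timestamps of the token secret and the restic pods:

 $ oc -n openshift-migration get secret velero-token-2lshb -o jsonpath='{.metadata.creationTimestamp}{"\n"}'
 $ oc -n openshift-migration get pods -l name=restic -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.creationTimestamp}{"\n"}{end}'
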
djwhatle commented 4 years ago

@jmontleon can you comment on above? Where does the Velero SA / SA token come from?

alaypatel07 commented 4 years ago

@djwhatle maybe it comes from OLM: https://github.com/konveyor/mig-operator/blob/228686dc55dfb923af044913e50f456e0d2ee96b/deploy/olm-catalog/konveyor-operator/v1.2.4/konveyor-operator.v1.2.4.clusterserviceversion.yaml#L631. I also did a couple of custom deployments of mig-controller, which could be why the pods have a stale token and hence the RBAC problems. I'll wait for @jmontleon to confirm, but if this is true, two follow-up questions:

  1. Is this something we expect customers to do, i.e. install, remove, and re-install CAM? If so, they will probably run into these stale permissions on the velero/restic pods.
  2. It looks like the problem is: velero is installed with the old SA token, CAM is re-installed, and mig-operator finds the existing deployment okay and does not re-create it. I am not sure how we can force a graceful re-deployment of velero (one option is sketched below).
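
On point 2, one graceful option (assuming an oc client new enough to support rollout restart, and the default velero/restic workload names) might be to restart the workloads so that the new pods mount the current SA token:

 $ oc -n openshift-migration rollout restart deployment/velero
 $ oc -n openshift-migration rollout restart daemonset/restic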

@djwhatle I think the way to reproduce this would be:

  1. install konveyor
  2. create migration-controller resource to deploy velero
  3. change RBAC of something in mig-operator
  4. uninstall and install konveyor again
  5. check velero pod logs

I haven't tried this, but based on the theory and my experience, the above will likely reproduce it.

jmontleon commented 4 years ago

OLM. It's possible on an upgrade/reinstall, but otherwise I wouldn't expect it to be changing

jmontleon commented 4 years ago

It might be possible to do a k8s info/facts query for the SA and the pods and, if the pods are older than the SA, delete them (so they are recreated) when the operator is reconciling.
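
A rough sketch of that check, written here as oc commands rather than the Ansible k8s_info tasks the operator would actually use (the name=restic label and the velero SA name are assumptions):

 # delete restic pods that were created before the current velero ServiceAccount
 sa_created=$(oc -n openshift-migration get sa velero -o jsonpath='{.metadata.creationTimestamp}')
 for p in $(oc -n openshift-migration get pods -l name=restic -o name); do
   pod_created=$(oc -n openshift-migration get "$p" -o jsonpath='{.metadata.creationTimestamp}')
   # RFC3339 timestamps sort lexicographically, so a plain string compare works
   if [[ "$pod_created" < "$sa_created" ]]; then
     oc -n openshift-migration delete "$p"   # the daemonset recreates it with a fresh token
   fi
 done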

djwhatle commented 4 years ago

I think this is being addressed by the progress reporting added in #692.

alaypatel07 commented 3 years ago

@djwhatle I think we should handle this from mig-operator; I am not sure how it is being handled in the progress-reporting PR. Can I re-open this and submit a PR in mig-operator to do the following?

It might be possible to do a k8s info/facts query for the SA and the pods and, if the pods are older than the SA, delete them (so they are recreated) when the operator is reconciling.

djwhatle commented 3 years ago

@alaypatel07 yeah sure, feel free to re-open anything you see that you feel isn't resolved. The mig-operator solution seems reasonable here.

jmontleon commented 3 years ago

It looks like deleting the operator in OLM deletes the velero SA. When you reinstall, the SA gets recreated.

Deleting the MigrationController CR before uninstalling the operator will clean up the operands (example commands are sketched after the list):

  1. backup and delete the MigrationController CR
  2. uninstall the operator
  3. reinstall the operator
  4. recreate the MigrationController CR
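
For reference, steps 1 and 4 amount to something like the following (migration-controller is mig-operator's default CR name and is an assumption here):

 # step 1: back up and delete the MigrationController CR
 $ oc -n openshift-migration get migrationcontroller migration-controller -o yaml > migration-controller.yaml
 $ oc -n openshift-migration delete migrationcontroller migration-controller
 # ... steps 2-3: uninstall and reinstall the operator via OLM ...
 # step 4: recreate the CR (strip status and server-generated metadata such as uid/resourceVersion from the backup first)
 $ oc -n openshift-migration create -f migration-controller.yaml
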
eriknelson commented 3 years ago

I'm going to file a docs BZ to make sure Jason's process is included. Do we think there's a bug here at all, source-wise, that needs to get fixed? It doesn't seem like it, but I want to confirm.

@alaypatel07 @jmontleon

eriknelson commented 3 years ago

Docs BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1897654