alaypatel07 opened 4 years ago
update: I ran the following command, which unblocked the migrations:

```sh
$ for i in $(oc get pods | grep restic | awk '{ print $1 }'); do oc delete pod $i; done
```
Migmigration status:

```yaml
  resourceVersion: "13038345"
  selfLink: /apis/migration.openshift.io/v1alpha1/namespaces/openshift-migration/migmigrations/557cf990-e0a9-11ea-8704-0789fb59f58b
  uid: dd4319de-1c1f-45c5-8a47-944f1403a23d
spec:
  migPlanRef:
    name: migplan-sample-b-3
    namespace: openshift-migration
  stage: true
status:
  conditions:
  - category: Advisory
    durable: true
    lastTransitionTime: "2020-08-17T17:37:52Z"
    message: The migration has completed successfully.
    reason: Completed
    status: "True"
    type: Succeeded
  itenerary: Stage
  observedDigest: 8c0598cd06d3d55f4378d927d5350ee82216fe2b68e52e3d125ef1b5f238d25a
  phase: Completed
  startTimestamp: "2020-08-17T16:47:27Z"
```
@alaypatel07 if Restic is unauthorized to list Pods, PVCs, and PVs, does this mean that Restic's SA token isn't properly authorized? Perhaps this SA token is getting rotated and Restic has a stale SA token? Can we reproduce this issue reliably, and if so, what are the steps?
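One way to sanity-check the stale-token theory is to compare the SA's creation time with each restic pod's start time: a pod that started before the SA was (re)created is still mounting the old token. This is only a sketch, not the actual controller logic; the timestamps below are made up, and on a real cluster they would come from `oc ... -o jsonpath=...` queries as shown in the comments:

```shell
#!/bin/bash
# Sketch of a staleness check (illustrative values, not real cluster data).
# On a live cluster the inputs would come from:
#   oc -n openshift-migration get sa velero -o jsonpath='{.metadata.creationTimestamp}'
#   oc -n openshift-migration get pod <restic-pod> -o jsonpath='{.status.startTime}'
# RFC3339 UTC timestamps compare correctly as plain strings.
is_stale() {
  local pod_started=$1 sa_created=$2
  [[ "$pod_started" < "$sa_created" ]]
}

# Illustrative case: the pod started before the SA was recreated, so the
# token it mounts no longer exists / is no longer authorized.
if is_stale "2020-08-17T15:30:00Z" "2020-08-17T16:00:00Z"; then
  echo "stale token: delete the pod so the DaemonSet recreates it"
fi
```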
It looks like Restic is using a token called `velero-token-2lshb` on my cluster with mig-operator latest:

```yaml
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
  name: velero-token-2lshb
```
It appears SA token `velero-token-2lshb` comes from either OLM or mig-operator, via the `velero` SA which is designated on the Velero pod:

```yaml
serviceAccountName: velero
securityContext:
  runAsUser: 0
```
@jmontleon can you comment on above? Where does the Velero SA / SA token come from?
@djwhatle maybe it comes from OLM: https://github.com/konveyor/mig-operator/blob/228686dc55dfb923af044913e50f456e0d2ee96b/deploy/olm-catalog/konveyor-operator/v1.2.4/konveyor-operator.v1.2.4.clusterserviceversion.yaml#L631. I also did a couple of custom deployments of mig-controller, which could be why it has a stale token and hence the RBAC problems. I'll wait for @jmontleon to confirm, but if this is true, two follow-up questions:
@djwhatle I think the way to reproduce this would be:
I haven't tried this, but based on the theory and my experience, the above will likely reproduce it.
OLM. It's possible it changes on an upgrade/reinstall, but otherwise I wouldn't expect it to be changing.
It might be possible to do a k8s info/facts query for the SA and pods and, if the pods are older than the SA, delete them (so they are recreated) when the operator is reconciling.
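A minimal sketch of that reconcile check as Ansible tasks (mig-operator is Ansible-based; the task names and the `name=restic` label selector are illustrative assumptions, not the actual mig-operator role):

```yaml
# Hypothetical reconcile tasks: delete restic pods created before the
# velero SA so the DaemonSet recreates them with a fresh token mount.
- name: Look up the velero ServiceAccount
  k8s_info:
    api_version: v1
    kind: ServiceAccount
    name: velero
    namespace: openshift-migration
  register: velero_sa

- name: Look up restic pods
  k8s_info:
    api_version: v1
    kind: Pod
    namespace: openshift-migration
    label_selectors:
    - name=restic   # assumed label; verify against the restic DaemonSet
  register: restic_pods

- name: Delete restic pods older than the SA (DaemonSet recreates them)
  k8s:
    state: absent
    api_version: v1
    kind: Pod
    name: "{{ item.metadata.name }}"
    namespace: openshift-migration
  loop: "{{ restic_pods.resources }}"
  when: item.metadata.creationTimestamp < velero_sa.resources[0].metadata.creationTimestamp
```

The timestamp comparison works as a string comparison because Kubernetes emits RFC3339 UTC timestamps, which sort lexicographically.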
I think this is being addressed by adding progress reporting in #692.
@djwhatle I think we should handle this from mig-operator. I am not sure how this is being handled in the progress reporting PR. Can I re-open this and submit a PR in mig-operator to do the following?
> It might be possible to do a k8s info/facts query for the sa and pods and if the pods are older than the SA delete them (so they are recreated) when the operator is reconciling.
@alaypatel07 yeah sure, feel free to re-open anything you see that you feel isn't resolved. The mig-operator solution seems reasonable here.
It looks like deleting the operator in OLM deletes the velero SA. When you reinstall, the SA gets recreated.
Deleting the MigrationController CR before uninstalling the operator will clean up the operands.
I'm going to file a docs BZ to make sure Jason's process is included. Do we think there's a bug here at all source-wise that needs to get fixed? Doesn't seem like it but want to confirm.
@alaypatel07 @jmontleon
One of the examples of silent failure where CAM seems to be doing something but isn't moving forward for a significant amount of time.
The migmigration has been stuck in the StageRestoreCreated phase for ~30 mins or so.
The restore is in progress. The velero logs suggest that it is waiting for restic.
All three restic pods show the following logs:
Logging this as an issue so we can think about how to bubble this error up into CAM, so that it is evident to the user and the user can ask for next steps.