Is it actually running on a single restore attempt for so long? It's a controller that retries on failure, so if it fails halfway, it will just restart after a back-off. We'll need the failure message, or the full log, to diagnose what is going on. Please also share the installed chart version (if possible, both the version that was used to create the backup and the version that is used to restore it).
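If you're not sure where to look that up, something like this should show it (assuming the chart sits in the default cattle-resources-system namespace and was installed via Helm):

helm ls -n cattle-resources-system
# or, if the operator deployment carries the standard chart label:
kubectl -n cattle-resources-system get deploy rancher-backup -o yaml | grep helm.sh/chart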
I'm trying to debug it further. As both clusters are up and running, I can test it easily.
Rancher backup chart version: 2.1.2
Rancher restore chart version: 2.1.2
Some more details on the restore:
kubectl get restore restore-migration -o yaml
apiVersion: resources.cattle.io/v1
kind: Restore
metadata:
  annotations:
  creationTimestamp: "2022-06-03T09:05:01Z"
  generation: 1
  name: restore-migration
  resourceVersion: "228628"
  uid: 2f5b0a52-4d8b-452d-96c5-3c9c9104484a
spec:
  backupFilename: nightly-c8797ff5-6394-11e8-9072-0242ac110002-2022-06-03T00-00-00Z.tar.gz
  prune: false
  storageLocation:
    s3:
      bucketName: XXXX
      credentialSecretName: s3-creds
      credentialSecretNamespace: default
      endpoint: storage.googleapis.com
      region: null
status:
  backupSource: ""
  conditions:
  - lastUpdateTime: "2022-06-03T10:12:46Z"
    message: error restoring namespaced resources, check logs for exact error
    reason: Error
    status: "False"
    type: Reconciling
  - lastUpdateTime: "2022-06-03T10:12:46Z"
    message: Retrying
    status: Unknown
    type: Ready
  observedGeneration: 0
  restoreCompletionTs: ""
  summary: ""
So there seems to be an issue, but with that large amount of logs it is hard to find the root cause. Is there a way to get only the logs related to this error? With this little input I can't find out which namespaced resources failed exactly.
Yes, we have reduced the logging in the upcoming version for exactly this reason. For now you can grep for error or fail:
kubectl -n cattle-resources-system logs -l app.kubernetes.io/name=rancher-backup --tail=-1 | egrep -i "error|fail"
I can find these two errors:
kubectl -n cattle-resources-system logs -l app.kubernetes.io/name=rancher-backup --tail=-1 | egrep -i "error|fail"
INFO[2022/06/06 13:11:38] restoreResource: Restoring cluster-scan-manual-failure-only of type management.cattle.io/v3, Resource=clusteralertrules
INFO[2022/06/06 13:11:38] restoreResource: Namespace c-pbmqs for name cluster-scan-manual-failure-only of type management.cattle.io/v3, Resource=clusteralertrules
INFO[2022/06/06 13:11:38] Successfully restored cluster-scan-manual-failure-only
ERRO[2022/06/06 13:12:21] Error restoring resource fleet-agent-hetzner-eu-nbg of type fleet.cattle.io/v1alpha1, Resource=bundledeployments: restoreResource: err creating resource bundledeployments.fleet.cattle.io "fleet-agent-hetzner-eu-nbg" is forbidden: unable to create new content in namespace cluster-fleet-default-hetzner-eu-nbg-30430c2cbc38 because it is being terminated
INFO[2022/06/06 13:12:27] restoreResource: Restoring cluster-scan-scheduled--failure-only of type management.cattle.io/v3, Resource=clusteralertrules
INFO[2022/06/06 13:12:27] restoreResource: Namespace local for name cluster-scan-scheduled--failure-only of type management.cattle.io/v3, Resource=clusteralertrules
INFO[2022/06/06 13:12:27] Successfully restored cluster-scan-scheduled--failure-only
ERRO[2022/06/06 13:13:24] Error restoring resource request-vtcpq-66e4ec45-556d-4352-a61f-4261f6a52af2 of type /v1, Resource=serviceaccounts: restoreResource: err creating resource serviceaccounts "request-vtcpq-66e4ec45-556d-4352-a61f-4261f6a52af2" is forbidden: unable to create new content in namespace cluster-fleet-default-hetzner-eu-nbg-30430c2cbc38 because it is being terminated
I don't know what this namespace is used for. I could delete it in the origin cluster, back up again and check whether the restore runs through then.
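Before deleting anything, I'll check what is keeping that namespace stuck in Terminating, roughly like this (quick sketch):

# show the namespace status / conditions blocking the deletion
kubectl get namespace cluster-fleet-default-hetzner-eu-nbg-30430c2cbc38 -o jsonpath='{.status}'

# list whatever still exists inside the terminating namespace
kubectl api-resources --verbs=list --namespaced -o name \
  | xargs -n 1 kubectl get --show-kind --ignore-not-found -n cluster-fleet-default-hetzner-eu-nbg-30430c2cbc38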
After cleaning up the two problematic namespaces, I'm now facing these errors:
WARN[2022/06/06 20:42:59] Error getting object for controllerRef rancher, skipping it
ERRO[2022/06/06 20:42:59] Error restoring namespaced resources [error restoring cert-manager of type project.cattle.io/v3, Resource=apps: restoreResource: err updating resource App.project.cattle.io "cert-manager" is invalid: metadata.deletionGracePeriodSeconds: Invalid value: 0: field is immutable error restoring fleet-agent-local of type fleet.cattle.io/v1alpha1, Resource=bundledeployments: restoreResource: err updating status resource Operation cannot be fulfilled on bundledeployments.fleet.cattle.io "fleet-agent-local": the object has been modified; please apply your changes to the latest version and try again error restoring rocketchat of type project.cattle.io/v3, Resource=apps: restoreResource: err updating resource App.project.cattle.io "rocketchat" is invalid: metadata.deletionGracePeriodSeconds: Invalid value: 0: field is immutable]
ERRO[2022/06/06 20:42:59] error syncing 'restore-migration': handler restore: error restoring namespaced resources, check logs for exact error, requeuing
WARN[2022/06/06 20:43:01] Error getting object for controllerRef rancher, skipping it
Especially "rocketchat" is interesting - never used that before ..
That seems similar to the error in https://github.com/rancher/backup-restore-operator/issues/188. Can you share the resource JSON from the problematic namespaces here so I can check what the problem is with those?
I can see that they are in status "removing" in the upstream cluster ..
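This is roughly how I'm listing the App objects that are stuck in deletion (a sketch; it assumes jq is available):

kubectl get apps.project.cattle.io -A -o json \
  | jq -r '.items[] | select(.metadata.deletionTimestamp != null) | .metadata.namespace + "/" + .metadata.name'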
Here is the corresponding yaml definition of cert-manager:
apiVersion: project.cattle.io/v3
kind: App
metadata:
  annotations:
    field.cattle.io/creatorId: user-4qxbk
    lifecycle.cattle.io/create.helm-controller_c-t6lrq: "true"
  creationTimestamp: "2018-11-22T11:39:43Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2019-08-25T18:51:22Z"
  finalizers:
  - clusterscoped.controller.cattle.io/helm-controller_c-t6lrq
  generation: 2
  labels:
    cattle.io/creator: norman
  name: cert-manager
  namespace: p-9hlht
  resourceVersion: "55421992"
  uid: 4e1e5699-ee4b-11e8-86b4-0242ac110002
spec:
  appRevisionName: apprevision-x5sxt
  externalId: catalog://?catalog=helm&template=cert-manager&version=v0.5.2
  projectName: c-t6lrq:p-9hlht
  targetNamespace: cert-manager
status:
  conditions:
  - lastUpdateTime: "2018-11-22T11:39:43Z"
    status: "True"
    type: Migrated
  - lastUpdateTime: "2018-11-22T11:39:45Z"
    status: "True"
    type: Installed
  notes: |+
    ---
    # Source: cert-manager/templates/NOTES.txt
    cert-manager has been deployed successfully!
    In order to begin issuing certificates, you will need to set up a ClusterIssuer
    or Issuer resource (for example, by creating a 'letsencrypt-staging' issuer).
    More information on the different types of issuers and how to configure them
    can be found in our documentation:
    https://cert-manager.readthedocs.io/en/latest/reference/issuers.html
    For information on how to configure cert-manager to automatically provision
    Certificates for Ingress resources, take a look at the `ingress-shim`
    documentation:
    https://cert-manager.readthedocs.io/en/latest/reference/ingress-shim.html
So yes, it seems to be related to #188
After removing these two entities (by removing the finalizers), backing up again and restoring again, everything finished now!
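For anyone running into the same thing, clearing the finalizers looked roughly like this (a sketch for my case; kubectl edit works just as well, and the rocketchat App sits in its own project namespace):

# clear the finalizers on the stuck App objects in the source cluster
kubectl -n p-9hlht patch apps.project.cattle.io cert-manager \
  --type=merge -p '{"metadata":{"finalizers":[]}}'
# same for rocketchat (replace <project-namespace> with its actual project namespace)
kubectl -n <project-namespace> patch apps.project.cattle.io rocketchat \
  --type=merge -p '{"metadata":{"finalizers":[]}}'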
Thanks, I will merge this into https://github.com/rancher/backup-restore-operator/issues/188 and see if I can work on it this week.
Rancher 2.6.3 was backed up to S3 - the backup tar.gz used for the recovery is 7 MB in size.
When trying to recover Rancher on a new and empty K3s cluster, the recovery runs for ages .. when looking at the backup-restore operator logs, I can see entries like this scrolling past:
It feels like it is trying to restore the whole Bitnami Helm catalog ..
I already disabled the legacy features in the original Rancher, but the catalogtemplateversions are still in the backup.
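To get an idea of how much of the backup these objects make up, they can be counted on the source cluster (rough check):

# count the legacy catalog template versions that end up in the backup
kubectl get catalogtemplateversions.management.cattle.io -A --no-headers | wc -l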
It's hard doing a "disaster recovery" if it takes a week for a very small Rancher system :D