rancher / backup-restore-operator


Rancher recovery has been running for 4 days already #242

Closed JannikZed closed 2 years ago

JannikZed commented 2 years ago

Rancher 2.6.3 was backed up to S3 - the backup tar.gz is 7 MB in size.

When trying to restore Rancher onto a new and empty K3s cluster, the restore runs for ages. Looking at the backup-restore operator logs, I can see entries like these scrolling past:

INFO[2022/06/06 09:40:05] restoreResource: Namespace p-vjqd8 for name bitnami-v2-concourse-0.1.16 of type management.cattle.io/v3, Resource=catalogtemplateversions
INFO[2022/06/06 09:40:05] Successfully restored bitnami-v2-concourse-0.1.16
INFO[2022/06/06 09:40:05] restoreResource: Restoring bitnami-wavefront-hpa-adapter-1.0.1 of type management.cattle.io/v3, Resource=catalogtemplateversions
INFO[2022/06/06 09:40:05] restoreResource: Namespace p-vjqd8 for name bitnami-wavefront-hpa-adapter-1.0.1 of type management.cattle.io/v3, Resource=catalogtemplateversions
INFO[2022/06/06 09:40:05] Successfully restored bitnami-wavefront-hpa-adapter-1.0.1
INFO[2022/06/06 09:40:05] restoreResource: Restoring bitnami-consul-9.3.8 of type management.cattle.io/v3, Resource=catalogtemplateversions
INFO[2022/06/06 09:40:05] restoreResource: Namespace p-vjqd8 for name bitnami-consul-9.3.8 of type management.cattle.io/v3, Resource=catalogtemplateversions
INFO[2022/06/06 09:40:05] Successfully restored bitnami-consul-9.3.8
INFO[2022/06/06 09:40:05] restoreResource: Restoring bitnami-v2-fluentd-5.0.10 of type management.cattle.io/v3, Resource=catalogtemplateversions
INFO[2022/06/06 09:40:05] restoreResource: Namespace p-vjqd8 for name bitnami-v2-fluentd-5.0.10 of type management.cattle.io/v3, Resource=catalogtemplateversions
INFO[2022/06/06 09:40:05] Successfully restored bitnami-v2-fluentd-5.0.10
INFO[2022/06/06 09:40:05] restoreResource: Restoring bitnami-kube-prometheus-6.10.3 of type management.cattle.io/v3, Resource=catalogtemplateversions
INFO[2022/06/06 09:40:05] restoreResource: Namespace p-vjqd8 for name bitnami-kube-prometheus-6.10.3 of type management.cattle.io/v3, Resource=catalogtemplateversions
INFO[2022/06/06 09:40:05] Successfully restored bitnami-kube-prometheus-6.10.3
INFO[2022/06/06 09:40:05] restoreResource: Restoring bitnami-redis-cluster-4.2.1 of type management.cattle.io/v3, Resource=catalogtemplateversions
INFO[2022/06/06 09:40:05] restoreResource: Namespace p-vjqd8 for name bitnami-redis-cluster-4.2.1 of type management.cattle.io/v3, Resource=catalogtemplateversions
INFO[2022/06/06 09:40:05] Successfully restored bitnami-redis-cluster-4.2.1
INFO[2022/06/06 09:40:05] restoreResource: Restoring bitnami-v2-apache-9.0.10 of type management.cattle.io/v3, Resource=catalogtemplateversions
INFO[2022/06/06 09:40:05] restoreResource: Namespace p-vjqd8 for name bitnami-v2-apache-9.0.10 of type management.cattle.io/v3, Resource=catalogtemplateversions
INFO[2022/06/06 09:40:05] Successfully restored bitnami-v2-apache-9.0.10
INFO[2022/06/06 09:40:05] restoreResource: Restoring bitnami-elasticsearch-17.9.4 of type management.cattle.io/v3, Resource=catalogtemplateversions
INFO[2022/06/06 09:40:05] restoreResource: Namespace p-vjqd8 for name bitnami-elasticsearch-17.9.4 of type management.cattle.io/v3, Resource=catalogtemplateversions
INFO[2022/06/06 09:40:05] Successfully restored bitnami-elasticsearch-17.9.4
INFO[2022/06/06 09:40:05] restoreResource: Restoring bitnami-v2-ghost-16.2.5 of type management.cattle.io/v3, Resource=catalogtemplateversions
INFO[2022/06/06 09:40:05] restoreResource: Namespace p-vjqd8 for name bitnami-v2-ghost-16.2.5 of type management.cattle.io/v3, Resource=catalogtemplateversions
INFO[2022/06/06 09:40:05] Successfully restored bitnami-v2-ghost-16.2.5
INFO[2022/06/06 09:40:05] restoreResource: Restoring bitnami-kibana-8.1.0 of type management.cattle.io/v3, Resource=catalogtemplateversions
INFO[2022/06/06 09:40:05] restoreResource: Namespace p-vjqd8 for name bitnami-kibana-8.1.0 of type management.cattle.io/v3, Resource=catalogtemplateversions
INFO[2022/06/06 09:40:05] Successfully restored bitnami-kibana-8.1.0
INFO[2022/06/06 09:40:05] restoreResource: Restoring bitnami-magento-19.2.0 of type management.cattle.io/v3, Resource=catalogtemplateversions
INFO[2022/06/06 09:40:05] restoreResource: Namespace p-vjqd8 for name bitnami-magento-19.2.0 of type management.cattle.io/v3, Resource=catalogtemplateversions
INFO[2022/06/06 09:40:05] Successfully restored bitnami-magento-19.2.0
INFO[2022/06/06 09:40:05] restoreResource: Restoring bitnami-odoo-19.0.4 of type management.cattle.io/v3, Resource=catalogtemplateversions
INFO[2022/06/06 09:40:05] restoreResource: Namespace p-vjqd8 for name bitnami-odoo-19.0.4 of type management.cattle.io/v3, Resource=catalogtemplateversions
INFO[2022/06/06 09:40:05] Successfully restored bitnami-odoo-19.0.4
INFO[2022/06/06 09:40:05] restoreResource: Restoring bitnami-v2-nats-6.2.5 of type management.cattle.io/v3, Resource=catalogtemplateversions
INFO[2022/06/06 09:40:05] restoreResource: Namespace p-vjqd8 for name bitnami-v2-nats-6.2.5 of type management.cattle.io/v3, Resource=catalogtemplateversions
INFO[2022/06/06 09:40:05] Successfully restored bitnami-v2-nats-6.2.5
INFO[2022/06/06 09:40:05] restoreResource: Restoring bitnami-external-dns-6.1.7 of type management.cattle.io/v3, Resource=catalogtemplateversions
INFO[2022/06/06 09:40:05] restoreResource: Namespace p-vjqd8 for name bitnami-external-dns-6.1.7 of type management.cattle.io/v3, Resource=catalogtemplateversions
INFO[2022/06/06 09:40:06] Successfully restored bitnami-external-dns-6.1.7
INFO[2022/06/06 09:40:06] restoreResource: Restoring bitnami-v2-memcached-5.11.0 of type management.cattle.io/v3, Resource=catalogtemplateversions
INFO[2022/06/06 09:40:06] restoreResource: Namespace p-vjqd8 for name bitnami-v2-memcached-5.11.0 of type management.cattle.io/v3, Resource=catalogtemplateversions
INFO[2022/06/06 09:40:06] Successfully restored bitnami-v2-memcached-5.11.0
INFO[2022/06/06 09:40:06] restoreResource: Restoring bitnami-v2-mongodb-sharded-3.2.7 of type management.cattle.io/v3, Resource=catalogtemplateversions
INFO[2022/06/06 09:40:06] restoreResource: Namespace p-vjqd8 for name bitnami-v2-mongodb-sharded-3.2.7 of type management.cattle.io/v3, Resource=catalogtemplateversions
INFO[2022/06/06 09:40:06] Successfully restored bitnami-v2-mongodb-sharded-3.2.7
INFO[2022/06/06 09:40:06] restoreResource: Restoring jhub-binderhub-0.2.0-n554.h4b41dc2 of type management.cattle.io/v3, Resource=catalogtemplateversions
INFO[2022/06/06 09:40:06] restoreResource: Namespace c-pbmqs for name jhub-binderhub-0.2.0-n554.h4b41dc2 of type management.cattle.io/v3, Resource=catalogtemplateversions
INFO[2022/06/06 09:40:06] Successfully restored jhub-binderhub-0.2.0-n554.h4b41dc2

It feels like it is trying to restore the whole Bitnami Helm chart repository...

I already disabled the legacy features in the original Rancher, but the catalogtemplateversions are still in the backup.
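
For context, counting them on the source cluster shows how much the operator has to work through; something like this should do it (assuming kubectl access to the original cluster):

kubectl get catalogtemplateversions.management.cattle.io -A --no-headers | wc -l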

It's hard to do a "disaster recovery" if it takes a week for a very small Rancher system :D

superseb commented 2 years ago

Is it actually running a single restore attempt for that long? It's a controller that retries on failure, so if it fails halfway through, it will just restart after a back-off. We'll need the failure, or else the full log, to diagnose what is going on. Please also share the installed chart version (if possible, both the version that was used to create the backup and the version that is used to restore it).
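
If the operator was installed via Helm (the default, into cattle-resources-system), something like this should show the installed chart version:

helm list -n cattle-resources-system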

JannikZed commented 2 years ago

I'm trying to debug it further. As both clusters are up and running, I can test it easily. Backup-side chart version: 2.1.2; restore-side chart version: 2.1.2.

Some more details on the restore:

kubectl get restore restore-migration -o yaml

apiVersion: resources.cattle.io/v1
kind: Restore
metadata:
  annotations:
  creationTimestamp: "2022-06-03T09:05:01Z"
  generation: 1
  name: restore-migration
  resourceVersion: "228628"
  uid: 2f5b0a52-4d8b-452d-96c5-3c9c9104484a
spec:
  backupFilename: nightly-c8797ff5-6394-11e8-9072-0242ac110002-2022-06-03T00-00-00Z.tar.gz
  prune: false
  storageLocation:
    s3:
      bucketName: XXXX
      credentialSecretName: s3-creds
      credentialSecretNamespace: default
      endpoint: storage.googleapis.com
      region: null
status:
  backupSource: ""
  conditions:
  - lastUpdateTime: "2022-06-03T10:12:46Z"
    message: error restoring namespaced resources, check logs for exact error
    reason: Error
    status: "False"
    type: Reconciling
  - lastUpdateTime: "2022-06-03T10:12:46Z"
    message: Retrying
    status: Unknown
    type: Ready
  observedGeneration: 0
  restoreCompletionTs: ""
  summary: ""

So there seems to be an issue, but with that large amount of logs it is hard to find the root cause. Is there a way to get the logs related to this error? With this little input, I can't find out which namespaced resources failed exactly.

superseb commented 2 years ago

Yes, we have reduced the logging in the upcoming version for exactly this reason. For now you can grep for error or fail:

kubectl  -n cattle-resources-system logs -l app.kubernetes.io/name=rancher-backup --tail=-1 | egrep -i "error|fail"
JannikZed commented 2 years ago

I can find these two errors:

kubectl  -n cattle-resources-system logs -l app.kubernetes.io/name=rancher-backup --tail=-1 | egrep -i "error|fail"
INFO[2022/06/06 13:11:38] restoreResource: Restoring cluster-scan-manual-failure-only of type management.cattle.io/v3, Resource=clusteralertrules
INFO[2022/06/06 13:11:38] restoreResource: Namespace c-pbmqs for name cluster-scan-manual-failure-only of type management.cattle.io/v3, Resource=clusteralertrules
INFO[2022/06/06 13:11:38] Successfully restored cluster-scan-manual-failure-only
ERRO[2022/06/06 13:12:21] Error restoring resource fleet-agent-hetzner-eu-nbg of type fleet.cattle.io/v1alpha1, Resource=bundledeployments: restoreResource: err creating resource bundledeployments.fleet.cattle.io "fleet-agent-hetzner-eu-nbg" is forbidden: unable to create new content in namespace cluster-fleet-default-hetzner-eu-nbg-30430c2cbc38 because it is being terminated
INFO[2022/06/06 13:12:27] restoreResource: Restoring cluster-scan-scheduled--failure-only of type management.cattle.io/v3, Resource=clusteralertrules
INFO[2022/06/06 13:12:27] restoreResource: Namespace local for name cluster-scan-scheduled--failure-only of type management.cattle.io/v3, Resource=clusteralertrules
INFO[2022/06/06 13:12:27] Successfully restored cluster-scan-scheduled--failure-only
ERRO[2022/06/06 13:13:24] Error restoring resource request-vtcpq-66e4ec45-556d-4352-a61f-4261f6a52af2 of type /v1, Resource=serviceaccounts: restoreResource: err creating resource serviceaccounts "request-vtcpq-66e4ec45-556d-4352-a61f-4261f6a52af2" is forbidden: unable to create new content in namespace cluster-fleet-default-hetzner-eu-nbg-30430c2cbc38 because it is being terminated

I don't know what this namespace is used for. I could delete it in the origin cluster, back up again, and check whether the restore runs through then.
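
If it helps, one way to check why that namespace is stuck terminating (assuming it still exists on the cluster in question) is to look at its conditions:

kubectl get namespace cluster-fleet-default-hetzner-eu-nbg-30430c2cbc38 -o jsonpath='{.status.conditions}'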

JannikZed commented 2 years ago

After cleaning up the two problematic namespaces, I'm now facing these errors:

WARN[2022/06/06 20:42:59] Error getting object for controllerRef rancher, skipping it
ERRO[2022/06/06 20:42:59] Error restoring namespaced resources [error restoring cert-manager of type project.cattle.io/v3, Resource=apps: restoreResource: err updating resource App.project.cattle.io "cert-manager" is invalid: metadata.deletionGracePeriodSeconds: Invalid value: 0: field is immutable error restoring fleet-agent-local of type fleet.cattle.io/v1alpha1, Resource=bundledeployments: restoreResource: err updating status resource Operation cannot be fulfilled on bundledeployments.fleet.cattle.io "fleet-agent-local": the object has been modified; please apply your changes to the latest version and try again error restoring rocketchat of type project.cattle.io/v3, Resource=apps: restoreResource: err updating resource App.project.cattle.io "rocketchat" is invalid: metadata.deletionGracePeriodSeconds: Invalid value: 0: field is immutable]
ERRO[2022/06/06 20:42:59] error syncing 'restore-migration': handler restore: error restoring namespaced resources, check logs for exact error, requeuing
WARN[2022/06/06 20:43:01] Error getting object for controllerRef rancher, skipping it

Especially "rocketchat" is interesting - never used that before ..

superseb commented 2 years ago

That looks similar to the error in https://github.com/rancher/backup-restore-operator/issues/188. Can you share the resource JSON from the problematic namespaces here so I can check what the problem is with those?

JannikZed commented 2 years ago

I can see that on the upstream server they are stuck in the "removing" state (screenshot from 2022-06-07 attached).
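
Something like this should list every App that is stuck in deletion like that (assuming jq is available):

kubectl get apps.project.cattle.io -A -o json | jq -r '.items[] | select(.metadata.deletionTimestamp != null) | "\(.metadata.namespace)/\(.metadata.name)"'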

Here is the corresponding YAML definition of the cert-manager App:

apiVersion: project.cattle.io/v3
kind: App
metadata:
  annotations:
    field.cattle.io/creatorId: user-4qxbk
    lifecycle.cattle.io/create.helm-controller_c-t6lrq: "true"
  creationTimestamp: "2018-11-22T11:39:43Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2019-08-25T18:51:22Z"
  finalizers:
  - clusterscoped.controller.cattle.io/helm-controller_c-t6lrq
  generation: 2
  labels:
    cattle.io/creator: norman
  name: cert-manager
  namespace: p-9hlht
  resourceVersion: "55421992"
  uid: 4e1e5699-ee4b-11e8-86b4-0242ac110002
spec:
  appRevisionName: apprevision-x5sxt
  externalId: catalog://?catalog=helm&template=cert-manager&version=v0.5.2
  projectName: c-t6lrq:p-9hlht
  targetNamespace: cert-manager
status:
  conditions:
  - lastUpdateTime: "2018-11-22T11:39:43Z"
    status: "True"
    type: Migrated
  - lastUpdateTime: "2018-11-22T11:39:45Z"
    status: "True"
    type: Installed
  notes: |+
    ---
    # Source: cert-manager/templates/NOTES.txt
    cert-manager has been deployed successfully!

    In order to begin issuing certificates, you will need to set up a ClusterIssuer
    or Issuer resource (for example, by creating a 'letsencrypt-staging' issuer).

    More information on the different types of issuers and how to configure them
    can be found in our documentation:

    https://cert-manager.readthedocs.io/en/latest/reference/issuers.html

    For information on how to configure cert-manager to automatically provision
    Certificates for Ingress resources, take a look at the `ingress-shim`
    documentation:

    https://cert-manager.readthedocs.io/en/latest/reference/ingress-shim.html

So yes, it seems to be related to #188

JannikZed commented 2 years ago

After removing these two entities (by removing the finalizers), backing up again, and restoring again, everything finished now!
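
For reference, stripping the finalizers can be done with a patch along these lines (a sketch using the cert-manager App from the YAML above; the rocketchat App needs the same treatment in its own project namespace):

kubectl patch apps.project.cattle.io cert-manager -n p-9hlht --type=json -p='[{"op": "remove", "path": "/metadata/finalizers"}]'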

superseb commented 2 years ago

Thanks, I will merge this into https://github.com/rancher/backup-restore-operator/issues/188 and see if I can work on it this week.