rancher / backup-restore-operator


Restoring a backup from local file #237

Closed rpelissi closed 2 years ago

rpelissi commented 2 years ago

Hi, so I am trying to understand how to do a restore in Rancher and tried to follow the documentation found here: https://rancher.com/docs/rancher/v2.6/en/backups/migrating-rancher/. The problem is that the documentation gives examples with S3/MinIO but not with a local path, so I am a bit lost about what to do, and I have to admit that I am just learning... So, I have my backup file on one of the nodes, and I am trying to create this file:

[root@node-1 ~]# cat create-deflocation-restore.yaml
apiVersion: resources.cattle.io/v1
kind: Restore
metadata:
  name: restore-pvc-demo
spec:
  backupFilename: daily-4a197c6b-2cff-4dae-bc12-c75a4c72c5f1-2022-05-22T00-00-00Z.tar.gz

So the daily-4a197c6b-2cff-4dae-bc12-c75a4c72c5f1-2022-05-22T00-00-00Z.tar.gz is my backup file. Of course it is not working:

[root@node-1 ~]# kubectl describe restore restore-pvc-demo
Name:         restore-pvc-demo
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  resources.cattle.io/v1
Kind:         Restore
Metadata:
  Creation Timestamp:  2022-05-29T00:46:29Z
  Generation:          1
  Managed Fields:
    API Version:  resources.cattle.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        f:prune:
        f:storageLocation:
      f:status:
        .:
        f:backupSource:
        f:conditions:
        f:observedGeneration:
        f:restoreCompletionTs:
        f:summary:
    Manager:      backup-restore-operator
    Operation:    Update
    Time:         2022-05-29T00:46:29Z
    API Version:  resources.cattle.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
      f:spec:
        .:
        f:backupFilename:
    Manager:         kubectl-client-side-apply
    Operation:       Update
    Time:            2022-05-29T00:46:29Z
  Resource Version:  277308
  UID:               14a841f1-0c68-40eb-82f9-28db9fe478a0
Spec:
  Backup Filename:  daily-4a197c6b-2cff-4dae-bc12-c75a4c72c5f1-2022-05-22T00-00-00Z.tar.gz
Status:
  Backup Source:
  Conditions:
    Last Update Time:     2022-05-29T00:46:29Z
    Message:              Backup location not specified on the restore CR, and not configured at the operator level
    Reason:               Error
    Status:               False
    Type:                 Reconciling
    Last Update Time:     2022-05-29T00:46:29Z
    Message:              Retrying
    Status:               Unknown
    Type:                 Ready
  Observed Generation:    0
  Restore Completion Ts:
  Summary:
Events:                   <none>
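
For context on the condition above: the message refers to the two places the operator can learn where backups live, either a storageLocation block on the Restore itself (which, going by the docs examples, is S3-only) or storage configured at the operator level when the chart is installed. As a rough sketch, the S3 form looks something like this (placeholder values, field names as in the docs' S3 examples); for a backup file that sits on a node, it is the operator-level route that applies:

apiVersion: resources.cattle.io/v1
kind: Restore
metadata:
  name: restore-from-s3-example
spec:
  backupFilename: daily-4a197c6b-2cff-4dae-bc12-c75a4c72c5f1-2022-05-22T00-00-00Z.tar.gz
  storageLocation:
    s3:
      credentialSecretName: s3-creds           # placeholder
      credentialSecretNamespace: default       # placeholder
      bucketName: rancher-backups              # placeholder
      region: us-east-1                        # placeholder
      endpoint: s3.us-east-1.amazonaws.com     # placeholder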

So I think I need to configure the volume/claim at the operator level (not sure) so the restore job knows how to connect to it and get the file. But I have no real clue how to do this... I guess the steps are:

Can you assist me with those steps please? I am pretty sure this will help other users and also make the documentation even more useful.

Thanks!

rpelissi commented 2 years ago

So here is where I am after my investigation. We have these storage classes available:

[root@node-1 ~]# kubectl get storageclasses.storage.k8s.io
NAME                   PROVISIONER                    RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
local-path (default)   rancher.io/local-path          Delete          WaitForFirstConsumer   false                  24h
longhorn (default)     driver.longhorn.io             Delete          Immediate              true                   20h

I am using Longhorn, and local-path could be good, but I was not sure how to use it, so I created a local StorageClass like this:

[root@node-1 ~]# cat local-storageclass.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: Immediate

So

[root@node-1 ~]# kubectl get storageclasses.storage.k8s.io
NAME                   PROVISIONER                    RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
local-path (default)   rancher.io/local-path          Delete          WaitForFirstConsumer   false                  24h
local-storage          kubernetes.io/no-provisioner   Delete          Immediate              false                  63m
longhorn (default)     driver.longhorn.io             Delete          Immediate              true                   20h

Now let's create the volume and the volume claim

[root@node-1 ~]# cat pv-volume.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: task-pv-volume
  labels:
    type: local
spec:
  storageClassName: local-storage
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/backup"
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - node1

I have set a node affinity so I can place the mount where I want (this seems logical to me, but it could be totally wrong and may in fact lead to issues later).

[root@node-1 ~]# cat pv-claim.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: task-pv-claim
spec:
  storageClassName: local-storage
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 3Gi

So:

[root@node-1 ~]# kubectl get persistentvolumes
NAME             CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                                      STORAGECLASS    REASON   AGE
backup           10Gi       RWO            Retain           Released   cattle-resources-system/rancher-backup-1                            18h
task-pv-volume   10Gi       RWO            Retain           Bound      default/task-pv-claim                      local-storage            62m
[root@node-1 ~]# kubectl get persistentvolumeclaims
NAME            STATUS   VOLUME           CAPACITY   ACCESS MODES   STORAGECLASS    AGE
task-pv-claim   Bound    task-pv-volume   10Gi       RWO            local-storage   62m
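
One thing to keep in mind with a hostPath PV pinned to a single node: whichever pod ends up mounting the resulting claim has to be scheduled on that same node. A quick sanity check (the operator lives in cattle-resources-system) would be something like:

kubectl -n cattle-resources-system get pods -o wide
kubectl get pv task-pv-volume -o jsonpath='{.spec.nodeAffinity}'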

After reading a little more about the backup restore operator, it seems that I have to define the volume location when I install the operator (again, I could be wrong...). I want to use this option because my backup files are local, so I read this: https://rancher.com/docs/rancher/v2.6/en/backups/configuration/storage-config/#existing-persistent-volume

It seems that I have to deploy the backup restore operator with custom values. Fine. I took the template found on the web page above and set volumeName to task-pv-claim.

Then execute: [root@node-1 ~]# helm install rancher-backup rancher-charts/rancher-backup -n cattle-resources-system --version 2.1.2 -f values.yaml

but the pod fails with:

  Events:
  Type     Reason     Age               From               Message
  ----     ------     ----              ----               -------
  Normal   Scheduled  17s               default-scheduler  Successfully assigned cattle-resources-system/rancher-backup-cb4f7564d-w4rw7 to worker-3
  Warning  Failed     16s               kubelet            Failed to pull image "rancher/rancher-backup:v2.1.2": rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/rancher/rancher-backup:v2.1.2": failed to resolve reference "docker.io/rancher/rancher-backup:v2.1.2": pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed
  Warning  Failed     16s               kubelet            Error: ErrImagePull
  Normal   BackOff    15s               kubelet            Back-off pulling image "rancher/rancher-backup:v2.1.2"
  Warning  Failed     15s               kubelet            Error: ImagePullBackOff
  Normal   Pulling    0s (x2 over 16s)  kubelet            Pulling image "rancher/rancher-backup:v2.1.2"

I found that the template on the web page could be wrong. Instead of:

image:
  repository: rancher/rancher-backup
  tag: v0.0.1-rc10

I use:

image:
  repository: rancher/backup-restore-operator

and now the pod is working correctly.
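
In case it helps anyone hitting the same pull error: rather than guessing the image values, you can dump the chart's defaults (assuming the rancher-charts repo is already added, as in the install commands used here):

helm repo update
helm show values rancher-charts/rancher-backup --version 2.1.2 | grep -A 3 '^image:'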

So, I recreated the restore job but still got an issue:

[root@node-1 ~]# kubectl get Restore
NAME               BACKUP-SOURCE   BACKUP-FILE                                                              AGE   STATUS
restore-pvc-demo                   daily-4a197c6b-2cff-4dae-bc12-c75a4c72c5f1-2022-05-22T00-00-00Z.tar.gz   49m   Retrying
[root@node-1 ~]# kubectl describe Restore
..
Spec:
  Backup Filename:  daily-4a197c6b-2cff-4dae-bc12-c75a4c72c5f1-2022-05-22T00-00-00Z.tar.gz
Status:
  Backup Source:
  Conditions:
    Last Update Time:     2022-05-29T15:05:25Z
    Message:              Backup location not specified on the restore CR, and not configured at the operator level

So at that point I am not sure what I have done wrong... Maybe it is because the backup restore pod is running on another worker node...

[root@node-1 ~]# kubectl describe pod/rancher-backup-74779d9dfd-fdndh -n cattle-resources-system
Name:         rancher-backup-74779d9dfd-fdndh
Namespace:    cattle-resources-system
Priority:     0
Node:         worker-3/192.168.2.105
Start Time:   Sun, 29 May 2022 11:13:05 -0400
Labels:       app.kubernetes.io/instance=rancher-backup
              app.kubernetes.io/name=rancher-backup
              pod-template-hash=74779d9dfd
              resources.cattle.io/operator=backup-restore
Annotations:  checksum/pvc: 01ba4719c80b6fe911b091a7c05124b64eeece964e09c058ef8f9805daca546b
              checksum/s3: 01ba4719c80b6fe911b091a7c05124b64eeece964e09c058ef8f9805daca546b
Status:       Running
IP:           10.42.5.53
IPs:
  IP:           10.42.5.53
Controlled By:  ReplicaSet/rancher-backup-74779d9dfd
Containers:
  rancher-backup:
    Container ID:   containerd://4404353cda78995dcb2aeef1c8b75d623cd5a99c136db07659edb1242b70a4fe
    Image:          rancher/backup-restore-operator:v2.1.2
    Image ID:       docker.io/rancher/backup-restore-operator@sha256:acbb9ae36580b53ec87a953a18a98e0b0bc0bcefe2100850dee7c66f8a978169
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Sun, 29 May 2022 11:13:06 -0400
    Ready:          True
    Restart Count:  0
    Environment:
      CHART_NAMESPACE:  cattle-resources-system
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from rancher-backup-token-8f7p9 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  rancher-backup-token-8f7p9:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  rancher-backup-token-8f7p9
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  kubernetes.io/os=linux
Tolerations:     cattle.io/os=linux:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  34m   default-scheduler  Successfully assigned cattle-resources-system/rancher-backup-74779d9dfd-fdndh to worker-3
  Normal  Pulling    34m   kubelet            Pulling image "rancher/backup-restore-operator:v2.1.2"
  Normal  Pulled     34m   kubelet            Successfully pulled image "rancher/backup-restore-operator:v2.1.2" in 326.073136ms
  Normal  Created    34m   kubelet            Created container rancher-backup
  Normal  Started    34m   kubelet            Started container rancher-backup

I'm not sure... I will continue to investigate, but any help would be more than welcome.

rpelissi commented 2 years ago

OK, I may have made a mistake in the custom values for my volume. I now use:

image:                                                                                                                                             
  repository: rancher/rancher-backup                                                                                                               
  #tag: v0.0.1-rc10                                                                                                                                
  #tag: latest                                                                                                                                     
  #tag: v2.1.2                                                                                                                                     

## Default s3 bucket for storing all backup files created by the rancher-backup operator                                                           
s3:                                                                                                                                                
  enabled: false                                                                                                                                   
  ## credentialSecretName if set, should be the name of the Secret containing AWS credentials.                                                     
  ## To use IAM Role, don't set this field                                                                                                         
  credentialSecretName: creds                                                                                                                      
  credentialSecretNamespace: ""                                                                                                                    
  region: us-west-2                                                                                                                                
  bucketName: rancherbackups                                                                                                                       
  folder: base folder                                                                                                                              
  endpoint: s3.us-west-2.amazonaws.com                                                                                                             
  endpointCA: base64 encoded CA cert                                                                                                               
  # insecureTLSSkipVerify: optional                                                                                                                

## ref: http://kubernetes.io/docs/user-guide/persistent-volumes/                                                                                   
## If persistence is enabled, operator will create a PVC with mountPath /var/lib/backups                                                           
persistence:                                                                                                                                       
  enabled: false                                                                                                                                   

  ## If defined, storageClassName: <storageClass>                                                                                                  
  ## If set to "-", storageClassName: "", which disables dynamic provisioning                                                                      
  ## If undefined (the default) or set to null, no storageClassName spec is                                                                        
  ##   set, choosing the default provisioner.  (gp2 on AWS, standard on                                                                            
  ##   GKE, AWS & OpenStack).                                                                                                                      
  ## Refer to https://kubernetes.io/docs/concepts/storage/persistent-volumes/#class-1
  ##                                                                                                                                               
  storageClass: "-"                                                                                                                                

  ## If you want to disable dynamic provisioning by setting storageClass to "-" above,                                                             
  ## and want to target a particular PV, provide name of the target volume                                                                         
  volumeName: "task-pv-claim"                                                                                                                      

  ## Only certain StorageClasses allow resizing PVs; Refer to https://kubernetes.io/blog/2018/07/12/resizing-persistent-volumes-using-kubernetes/  
  size: 2Gi                                                                                                                                        

global:                                                                                                                                            
  cattle:                                                                                                                                          
    systemDefaultRegistry: ""                                                                                                                      

nodeSelector: {}                                                                                                                                   

tolerations: []                                                                                                                                    

affinity: {}                                                                                                                                       

and then

helm install rancher-backup-crd rancher-charts/rancher-backup-crd -n cattle-resources-system --create-namespace --version 2.1.2 -f values.yaml
helm install rancher-backup rancher-charts/rancher-backup -n cattle-resources-system --version 2.1.2
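
Two details worth flagging in the configuration above: persistence is still enabled: false, and -f values.yaml is only passed to the CRD chart, not to rancher-backup itself. A sketch of a persistence block that points the chart at the PV created earlier might look like this (going by the chart's own comments, volumeName should be the PersistentVolume name and storageClass should match the PV's storageClassName):

persistence:
  enabled: true
  storageClass: "local-storage"   # must match the PV's storageClassName
  volumeName: "task-pv-volume"    # the PersistentVolume name, not the PVC name
  size: 2Gi

and the values would also need to be applied to the operator chart itself, e.g. helm upgrade --install rancher-backup rancher-charts/rancher-backup -n cattle-resources-system --version 2.1.2 -f values.yaml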

Still got this:

[root@node-1 ~]# kubectl describe Restore
Name:         restore-pvc-demo
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  resources.cattle.io/v1
Kind:         Restore
Metadata:
  Creation Timestamp:  2022-05-29T16:20:46Z
  Generation:          1
  Managed Fields:
    API Version:  resources.cattle.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        f:prune:
        f:storageLocation:
      f:status:
        .:
        f:backupSource:
        f:conditions:
        f:observedGeneration:
        f:restoreCompletionTs:
        f:summary:
    Manager:      backup-restore-operator
    Operation:    Update
    Time:         2022-05-29T16:20:46Z
    API Version:  resources.cattle.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
      f:spec:
        .:
        f:backupFilename:
    Manager:         kubectl-client-side-apply
    Operation:       Update
    Time:            2022-05-29T16:20:46Z
  Resource Version:  654441
  UID:               4095284c-38e7-45b6-b7f0-c18d0e6bcf44
Spec:
  Backup Filename:  daily-4a197c6b-2cff-4dae-bc12-c75a4c72c5f1-2022-05-22T00-00-00Z.tar.gz
Status:
  Backup Source:
  Conditions:
    Last Update Time:     2022-05-29T16:20:46Z
    Message:              Backup location not specified on the restore CR, and not configured at the operator level
    Reason:               Error
    Status:               False
    Type:                 Reconciling
    Last Update Time:     2022-05-29T16:20:46Z
    Message:              Retrying
    Status:               Unknown
    Type:                 Ready
  Observed Generation:    0
  Restore Completion Ts:
  Summary:
Events:                   <none>

rpelissi commented 2 years ago

So maybe I am wrong... maybe a custom section in https://rancher.com/docs/rancher/v2.6/en/backups/configuration/storage-config/#example-values-yaml-for-the-rancher-backup-helm-chart is needed. In that example the storage is S3, but I need local storage instead, and I have no idea how to define this in the yaml file...

superseb commented 2 years ago

This definitely needs some clarification as everything is mostly focused on S3. Here are some quick steps that I used while I'm working on improving this:

I tested this on k3s + local-path storageclass.

rpelissi commented 2 years ago

Hi, thanks for the useful info, much appreciated :) I tried to work on it yesterday but have not yet been able to make it work. I will make another attempt today for sure.

rpelissi commented 2 years ago

Hello! I'm back. So this is the current status. I was able to create a volume and have the rancher backup operator see the backup file:

[root@node-1 ~]# cat pv-volume-rancher.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: task-pv-volume
  labels:
    type: local
spec:
  storageClassName: local-path
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/backup"
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - worker-1
[root@node-1 ~]# kubectl get pv
NAME             CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                                      STORAGECLASS   REASON   AGE
backup           10Gi       RWO            Retain           Released   cattle-resources-system/rancher-backup-1                           3d21h
task-pv-volume   10Gi       RWO            Retain           Bound      cattle-resources-system/rancher-backup-1   local-path              5h36m

I see the rancher backup pod running and the claim is bound:

[root@node-1 ~]# kubectl get pvc -A
NAMESPACE                 NAME               STATUS   VOLUME           CAPACITY   ACCESS MODES   STORAGECLASS   AGE
cattle-resources-system   rancher-backup-1   Bound    task-pv-volume   10Gi       RWO            local-path     5h36m

I can see the file at the location/worker node I selected:

[root@node-1 ~]# kubectl -n cattle-resources-system exec deploy/rancher-backup -- ls /var/lib/backups
daily-4a197c6b-2cff-4dae-bc12-c75a4c72c5f1-2022-05-22T00-00-00Z.tar.gz

Now, I think the next step is to create the Restore custom resource, but... in the example given here: https://rancher.com/docs/rancher/v2.6/en/backups/migrating-rancher/ the storageLocation is set to s3. In my case, where the backup is mounted from worker-1:/backup, what am I supposed to set in this file?

Thanks again for your patience and help.

rpelissi commented 2 years ago

Ok found it!

[root@node-1 ~]# cat migrationResource.yaml
# migrationResource.yaml
apiVersion: resources.cattle.io/v1
kind: Restore
metadata:
  name: restore-migration
spec:
  backupFilename: daily-4a197c6b-2cff-4dae-bc12-c75a4c72c5f1-2022-05-22T00-00-00Z.tar.gz
  prune: false

And then [root@node-1 ~]# kubectl apply -f migrationResource.yaml

After checking the logs using: kubectl logs -n cattle-resources-system --tail 100 -f rancher-backup-xxxxxx

I see this:

INFO[2022/06/01 18:53:56] restoreResource: Restoring library-nfs-provisioner-0.1.2 of type management.cattle.io/v3, Resource=catalogtemplateversions
INFO[2022/06/01 18:53:56] restoreResource: Namespace cattle-global-data for name library-nfs-provisioner-0.1.2 of type management.cattle.io/v3, Resource=catalogtemplateversions
INFO[2022/06/01 18:53:56] Getting new UID for library-nfs-provisioner
INFO[2022/06/01 18:53:56] restoreResource: Restoring library-nfs-provisioner-0.2.2 of type management.cattle.io/v3, Resource=catalogtemplateversions
INFO[2022/06/01 18:53:56] restoreResource: Namespace cattle-global-data for name library-nfs-provisioner-0.2.2 of type management.cattle.io/v3, Resource=catalogtemplateversions
INFO[2022/06/01 18:53:56] Getting new UID for library-nfs-provisioner
INFO[2022/06/01 18:53:56] restoreResource: Restoring library-prometheus-9.1.0 of type management.cattle.io/v3, Resource=catalogtemplateversions
INFO[2022/06/01 18:53:56] restoreResource: Namespace cattle-global-data for name library-prometheus-9.1.0 of type management.cattle.io/v3, Resource=catalogtemplateversions
INFO[2022/06/01 18:53:56] Getting new UID for library-prometheus
INFO[2022/06/01 18:53:56] restoreResource: Restoring library-prometheus-6.2.1 of type management.cattle.io/v3, Resource=catalogtemplateversions
INFO[2022/06/01 18:53:56] restoreResource: Namespace cattle-global-data for name library-prometheus-6.2.1 of type management.cattle.io/v3, Resource=catalogtemplateversions
INFO[2022/06/01 18:53:56] Getting new UID for library-prometheus
INFO[2022/06/01 18:53:56] Processing controllerRef apps/v1/deployments/rancher
WARN[2022/06/01 18:53:56] Error getting object for controllerRef rancher, skipping it
INFO[2022/06/01 18:53:57] Done restoring

So far so good; time to move on to the next steps.

rpelissi commented 2 years ago

So, I have followed the steps in https://rancher.com/docs/rancher/v2.6/en/backups/migrating-rancher/

But... either I am doing something wrong, or I did not create my backup correctly the first time, because even though I can access Rancher now, all my deployments are gone... That's pretty weird. I will try to dig into my old backups and see if it is the same situation after a restore... or maybe I have done something wrong?

superseb commented 2 years ago

I guess the expectation might be wrong here: by default, it backs up and restores Rancher, not everything. The default set of resources that is backed up can be found here: https://github.com/rancher/backup-restore-operator/tree/master/charts/rancher-backup/files/default-resourceset-contents

If there are resources that match this selection and are not backed up, please share what exactly you are missing.

rpelissi commented 2 years ago

Oh. So the backup app in Rancher does not back up workload definitions, custom storage definitions, that kind of thing? That makes more sense now, even if I am a little surprised, to be honest :)

So I guess the ticket can be closed since I now have the process to restore Rancher from the backup file. I have 2 concerns/comments:

superseb commented 2 years ago

Can you share where you would like to have more information added? On https://rancher.com/docs/rancher/v2.6/en/backups/, it says:

The rancher-backup operator is used to backup and restore Rancher on any Kubernetes cluster. This application is a Helm chart, and it can be deployed through the Rancher Apps & Marketplace page, or by using the Helm CLI. The rancher-backup Helm chart is [here.](https://github.com/rancher/charts/tree/release-v2.6/charts/rancher-backup)

The backup-restore operator needs to be installed in the local cluster, and only backs up the Rancher app. The backup and restore operations are performed only in the local Kubernetes cluster.

Regarding backing up other resources, there are quite a few ways to do this, all with different approaches and strategies. The operator was created to back up/restore Rancher and to migrate Rancher to a different set of nodes. One approach would be to keep all the other resources you want to deploy in any automation of your choice (Ansible/Terraform, etc.) and deploy them after the restore of Rancher has finished. This would be the recommended path.

If you really want to back up other resources, you could add your own resourcesets and specify what needs to be included in the backup (currently this is scoped to Rancher only, which is why you don't see any non-Rancher resources). I'd have to check how this is done through the current Helm chart.
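
As a rough illustration of that idea: a resourceset is itself a custom resource, so an additional one could look something like the sketch below (selector fields modeled on the default resourceset linked above; treat it as illustrative rather than a tested configuration), and a Backup CR can then point at it via spec.resourceSetName:

apiVersion: resources.cattle.io/v1
kind: ResourceSet
metadata:
  name: my-workloads-resourceset
resourceSelectors:
  # Deployments in one application namespace (illustrative)
  - apiVersion: "apps/v1"
    kindsRegexp: "^deployments$"
    namespaces:
      - "my-app-namespace"
  # a custom StorageClass picked up by name (illustrative)
  - apiVersion: "storage.k8s.io/v1"
    kindsRegexp: "^storageclasses$"
    resourceNameRegexp: "^local-storage$"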

superseb commented 2 years ago

@rpelissi Let me know if you need anything else on this.

rpelissi commented 2 years ago

Hi! Sorry, sorry, I have been busy with other stuff :) In fact I am disappointed not because of the tool but because I did not read the documentation correctly; it does in fact mention that the resources taken in the backup do not contain workload definitions, for example :) So it's entirely my fault! Now, I am still not sure about a DRP for that. I mean, we have this:

Let's say that we lost all our Rancher infra but we still have the backup files for the 2 components listed above; can I:

That's my question. Also, if we could have an example of how to add custom resources to the rancher backup, that would be cool too! :)

Thanks!

github-actions[bot] commented 2 years ago

This repository uses an automated workflow to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the workflow can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the workflow will automatically close the issue in 14 days. Thank you for your contributions.