costrouc commented 3 years ago

Summary

QHub currently lacks a backup and restore solution. Initially this was a simple problem, since all state was stored on a single NFS filestore: we talked about having a Kubernetes cron job run restic daily to back up the filesystem to a single S3 bucket. However, databases and other state are now stored in several other PVCs within QHub. We expect this to grow, so we need a generic solution that allows us to back up and restore all storage within a cluster. We are proposing Kubernetes backups using Velero, which looks to be a well-adopted open source solution for backup and restore.

Proposed implementation

We realize this is a large issue, and it will most likely be easiest to approach it in steps.

The first step would be to deploy the velero helm chart within QHub. There are other examples of deploying a helm chart within QHub in PRs, the most similar one being https://github.com/Quansight/qhub/pull/733. This will only deploy the velero agent on the kubernetes cluster. It should be configured via a qhub-config.yaml configuration setting; the PR above gives an example of adding such a setting. There will additionally be a credentials key that takes an arbitrary dict of credentials to pass on to the helm chart. See https://github.com/vmware-tanzu/helm-charts/blob/main/charts/velero/README.md#option-1-cli-commands. These credentials will be used to set up file backups and block storage backups. The schedule key will control the frequency of regular backups. Backups should be an optional feature that is disabled by default.

velero:
  enabled: true/false
  schedule: "0 0 * * *"
  credentials:
     ...
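
The settings above could be read into a small typed object before being passed on to the helm chart. The following is only a minimal sketch: the names VeleroConfig and load_velero_config are hypothetical, and the defaults (disabled, daily at midnight) are assumptions taken from the description above rather than an existing QHub schema.

```python
from dataclasses import dataclass, field


@dataclass
class VeleroConfig:
    """Hypothetical in-memory form of the `velero` section of qhub-config.yaml."""

    enabled: bool = False          # backups are opt-in, disabled by default
    schedule: str = "0 0 * * *"    # cron expression, daily at midnight
    credentials: dict = field(default_factory=dict)  # passed through to the chart

    def __post_init__(self):
        # Very light sanity check: a standard cron expression has five fields.
        if len(self.schedule.split()) != 5:
            raise ValueError(f"schedule must have 5 cron fields: {self.schedule!r}")


def load_velero_config(config: dict) -> VeleroConfig:
    """Build a VeleroConfig from an already-parsed qhub-config.yaml dict."""
    return VeleroConfig(**config.get("velero", {}))
```

A config without a velero section yields a disabled VeleroConfig, which matches the "optional feature" requirement.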

Next, once velero is deployed on the cluster, there should be the ability to trigger a backup manually, similar to how we handle terraform: https://github.com/Quansight/qhub/blob/main/qhub/provider/terraform.py#L23. Since velero is a go binary, it should be possible to transparently download the velero binary (https://github.com/vmware-tanzu/velero/releases/tag/v1.6.1) and expose it in the cli behind a qhub backup and a qhub restore command. For now we would like to create a velero provider in https://github.com/Quansight/qhub/tree/main/qhub/provider that can trigger a backup and restore of the qhub storage.
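
As a rough illustration of what such a provider might look like, here is a sketch in Python. Everything in it is an assumption for discussion: the function names are hypothetical, the release URL pattern is inferred from the velero GitHub release assets, and only the velero flags already used elsewhere in this thread appear.

```python
import platform
import subprocess

VELERO_VERSION = "v1.6.1"  # the release linked above; an assumption, not a pin

# Map Python's machine names to the architecture names used in release assets.
_ARCH = {"x86_64": "amd64", "amd64": "amd64", "aarch64": "arm64", "arm64": "arm64"}


def velero_download_url(version=VELERO_VERSION, system=None, machine=None):
    """Build the GitHub release URL of the velero tarball for a platform."""
    system = (system or platform.system()).lower()
    arch = _ARCH[(machine or platform.machine()).lower()]
    return (
        "https://github.com/vmware-tanzu/velero/releases/download/"
        f"{version}/velero-{version}-{system}-{arch}.tar.gz"
    )


def backup_command(binary, name, namespaces=None):
    """Argument list for a manual backup that waits for completion."""
    cmd = [str(binary), "backup", "create", name, "--wait"]
    if namespaces:
        cmd.append("--include-namespaces=" + ",".join(namespaces))
    return cmd


def restore_command(binary, name, from_backup):
    """Argument list for restoring from a named backup."""
    return [str(binary), "restore", "create", name, "--from-backup", from_backup]


def run(cmd):
    """Run a velero command, raising CalledProcessError on failure."""
    subprocess.run(cmd, check=True)
```

A qhub backup subcommand could then call run(backup_command(path_to_binary, "manual", ["dev"])) after downloading and unpacking the tarball.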

Initially we would like a simple qhub backup and qhub restore command. Eventually we could imagine these commands growing to support more complicated backups, but we realize this problem is complicated enough as currently scoped.

Additionally there should be documentation added for the admin and dev guide.

Acceptance Criteria

[ ] upon initial deployment of a QHub cluster with the backup configuration setting enabled, the cluster should be backed up every 24h to an s3 bucket
[ ] qhub backup should trigger a manual backup of the cluster, with files being backed up to an s3 bucket
[ ] qhub restore should trigger a restore action that will refresh the contents of pvcs within the cluster (this is less well understood at the moment and may not be possible).
[ ] Velero is installed via a helm chart instead of the velero binary

Tasks to complete

https://github.com/Quansight/qhub/issues/744 work with @tarundmsharma to complete deployment of helm chart using terraform
[x] #745
[x] #746

Related to

For history, see https://github.com/Quansight/qhub/issues/99

costrouc commented 3 years ago

Some good points were raised during our meeting. The velero backup will only apply to block storage pv/pvcs deployed within kubernetes on the specific cloud providers. We do use efs from aws, and this would be out of scope.

Additionally, I foresee a conversation around storing credentials for s3 bucket access within qhub-config.yaml. For now, assume that storing credentials within the configuration is okay, because future PRs will solve storing secrets in the configuration.

costrouc commented 3 years ago

Assigning:

@toonarmycaptain
@cleonard

toonarmycaptain commented 3 years ago

Some issues raised:

Whether the VMWare/Tanzu Helm chart will fit our needs, and how much modification it will need.
Dependency on VMWare to maintain (and keep open source) Velero and the Helm chart
DigitalOcean is not supported by Velero, and while there is a community supported plugin (in DO's github) it does not appear to be under regular development/maintenance.
Amount of work/maintenance necessary to make and keep a Velero solution in qhub cloud agnostic

toonarmycaptain commented 3 years ago

Looking at current alternatives to Velero:

| Tool | Cost (paid / limited free / free) | Source (OSS / source available / closed source) | Features (full backup / etcd only) | Tool/platform | Presently maintained |
| --- | --- | --- | --- | --- | --- |
| Portworx PX-Backup | Limited 5TB/5 nodes/30 vol | No, relies on OSS libs | Full | Tool | Yes |
| Kasten | Limited to 10 nodes | No | Full | Tool | Yes |
| Kubedr -> CloudCasa | Free | OSS - Apache | etcd only | Tool | Alpha/unmaintained - 03/2020 |
| Rancher/Longhorn | Free | OSS | Full | Platform/storage tool | Yes |
| Stash by AppsCode | Free Community Edition | OSS - must apply for 1 yr free license | Limited - no local/auto/batch backup in Free Edition | Tool | Yes |

tl;dr the market is very limited vis-à-vis FOSS alternatives to Velero.

costrouc commented 3 years ago

Wanted to document a solution I got working on prem via minikube and via digital ocean. This seems to be cloud agnostic for backups which seems promising. In addition I didn't realize how complete the velero backups are. They include all of the resources as well and give strong controls on the backup.

minikube start --driver=docker --kubernetes-version=v1.21.3

This starts the minikube cluster. Next we need to create the minio s3 backup target. Of course, we could use a cloud-based bucket instead.

apiVersion: v1
kind: Service
metadata:
  name: minio
spec:
  type: NodePort
  ports:
  - name: "9000"
    nodePort: 30900
    port: 9000
    targetPort: 9000
  - name: "9001"
    nodePort: 30901
    port: 9001
    targetPort: 9001
  selector:
    app: minio
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: minio-claim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: minio
  labels:
    app: minio
spec:
  replicas: 1
  selector:
    matchLabels:
      app: minio
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: minio
    spec:
      containers:
        - name: minio
          image: minio/minio:RELEASE.2021-08-25T00-41-18Z
          args:
            - "-c"
            - "mkdir -p /data/velero && /usr/bin/minio server /data --console-address 0.0.0.0:9001"
          command:
            - "sh"
          env:
            - name: MINIO_ACCESS_KEY
              value: admin
            - name: MINIO_SECRET_KEY
              value: password
          ports:
            - containerPort: 9000
            - containerPort: 9001
          volumeMounts:
            - mountPath: /data
              name: minio-claim
      restartPolicy: Always
      volumes:
        - name: minio-claim
          persistentVolumeClaim:
            claimName: minio-claim

and then an example application to test the backup with

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pod-claim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: hellopod
spec:
  containers:
    - name: hello
      image: busybox
      imagePullPolicy: IfNotPresent
      command:
      - /bin/sh
      - -c
      - "date >> /data/example.txt; sleep 100000"
      volumeMounts:
        - mountPath: /data
          name: pod-claim
  restartPolicy: OnFailure
  volumes:
    - name: pod-claim
      persistentVolumeClaim:
        claimName: pod-claim

Then kubectl apply both of these manifests. Next we install the velero CLI and also install velero on the cluster. We need to create a credentials file that describes how to access our S3 bucket.

[default]
aws_access_key_id = admin
aws_secret_access_key = password

And then we download velero

wget https://github.com/vmware-tanzu/velero/releases/download/v1.6.3/velero-v1.6.3-linux-amd64.tar.gz
tar -xf *.tar.gz
cd velero-*

./velero install --provider=aws --plugins velero/velero-plugin-for-aws:v1.0.0 --use-restic --use-volume-snapshots=false --bucket=velero --secret-file /tmp/velero/credentials.txt --backup-location-config region=default,s3ForcePathStyle="true",s3Url=http://minio.default.svc:9000

Finally, let's demonstrate a backup

./velero backup create anexample --default-volumes-to-restic=true

You can check that a backup was performed successfully by visiting the web ui for minio. The minikube ip address is available via minikube ip and the port is 30900; alternatively, you can access the ui via port forwarding: https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/. I also briefly tested deleting the pod resource and then restoring the volume. This seemed to work, though I didn't test it as much. However, the backup is clearly happening on DO and minikube. On prem, velero has issues with hostPaths as pvc volumes; outside of testing, however, I would consider this a rare circumstance, since hostPaths cannot work for any true multinode kubernetes deployment.

This also looks like it will be able to backup efs and cloud specific pvcs :smile:. So good news @brl0! Still very much POC but I believe this tool will work great for our use case and then some.

iameskild commented 2 years ago

At a high level it appears that a Velero + Restic backup will most likely work for our purposes. I started my testing on a minikube cluster but kept running into errors (they might still be user errors), so I decided to repurpose an existing AWS deployment I was using; I had much better success backing up and restoring on the AWS QHub cluster (steps outlined below). There are still a handful of things to test and consider:

Test with main
- ensure keycloak postgresql db is also properly restored
Test on other cloud providers
Explore using Helm chart to backup / restore
Explore how end-user would go about restoring system
- CI/CD workflow might be most convenient and aligns with the infrastructure as code paradigm

Steps

qhub --version
0.3.13

# brew install velero 
velero version 
Client:
    Version: v1.7.0
    Git commit: -
Server:
    Version: v1.7.0

velero install \
--provider=aws \
--plugins=velero/velero-plugin-for-aws:v1.3.0 \
--use-restic \
--default-volumes-to-restic=true \
--bucket=$BUCKET \
--secret-file ./credentials.txt \
--backup-location-config region=$REGION,s3ForcePathStyle=true,s3Url=http://s3.$REGION.amazonaws.com \
--wait \
--snapshot-location-config region=$REGION

velero backup create test --include-namespaces=dev --wait

# Tear down QHub 
python -m qhub destroy -c qhub-config.yaml

# Redeploy with same config file
python -m qhub deploy -c qhub-config.yaml

# Prepare for NFS restore
# - delete nfs-mount-dev-share, conda-store-dev-share PVCs
# - delete jupyterhub-sftp Deployment

# Update PV status from "Released" to "Available" by clearing its claimRef
kubectl patch pv nfs-mount-dev-share -p '{"spec":{"claimRef": null}}'
kubectl patch pv conda-store-dev-share -p '{"spec":{"claimRef": null}}'

velero create restore test-restore --from-backup=test

Observations

If the resource is already online and available, then the restore will log a warning, skip it and move on
Upon restore, three dask-schedulers with a handful of workers each, were also restored
- A possible way to "flush" the state of the cluster might involve Velero backup hooks

This restore completed with a PartiallyFailed status due to its inability to restore the jupyterhub-sftp volume home

The error message states that the error is related to there already being a shared folder (see below)

Although I haven't tested it yet, I suspect if we delete the shared folder (even if it's empty) prior to the restore, we can get the restic restore to complete successfully

Errors:
Velero:  pod volume restore failed: error restoring volume: error running restic restore, cmd=restic restore --repo=s3:http://s3.eu-west-2.amazonaws.com/eaeqhubbu/restic/dev --password-file=/tmp/credentials/velero/velero-restic-credentials-repository-password --cache-dir=/scratch/.cache/restic bd6d4b69 --target=., stdout=restoring <Snapshot bd6d4b69 of [/host_pods/f3eaeaa7-7755-4c99-b79c-4fbb5029ee55/volumes/kubernetes.io~nfs/nfs-mount-dev-share] at 2021-11-11 23:39:55.972809323 +0000 UTC by root@velero> to .
, stderr=ignoring error for /home/iameskild/shared: Symlink: symlink /home/shared /host_pods/003f3c3d-d922-4784-a26c-ea60dd639775/volumes/kubernetes.io~nfs/nfs-mount-dev-share/home/iameskild/shared: file exists
Fatal: There were 1 errors

tylerpotts commented 2 years ago

This is verified working on AWS and GCP:

Backup

In order to specify a volume for restic restoration, we need to annotate a pod with backup.velero.io/backup-volumes: <pods_name_for_persistentvolume>. I decided to do this by creating a pod specifically for this purpose. With the following saved as custom_pod.yaml I added it to the cluster with kubectl apply -f custom_pod.yaml

kind: Pod
apiVersion: v1
metadata:
  name: restic-placeholder
  namespace: dev
  annotations:
    backup.velero.io/backup-volumes: home
spec:
  volumes:
    - name: home
      persistentVolumeClaim:
        claimName: "nfs-mount-dev-share"
  containers:
    - name: placeholder
      image: ubuntu
      command: ["sleep", "36000000000000"]
      volumeMounts:
        - mountPath: "/data"
          name: home

To avoid errors on mounts that don't need to be backed up, set the following labels to exclude the persistentvolumeclaims like so:

kubectl label pvc conda-store-dev-share velero.io/exclude-from-backup=true -n dev
kubectl label pvc hub-db-dir velero.io/exclude-from-backup=true -n dev
kubectl label pvc qhub-conda-store-storage velero.io/exclude-from-backup=true -n dev

With this setup, velero can be installed with the default-volumes-to-restic=false:

velero install \
--provider=aws \
--plugins=velero/velero-plugin-for-aws:v1.3.0 \
--use-restic \
--default-volumes-to-restic=false \
--bucket=$BUCKET \
--secret-file ./credentials.txt \
--backup-location-config region=$REGION,s3ForcePathStyle=true,s3Url=http://s3.$REGION.amazonaws.com \
--wait \
--snapshot-location-config region=$REGION

The backup is created with:

velero backup create qhub-backup --include-namespaces=dev --wait

Restore

Note that all user notebooks need to be shut down as well. Existing user sessions will maintain a connection to the persistent volume claim and prevent deletion. We delete the resources that are using the nfs-mount-dev-share with the commands below:

kubectl delete deployments qhub-jupyterhub-sftp -n dev
kubectl delete pod restic-placeholder -n dev
kubectl delete pvc nfs-mount-dev-share -n dev
kubectl patch pv nfs-mount-dev-share -p '{"spec":{"claimRef": null}}'

With these gone, the restore can be initiated with:

velero restore create qhub-restore --from-backup qhub-backup

Note that the restore will say that it partially failed. This is because there is already a symlink for /home/shared. However, data in the user directories as well as the shared directories gets restored as expected.
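
If this procedure were scripted, the ordering of the steps matters more than the individual commands. Below is a minimal sketch that only assembles the command list in the order described above; restore_plan is a hypothetical helper, and the resource names in the usage note are the ones from this comment. Building the patch with json.dumps guarantees valid JSON for the claimRef reset.

```python
import json
import shlex


def restore_plan(namespace, pv, pvc, backup, restore, blockers):
    """Return the kubectl/velero commands for a restore, in execution order.

    `blockers` is a list of (kind, name) pairs for resources that still
    mount the PVC and must be deleted first.
    """
    plan = [["kubectl", "delete", kind, name, "-n", namespace] for kind, name in blockers]
    # Delete the claim itself, then release the volume so it can be re-bound.
    plan.append(["kubectl", "delete", "pvc", pvc, "-n", namespace])
    patch = json.dumps({"spec": {"claimRef": None}})
    plan.append(["kubectl", "patch", "pv", pv, "-p", patch])
    # Only after the volume is available does the velero restore start.
    plan.append(["velero", "restore", "create", restore, "--from-backup", backup])
    return plan


def render(plan):
    """Render the plan as shell-quoted lines, e.g. for a dry run."""
    return [shlex.join(cmd) for cmd in plan]
```

For the cluster above this would be called with namespace "dev", nfs-mount-dev-share as both pv and pvc, and the qhub-jupyterhub-sftp deployment and restic-placeholder pod as blockers. Shutting down user notebook servers is not covered by this sketch.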

tylerpotts commented 2 years ago

To copy the backed up data for home and shared to the current working directory, run the following command:

restic -r s3:s3.amazonaws.com/<backup_bucket>/restic/dev --verbose=2 restore latest --target .

This will prompt for a password, which will always be static-passw0rd

iameskild commented 1 year ago

Now that we have Argo-Workflows enabled, we can run backup and restore workflows much more easily; with a few small updates, we can schedule backups as cron-workflows.

This backup/restore solution also relies on restic to perform the actual backup and restore.

Requirements and implementation

Here is an example of how we might want to run backups and restores.

We will need:

1. an image that contains restic and the cloud-specific CLI (gcloud, awscli, etc.)
2. a cloud provider service account with the ability to read/write/create storage buckets/blobs
3. secrets for each of the following:
   - RESTIC_REPOSITORY
   - RESTIC_PASSWORD
   - cloud specific credentials such as
     - GOOGLE_APPLICATION_CREDENTIALS - specific to the service account mentioned in 2.
     - GOOGLE_PROJECT_ID
4. one backup workflow and one restore workflow
   - as mentioned above, it would make sense to have the backup workflow run on a schedule (e.g. every day at midnight)

Resource details

I created this image so I could test this proposed solution (more on the results below). However in the long-term, we would likely require an image to be built and pushed to our open registries for each of the cloud providers that we support (AWS, Azure, DO, GCP).

Again, to test the feasibility of this solution, I created the following secrets:


apiVersion: v1
kind: Secret
metadata:
  name: google-application-credentials
  namespace: dev
type: Opaque
data:
  GOOGLE_APPLICATION_CREDENTIALS: ---

---

apiVersion: v1
kind: Secret
metadata:
  name: google-project-id
  namespace: dev
type: Opaque
data:
  GOOGLE_PROJECT_ID: ---

---

apiVersion: v1
kind: Secret
metadata:
  name: restic-repo
  namespace: dev
type: Opaque
data:
  RESTIC_REPOSITORY: ---

---

apiVersion: v1
kind: Secret
metadata:
  name: restic-password
  namespace: dev
type: Opaque
data:
  RESTIC_PASSWORD: ---
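
The values above are redacted. For anyone reproducing this, keep in mind that the data field of a Kubernetes Secret expects base64-encoded values. A small sketch of generating such a manifest follows; secret_manifest is a hypothetical helper and the example value is a placeholder, not a real credential.

```python
import base64


def secret_manifest(name, namespace, values):
    """Build an Opaque Secret manifest, base64-encoding each value for `data`."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": {
            key: base64.b64encode(value.encode("utf-8")).decode("ascii")
            for key, value in values.items()
        },
    }


# Placeholder value for illustration only.
manifest = secret_manifest("restic-password", "dev", {"RESTIC_PASSWORD": "example"})
```

The resulting dict can be dumped to YAML or JSON and applied with kubectl apply -f.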

And then the actual workflows themselves.

Backup workflow

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: backup-workflow
  namespace: dev
spec:
  entrypoint: backup
  volumes:
  - name: google-application-credentials
    secret:
      secretName: google-application-credentials
  - name: nfs-volume
    persistentVolumeClaim:
      claimName: "jupyterhub-dev-share"  
  templates:
  - name: backup
    container:
      # image I created above
      image: ghcr.io/iameskild/restic:f4318fb95f1d63d414f4a3743d9488b4107d7367
      env:
      - name: GOOGLE_PROJECT_ID  
        valueFrom:
          secretKeyRef:
            name: google-project-id
            key: GOOGLE_PROJECT_ID
      - name: RESTIC_REPOSITORY
        valueFrom:
          secretKeyRef:
            name: restic-repo
            key: RESTIC_REPOSITORY
      - name: RESTIC_PASSWORD
        valueFrom:
          secretKeyRef:
            name: restic-password
            key: RESTIC_PASSWORD
      volumeMounts:
        - mountPath: "/var/secrets/google"
          name: google-application-credentials
        # mount the NFS drive
        - mountPath: "/exports"
          name: nfs-volume
      command: [sh, -c]
      args: ['
        gcloud auth activate-service-account --key-file=/var/secrets/google/GOOGLE_APPLICATION_CREDENTIALS;
        restic init;
        restic backup /exports 
      ']
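
To run the backup on a schedule, the same spec can be wrapped in an Argo CronWorkflow, which nests the workflow spec under workflowSpec next to a cron schedule. The sketch below does that transformation on a parsed manifest; to_cron_workflow is a hypothetical helper, and both the midnight schedule and the Forbid concurrency policy are assumptions rather than tested settings.

```python
import copy


def to_cron_workflow(workflow, schedule="0 0 * * *"):
    """Wrap a parsed Argo Workflow manifest (a dict) in a CronWorkflow."""
    return {
        "apiVersion": workflow["apiVersion"],
        "kind": "CronWorkflow",
        "metadata": copy.deepcopy(workflow["metadata"]),
        "spec": {
            "schedule": schedule,
            # Assumption: never start a backup while the previous one still runs.
            "concurrencyPolicy": "Forbid",
            "workflowSpec": copy.deepcopy(workflow["spec"]),
        },
    }


# Trimmed stand-in for the backup workflow above.
backup = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Workflow",
    "metadata": {"name": "backup-workflow", "namespace": "dev"},
    "spec": {"entrypoint": "backup", "templates": [{"name": "backup"}]},
}
cron = to_cron_workflow(backup)
```

The original manifest is left untouched, so the one-off Workflow can still be submitted manually.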

Restore workflow

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: restore-workflow
  namespace: dev
spec:
  entrypoint: restore
  volumes:
  - name: google-application-credentials
    secret:
      secretName: google-application-credentials
  - name: nfs-volume
    persistentVolumeClaim:
      claimName: "jupyterhub-dev-share"  
  templates:
  - name: restore
    container:
      # again, same image I created above
      image: ghcr.io/iameskild/restic:f4318fb95f1d63d414f4a3743d9488b4107d7367
      env:
      - name: GOOGLE_PROJECT_ID  
        valueFrom:
          secretKeyRef:
            name: google-project-id
            key: GOOGLE_PROJECT_ID
      - name: RESTIC_REPOSITORY
        valueFrom:
          secretKeyRef:
            name: restic-repo
            key: RESTIC_REPOSITORY
      - name: RESTIC_PASSWORD
        valueFrom:
          secretKeyRef:
            name: restic-password
            key: RESTIC_PASSWORD
      volumeMounts:
        - mountPath: "/var/secrets/google"
          name: google-application-credentials
        # mount the NFS drive
        - mountPath: "/exports"
          name: nfs-volume
      command: [sh, -c]
      args: ['
        gcloud auth activate-service-account --key-file=/var/secrets/google/GOOGLE_APPLICATION_CREDENTIALS;
        restic restore latest --target=/
        ']

Results

I have successfully tested this solution - including both the backup and restore steps - on a live cluster running on GCP with a backup residing on GCS 🎉

Notes and open items

A few things to note, given the cloud-native nature of this solution, we will need a way to manage secrets (unless we want to create service-account credentials during the deployment). This means that we will likely need to consider one of the proposed solutions to secret management (see the SOPS RFD and Vault RFD).

Obviously this is just a POC and will need to be converted to a Terraform script. That said, this solution looks very promising and I am curious what the rest of the team thinks.

dharhas commented 1 year ago

Just so we don't forget. A full backup and restore needs to also take into account keycloak and conda-store databases.

costrouc commented 1 year ago

With this in the works now there is pressure for me to develop a backup/restore solution for conda-store :slightly_smiling_face:.

costrouc commented 1 year ago

> A few things to note, given the cloud-native nature of this solution, we will need a way to manage secrets (unless we want to create service-account credentials during the deployment). This means that we will likely need to consider one of the proposed solutions to secret management (see https://github.com/nebari-dev/governance/issues/29 and https://github.com/nebari-dev/governance/issues/32).

Absolutely. Yes we need a proper secret storage before adopting something like this to avoid leaking secrets in the argo workflows.

iameskild commented 1 year ago

@trallard @dharhas @costrouc Is the backup/restore feature something worth including in a future roadmap? Several of our recent releases have required users to backup/restore their data and this would make that process a lot smoother.

costrouc commented 1 year ago

@iameskild yes this is something that should be included in the future roadmap. I'm going to suggest that we use the extension PR work to add a subcommand for backup and restore in a separate nebari repository.

I think that backup/restore is something that we will need to improve incrementally, independent of nebari. A separate repository would also allow us to release more frequently. I see several iterations that we should aim for:

What to backup:

[ ] shared directory, group/user data
[ ] conda-store state
[ ] keycloak state

Where to backup to:

[ ] directory when nebari backup is run returning large tarball
[ ] external s3 bucket

Restore should have similar requirements.

Priority in my mind:

backup/restore command which can backup a shared directory to local directory and then restore the state
backup/restore additionally keycloak
backup/restore to external s3 bucket
backup/restore conda-store as well
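
To make the first priority concrete, here is a minimal sketch of backing up a directory to a local tarball and restoring it, using only the Python standard library. The function names are hypothetical, and it assumes the shared directory is already reachable as a local path (for example from a pod that mounts the NFS share), which is the harder part and is not shown.

```python
import tarfile
from pathlib import Path


def backup_directory(source, archive):
    """Write the contents of `source` into a gzip-compressed tarball.

    `archive` must live outside `source`, otherwise the tarball would
    try to include itself.
    """
    with tarfile.open(archive, "w:gz") as tar:
        # arcname="." stores paths relative to the directory root.
        tar.add(Path(source), arcname=".")
    return Path(archive)


def restore_directory(archive, target):
    """Extract a tarball produced by backup_directory into `target`."""
    target = Path(target)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(target)
    return target
```

Existing files in the target are overwritten by extraction, and files that are absent from the archive are left in place, so this is not a full "reset to backup state".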

iameskild commented 1 year ago

@costrouc creating these as subcommands makes a lot of sense!

And to confirm, we will be relying on the kubernetes (and keycloak) python client directly and won't be using terraform?

And would the NFS backup be a single tar.gz file? Perhaps we could look into restic again. The benefits include only backing up the diff so it would be very quick after the initial backup. As I've done elsewhere, we can perform backups for individual directories (users) so in the event of a failure during the restore, we can pick back up more reliably.
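
A rough sketch of the per-directory idea: generate one restic backup command per user directory and skip the ones that already completed, so an interrupted run can resume. The helper name and the use of the directory name as a restic tag are assumptions; the repository and password are expected to come from RESTIC_REPOSITORY and RESTIC_PASSWORD as in the workflows above.

```python
from pathlib import Path


def per_user_backup_commands(home_root, done=()):
    """Return one `restic backup` argument list per user directory.

    Directories whose names are listed in `done` are skipped, which lets
    a failed run continue where it stopped.
    """
    commands = []
    for user_dir in sorted(Path(home_root).iterdir()):
        if not user_dir.is_dir() or user_dir.name in done:
            continue
        # Tag each snapshot with the user name so it can be found later.
        commands.append(["restic", "backup", str(user_dir), "--tag", user_dir.name])
    return commands
```

Each command can be executed independently, with the user name recorded as done once it succeeds.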

iameskild commented 1 year ago

We could also include a backup gitops workflow that runs on a daily schedule.

kcpevey commented 9 months ago

Will this include extension data such as mlflow?

This will need to be tested on both AWS and AWS GovCloud for JATIC

viniciusdc commented 2 months ago

superseded by #2648

nebari-dev / nebari

Backup and Restore Implementation #743
