Closed costrouc closed 2 months ago
Some good points were raised during our meeting. The velero backup will only apply to block storage PVs/PVCs deployed within kubernetes on the specific cloud providers. We do use EFS from AWS and this would be out of scope.
Additionally I foresee a conversation around storing credentials within the qhub-config.yaml for s3 bucket access. For now assume that storing credentials within the bucket is okay. This is because other future PRs will solve storing secrets in the configuration.
Assigning:
Some issues raised:
Looking at current alternatives to Velero:
Name | Cost (e.g. paid/limited free/free) | Source (e.g. OSS/source available/closed source) | Features (e.g. full backup/etcd only) | Tool/platform | Presently maintained |
---|---|---|---|---|---|
Portworx PX-Backup | Limited 5TB/5 nodes/30 vol | No, relies on OSS libs | Full | Tool | Yes |
Kasten | Limited to 10 nodes | No | Full | Tool | Yes |
Kubedr -> CloudCasa | Free | OSS - Apache | etcd only | Tool | Alpha/unmaintained - 03/2020 |
Rancher/Longhorn | Free | OSS | Full | Platform/storage tool | Yes |
Stash by AppsCode | Free Community Edition | OSS - must apply for 1 yr free license | Limited - no local/auto/batch backup in Free Edition | Tool | Yes |
tl;dr the market is very limited vis-à-vis FOSS alternatives to Velero.
Wanted to document a solution I got working on prem via minikube and via DigitalOcean. This seems to be cloud agnostic for backups, which is promising. In addition I didn't realize how complete the velero backups are. They include all of the resources as well and give strong controls on the backup.
To start the minikube cluster:
minikube start --driver=docker --kubernetes-version=v1.21.3
Then we need to create the minio s3 backup target. Sure, we could use a cloud based backup instead.
apiVersion: v1
kind: Service
metadata:
  name: minio
spec:
  type: NodePort
  ports:
    - name: "9000"
      nodePort: 30900
      port: 9000
      targetPort: 9000
    - name: "9001"
      nodePort: 30901
      port: 9001
      targetPort: 9001
  selector:
    app: minio
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: minio-claim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: minio
  labels:
    app: minio
spec:
  replicas: 1
  selector:
    matchLabels:
      app: minio
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: minio
    spec:
      containers:
        - name: minio
          image: minio/minio:RELEASE.2021-08-25T00-41-18Z
          args:
            - "-c"
            - "mkdir -p /data/velero && /usr/bin/minio server /data --console-address 0.0.0.0:9001"
          command:
            - "sh"
          env:
            - name: MINIO_ACCESS_KEY
              value: admin
            - name: MINIO_SECRET_KEY
              value: password
          ports:
            - containerPort: 9000
            - containerPort: 9001
          volumeMounts:
            - mountPath: /data
              name: minio-claim
      restartPolicy: Always
      volumes:
        - name: minio-claim
          persistentVolumeClaim:
            claimName: minio-claim
And then an example application to test the backup with:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pod-claim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: hellopod
spec:
  containers:
    - name: hello
      image: busybox
      imagePullPolicy: IfNotPresent
      command:
        - /bin/sh
        - -c
        - "date >> /data/example.txt; sleep 100000"
      volumeMounts:
        - mountPath: /data
          name: pod-claim
  restartPolicy: OnFailure
  volumes:
    - name: pod-claim
      persistentVolumeClaim:
        claimName: pod-claim
Then kubectl apply both of these manifests. Next we download the velero CLI and install velero on the cluster. First we need to create a credentials file for our S3 bucket (saved here as /tmp/velero/credentials.txt) that describes how to access it.
[default]
aws_access_key_id = admin
aws_secret_access_key = password
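As a minimal sketch (not part of the original steps), the same credentials file could be generated with Python's standard `configparser`. The helper name `write_velero_credentials` is hypothetical, and `admin`/`password` are only the MinIO test credentials used in this walkthrough.

```python
import configparser
from pathlib import Path


def write_velero_credentials(path, access_key, secret_key):
    """Write an AWS-style credentials file with a single [default] profile.

    Hypothetical helper for illustration; velero only needs the resulting file.
    """
    config = configparser.ConfigParser()
    # "default" (lowercase) is an ordinary section here, not configparser's
    # special "DEFAULT" section, so it is written out as "[default]".
    config["default"] = {
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
    }
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w") as handle:
        config.write(handle)
    return path
```

Calling `write_velero_credentials("/tmp/velero/credentials.txt", "admin", "password")` would produce the file that the `--secret-file` flag points at in the install command.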
And then we download velero
wget https://github.com/vmware-tanzu/velero/releases/download/v1.6.3/velero-v1.6.3-linux-amd64.tar.gz
tar -xf *.tar.gz
cd velero-*
./velero install --provider=aws --plugins velero/velero-plugin-for-aws:v1.0.0 --use-restic --use-volume-snapshots=false --bucket=velero --secret-file /tmp/velero/credentials.txt --backup-location-config region=default,s3ForcePathStyle="true",s3Url=http://minio.default.svc:9000
Finally let's demonstrate a backup:
./velero backup create anexample --default-volumes-to-restic=true
You can check that a backup was performed successfully by visiting the web UI for minio. The minikube IP address is available via minikube ip and the port is 30900. Additionally you can access the UI via port forwarding: https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/. I also briefly tested deleting the pod resource and then restoring the volume. This seemed to work, though I didn't test this as much. However, the backup is clearly happening on DO and minikube. On prem, velero has issues with hostPaths as PVC volumes; however, outside of testing I would consider this a rare circumstance, since hostPaths cannot work for any true multinode kubernetes deployment.
This also looks like it will be able to backup efs and cloud specific pvcs :smile:. So good news @brl0! Still very much POC but I believe this tool will work great for our use case and then some.
At a high level it appears that a Velero + Restic backup will most likely work for our purposes. I started my testing on a minikube cluster but kept running into errors (they might still be user errors), so I decided to repurpose an existing AWS deployment I was using; I had much better success backing up and restoring on the AWS QHub cluster (steps outlined below). There are still a handful of things to test and consider:
qhub --version
0.3.13
# brew install velero
velero version
Client:
Version: v1.7.0
Git commit: -
Server:
Version: v1.7.0
velero install \
  --provider=aws \
  --plugins=velero/velero-plugin-for-aws:v1.3.0 \
  --use-restic \
  --default-volumes-to-restic=true \
  --bucket=$BUCKET \
  --secret-file ./credentials.txt \
  --backup-location-config region=$REGION,s3ForcePathStyle=true,s3Url=http://s3.$REGION.amazonaws.com \
  --wait \
  --snapshot-location-config region=$REGION
velero backup create test --include-namespaces=dev --wait
# Tear down QHub
python -m qhub destroy -c qhub-config.yaml
# Redeploy with same config file
python -m qhub deploy -c qhub-config.yaml
# Prepare for NFS restore
# - delete nfs-mount-dev-share, conda-store-dev-share PVCs
# - delete jupyterhub-sftp Deployment
# Update PV reclaim-status from "Released" to "Available"
kubectl patch pv nfs-mount-dev-share -p '{"spec":{"claimRef": null}}'
kubectl patch pv conda-store-dev-share -p '{"spec":{"claimRef": null}}'
velero create restore test-restore --from-backup=test
The restore finishes with a PartiallyFailed status due to its inability to restore the shared folder in the home volume of jupyterhub-sftp (see below).
Errors:
Velero: pod volume restore failed: error restoring volume: error running restic restore, cmd=restic restore --repo=s3:http://s3.eu-west-2.amazonaws.com/eaeqhubbu/restic/dev --password-file=/tmp/credentials/velero/velero-restic-credentials-repository-password --cache-dir=/scratch/.cache/restic bd6d4b69 --target=., stdout=restoring <Snapshot bd6d4b69 of [/host_pods/f3eaeaa7-7755-4c99-b79c-4fbb5029ee55/volumes/kubernetes.io~nfs/nfs-mount-dev-share] at 2021-11-11 23:39:55.972809323 +0000 UTC by root@velero> to .
, stderr=ignoring error for /home/iameskild/shared: Symlink: symlink /home/shared /host_pods/003f3c3d-d922-4784-a26c-ea60dd639775/volumes/kubernetes.io~nfs/nfs-mount-dev-share/home/iameskild/shared: file exists
Fatal: There were 1 errors
This is verified working on AWS and GCP:
In order to specify a volume for restic restoration, we need to annotate a pod with backup.velero.io/backup-volumes: <pods_name_for_persistentvolume>. I decided to do this by creating a pod specifically for this purpose. With the following saved as custom_pod.yaml, I added it to the cluster with kubectl apply -f custom_pod.yaml:
kind: Pod
apiVersion: v1
metadata:
  name: restic-placeholder
  namespace: dev
  annotations:
    backup.velero.io/backup-volumes: home
spec:
  volumes:
    - name: home
      persistentVolumeClaim:
        claimName: "nfs-mount-dev-share"
  containers:
    - name: placeholder
      image: ubuntu
      command: ["sleep", "36000000000000"]
      volumeMounts:
        - mountPath: "/data"
          name: home
To avoid errors on mounts that don't need to be backed up, set the following labels to exclude the persistentvolumeclaims like so:
kubectl label pvc conda-store-dev-share velero.io/exclude-from-backup=true -n dev
kubectl label pvc hub-db-dir velero.io/exclude-from-backup=true -n dev
kubectl label pvc qhub-conda-store-storage velero.io/exclude-from-backup=true -n dev
With this setup, velero can be installed with --default-volumes-to-restic=false:
velero install \
--provider=aws \
--plugins=velero/velero-plugin-for-aws:v1.3.0 \
--use-restic \
--default-volumes-to-restic=false \
--bucket=$BUCKET \
--secret-file ./credentials.txt \
--backup-location-config region=$REGION,s3ForcePathStyle=true,s3Url=http://s3.$REGION.amazonaws.com \
--wait \
--snapshot-location-config region=$REGION
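If this install step were later driven from Python (for example from a QHub provider module), the argument list could be assembled as below. This is a sketch under assumptions: `velero_install_args` is a hypothetical helper, `my-bucket` is a placeholder bucket name, and the flags are copied from the command above rather than checked against other Velero versions.

```python
import shlex


def velero_install_args(bucket, region, credentials_file, default_volumes_to_restic=False):
    """Build the argv for the `velero install` command shown above."""
    s3_url = f"http://s3.{region}.amazonaws.com"
    return [
        "velero", "install",
        "--provider=aws",
        "--plugins=velero/velero-plugin-for-aws:v1.3.0",
        "--use-restic",
        # Python's True/False become the lowercase true/false the CLI expects.
        f"--default-volumes-to-restic={str(default_volumes_to_restic).lower()}",
        f"--bucket={bucket}",
        "--secret-file", credentials_file,
        "--backup-location-config",
        f"region={region},s3ForcePathStyle=true,s3Url={s3_url}",
        "--wait",
        "--snapshot-location-config", f"region={region}",
    ]


# Print the command for review instead of running it.
print(shlex.join(velero_install_args("my-bucket", "eu-west-2", "./credentials.txt")))
```

Passing the list to `subprocess.run(args, check=True)` would execute it without going through a shell, which avoids quoting problems with the comma-separated config values.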
The backup is created with:
velero backup create qhub-backup --include-namespaces=dev --wait
Note that all user notebooks need to be shut down as well. Existing user sessions will maintain a connection to the persistent volume claim and prevent deletion. We delete the resources that are using the nfs-mount-dev-share with the commands below:
kubectl delete deployments qhub-jupyterhub-sftp -n dev
kubectl delete pod restic-placeholder -n dev
kubectl delete pvc nfs-mount-dev-share -n dev
kubectl patch pv nfs-mount-dev-share -p '{"spec":{"claimRef": null}}'
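The patch argument must be a well-formed document, which is easy to get wrong when typed by hand. A small sketch, assuming a hypothetical helper `clear_claim_ref_command`, that builds the patch with `json.dumps` so the quoting is always correct:

```python
import json


def clear_claim_ref_command(pv_name):
    """Return the kubectl argv that clears a PV's claimRef.

    Clearing claimRef is what moves the PV from "Released" back to "Available".
    """
    patch = json.dumps({"spec": {"claimRef": None}})  # None serialises to null
    return ["kubectl", "patch", "pv", pv_name, "-p", patch]


for name in ("nfs-mount-dev-share",):
    print(clear_claim_ref_command(name))
```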
With these gone, the restore can be initiated with:
velero restore create qhub-restore --from-backup qhub-backup
Note that the restore will say that it partially failed. This is because there is already a symlink for /home/shared. However, data in the user directories as well as the shared directories gets restored as expected.
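The "file exists" failure is the ordinary behaviour of creating a symlink at a path that is already occupied. The following is only an analogy in Python on a POSIX system (restic itself is not Python, and the directory names are made up for the demo), showing that a second attempt to create the same link fails rather than overwriting it:

```python
import os
import tempfile


def restore_symlink_twice():
    """Create the same symlink twice and report what the second attempt raises."""
    with tempfile.TemporaryDirectory() as tmp:
        shared = os.path.join(tmp, "home", "shared")
        user_home = os.path.join(tmp, "home", "someuser")  # hypothetical user
        os.makedirs(shared)
        os.makedirs(user_home)
        link = os.path.join(user_home, "shared")
        os.symlink(shared, link)  # the link that already exists in the cluster
        try:
            os.symlink(shared, link)  # what the restore then tries to do
        except FileExistsError as err:
            return type(err).__name__
        return None


print(restore_symlink_twice())
```

Because only the link creation is rejected, the regular files around it are unaffected, which is consistent with the user and shared data being restored as expected.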
To copy the backed up data for home and shared to the current working directory, run the following command:
restic -r s3:s3.amazonaws.com/<backup_bucket>/restic/dev --verbose=2 restore latest --target .
This will prompt for a password, which will always be static-passw0rd
Now that we have Argo-Workflows enabled, we can run backup and restore workflows much more easily; with a few small updates, we can schedule backups as cron-workflows.
This backup/restore solution also relies on restic to perform the actual backup and restore.
Here is an example of how we might want to run backup and restores.
We will need:
- restic and the cloud-specific CLI (gcloud, awscli, etc.)
- RESTIC_REPOSITORY
- RESTIC_PASSWORD
- GOOGLE_APPLICATION_CREDENTIALS - specific to the service account mentioned in 2.
- GOOGLE_PROJECT_ID
I created this image so I could test this proposed solution (more on the results below). However in the long-term, we would likely require an image to be built and pushed to our open registries for each of the cloud providers that we support (AWS, Azure, DO, GCP).
Again, to test the feasibility of this solution, I created the following secrets:
apiVersion: v1
kind: Secret
metadata:
  name: google-application-credentials
  namespace: dev
type: Opaque
data:
  GOOGLE_APPLICATION_CREDENTIALS: ---
---
apiVersion: v1
kind: Secret
metadata:
  name: google-project-id
  namespace: dev
type: Opaque
data:
  GOOGLE_PROJECT_ID: ---
---
apiVersion: v1
kind: Secret
metadata:
  name: restic-repo
  namespace: dev
type: Opaque
data:
  RESTIC_REPOSITORY: ---
---
apiVersion: v1
kind: Secret
metadata:
  name: restic-password
  namespace: dev
type: Opaque
data:
  RESTIC_PASSWORD: ---
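Values under `data` in a Kubernetes Secret must be base64-encoded. A minimal sketch of producing such a manifest, where `secret_manifest` is a hypothetical helper and the password is a dummy value:

```python
import base64
import json


def secret_manifest(name, namespace, key, value):
    """Return an Opaque Secret manifest with `value` base64-encoded under `data`."""
    encoded = base64.b64encode(value.encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": {key: encoded},
    }


# Dummy value for illustration only; never commit real secrets.
manifest = secret_manifest("restic-password", "dev", "RESTIC_PASSWORD", "not-a-real-password")
print(json.dumps(manifest, indent=2))
```

kubectl accepts JSON manifests as well as YAML, so the printed document could be piped to `kubectl apply -f -`. Using the `stringData` field instead of `data` would avoid the manual encoding.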
And then the actual workflows themselves.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: backup-workflow
  namespace: dev
spec:
  entrypoint: backup
  volumes:
    - name: google-application-credentials
      secret:
        secretName: google-application-credentials
    - name: nfs-volume
      persistentVolumeClaim:
        claimName: "jupyterhub-dev-share"
  templates:
    - name: backup
      container:
        # image I created above
        image: ghcr.io/iameskild/restic:f4318fb95f1d63d414f4a3743d9488b4107d7367
        env:
          - name: GOOGLE_PROJECT_ID
            valueFrom:
              secretKeyRef:
                name: google-project-id
                key: GOOGLE_PROJECT_ID
          - name: RESTIC_REPOSITORY
            valueFrom:
              secretKeyRef:
                name: restic-repo
                key: RESTIC_REPOSITORY
          - name: RESTIC_PASSWORD
            valueFrom:
              secretKeyRef:
                name: restic-password
                key: RESTIC_PASSWORD
        volumeMounts:
          - mountPath: "/var/secrets/google"
            name: google-application-credentials
          # mount the NFS drive
          - mountPath: "/exports"
            name: nfs-volume
        command: [sh, -c]
        args: ['
          gcloud auth activate-service-account --key-file=/var/secrets/google/GOOGLE_APPLICATION_CREDENTIALS;
          restic init;
          restic backup /exports
          ']
---
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: restore-workflow
  namespace: dev
spec:
  entrypoint: restore
  volumes:
    - name: google-application-credentials
      secret:
        secretName: google-application-credentials
    - name: nfs-volume
      persistentVolumeClaim:
        claimName: "jupyterhub-dev-share"
  templates:
    - name: restore
      container:
        # again, same image I created above
        image: ghcr.io/iameskild/restic:f4318fb95f1d63d414f4a3743d9488b4107d7367
        env:
          - name: GOOGLE_PROJECT_ID
            valueFrom:
              secretKeyRef:
                name: google-project-id
                key: GOOGLE_PROJECT_ID
          - name: RESTIC_REPOSITORY
            valueFrom:
              secretKeyRef:
                name: restic-repo
                key: RESTIC_REPOSITORY
          - name: RESTIC_PASSWORD
            valueFrom:
              secretKeyRef:
                name: restic-password
                key: RESTIC_PASSWORD
        volumeMounts:
          - mountPath: "/var/secrets/google"
            name: google-application-credentials
          # mount the NFS drive
          - mountPath: "/exports"
            name: nfs-volume
        command: [sh, -c]
        args: ['
          gcloud auth activate-service-account --key-file=/var/secrets/google/GOOGLE_APPLICATION_CREDENTIALS;
          restic restore latest --target=/
          ']
I have successfully tested this solution - including both the backup and restore steps - on a live cluster running on GCP with a backup residing on GCS 🎉
A few things to note, given the cloud-native nature of this solution, we will need a way to manage secrets (unless we want to create service-account credentials during the deployment). This means that we will likely need to consider one of the proposed solutions to secret management (see the SOPS RFD and Vault RFD).
Obviously this is just a POC and will need to be converted to a Terraform script. That said, this solution looks very promising and I am curious what the rest of the team thinks.
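On the point about scheduling backups as cron-workflows: an Argo CronWorkflow wraps an ordinary workflow spec under `spec.workflowSpec` and adds a `schedule`. The sketch below shows that conversion on a manifest loaded as a Python dict. The function name, the trimmed-down example workflow, and the daily 02:00 schedule are assumptions for illustration, and this has not been tested on a cluster.

```python
import copy


def to_cron_workflow(workflow, schedule, concurrency_policy="Forbid"):
    """Wrap a Workflow manifest (as a dict) into a CronWorkflow manifest."""
    return {
        "apiVersion": workflow["apiVersion"],
        "kind": "CronWorkflow",
        "metadata": copy.deepcopy(workflow["metadata"]),
        "spec": {
            "schedule": schedule,
            # "Forbid" skips a run while the previous backup is still going.
            "concurrencyPolicy": concurrency_policy,
            "workflowSpec": copy.deepcopy(workflow["spec"]),
        },
    }


# Trimmed-down stand-in for the backup workflow above.
backup_workflow = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Workflow",
    "metadata": {"name": "backup-workflow", "namespace": "dev"},
    "spec": {"entrypoint": "backup", "templates": [{"name": "backup"}]},
}
cron = to_cron_workflow(backup_workflow, "0 2 * * *")
print(cron["kind"], cron["spec"]["schedule"])
```

One thing to keep in mind for scheduled runs is that the backup workflow calls `restic init` every time, which only succeeds on the first run, so the init step may need to tolerate an existing repository.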
Just so we don't forget. A full backup and restore needs to also take into account keycloak and conda-store databases.
With this in the works now there is pressure for me to develop a backup/restore solution for conda-store :slightly_smiling_face:.
A few things to note, given the cloud-native nature of this solution, we will need a way to manage secrets (unless we want to create service-account credentials during the deployment). This means that we will likely need to consider one of the proposed solutions to secret management (see the https://github.com/nebari-dev/governance/issues/29 and https://github.com/nebari-dev/governance/issues/32).
Absolutely. Yes we need a proper secret storage before adopting something like this to avoid leaking secrets in the argo workflows.
@trallard @dharhas @costrouc Is the backup/restore feature something worth including in a future roadmap? Several of our recent releases have required users to backup/restore their data and this would make that process a lot smoother.
@iameskild yes this is something that should be included in the future roadmap. I'm going to suggest that we use the extension PR work to add a subcommand for backup and restore in a separate nebari repository.
I think that backup/restore is something that we will need to incrementally improve independent of nebari. Also would allow us to release more frequently. I see several iterations that we should aim for:
What to backup:
Where to backup to:
restore should have these similar requirements.
Priority in my mind:
@costrouc creating these as subcommands makes a lot of sense!
And to confirm, we will be relying on the kubernetes (and keycloak) python client directly and won't be using terraform?
And would the NFS backup be a single tar.gz file? Perhaps we could look into restic again. The benefits include only backing up the diff so it would be very quick after the initial backup. As I've done elsewhere, we can perform backups for individual directories (users) so in the event of a failure during the restore, we can pick back up more reliably.
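To make the per-directory idea concrete, here is a rough sketch that emits one `restic backup` command per user directory, so a failure affects only that user's snapshot. The helper name and the `/exports/home` layout are assumptions, and `RESTIC_REPOSITORY` and `RESTIC_PASSWORD` are expected to be set in the environment, as in the workflows earlier in this thread.

```python
from pathlib import Path


def per_user_backup_commands(home_root):
    """Return one `restic backup` argv per user directory under home_root."""
    commands = []
    for user_dir in sorted(Path(home_root).iterdir()):
        # Skip plain files and symlinks (such as per-user links to shared data).
        if user_dir.is_dir() and not user_dir.is_symlink():
            # Tag each snapshot with the user name so it can be restored alone.
            commands.append(["restic", "backup", str(user_dir), "--tag", user_dir.name])
    return commands
```

Each command could then be run in turn, for example `for cmd in per_user_backup_commands("/exports/home"): subprocess.run(cmd, check=True)`, recording which users succeeded so a retry can resume from the first failure.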
We could also include a backup gitops workflow that runs on a daily scheduler.
Will this include extension data such as mlflow?
This will need to be tested on both AWS and AWS GovCloud for JATIC
superseded by #2648
Summary
QHub is currently lacking a backup and restore solution. Initially this problem was fairly simple since all state was stored on a single NFS filestore. We talked about having a kubernetes cron job run restic daily to back up the filesystem to a single s3 bucket. However, there are now starting to be databases and state stored in several other PVCs within QHub. We expect this to grow, so we need a generic solution that allows us to backup/restore all storage within a cluster. We are proposing kubernetes backups using velero, which looks to be a well adopted open source solution for backup and restore.
Proposed implementation
We realize this is a large issue and it will most likely be easiest to approach this problem in steps.
The first step would be to deploy the velero helm chart within QHub. There are other examples of deploying a helm chart within QHub in PRs, this being the most similar one: https://github.com/Quansight/qhub/pull/733. This will only deploy the velero agent on the kubernetes cluster. This should be configured via a qhub-config.yaml configuration setting. The PR above gives an example of adding this setting. There will additionally be a key credentials that takes an arbitrary dict of credentials to pass on to the helm chart. See https://github.com/vmware-tanzu/helm-charts/blob/main/charts/velero/README.md#option-1-cli-commands. These credentials will be used to set up file backups and block storage backups. schedule will control the frequency of regular backups. Backups should be an optional feature that is disabled by default.
Next, once velero is deployed on the cluster, there should be the ability to trigger a backup manually, similar to how we handle terraform: https://github.com/Quansight/qhub/blob/main/qhub/provider/terraform.py#L23. Since velero is a go binary it should be possible to transparently download the velero binary https://github.com/vmware-tanzu/velero/releases/tag/v1.6.1 and expose it in the cli behind a qhub backup and qhub restore command. For now we would like to create a velero provider in https://github.com/Quansight/qhub/tree/main/qhub/provider that can trigger a backup and restore of the qhub storage.
Initially we would like a simple qhub backup and qhub restore command. Eventually we could imagine this command growing into more complicated backups, but we realize this problem is complicated enough as it is scoped.
Additionally there should be documentation added for the admin and dev guide.
Acceptance Criteria
- qhub backup should trigger a manual backup of the cluster with files being backed up to s3 bucket
- qhub restore should trigger a restore action that will refresh the contents of pvcs within cluster (this is less well understood at the moment and may not be possible).
Tasks to complete
Related to