planetscale / vitess-operator

Kubernetes Operator for Vitess

Add the ability to automate and schedule backups #553

Closed frouioui closed 5 months ago

frouioui commented 6 months ago

Description

This Pull Request adds a new CRD called VitessBackupSchedule. Its main goal is to automate and schedule backups of Vitess, taking backups of the Vitess cluster at regular intervals based on a given cron schedule and strategy. Like most other components of the vitess-operator, this new CRD is managed by the VitessCluster: the VitessCluster controller is responsible for the whole lifecycle (creation, update, deletion) of the VitessBackupSchedule objects in the cluster. Inside the VitessCluster it is possible to define several VitessBackupSchedules as a list, allowing for multiple concurrent backup schedules.

Among other things, the VitessBackupSchedule object is responsible for creating Kubernetes Jobs at the desired times, based on the user-defined schedule. It also keeps track of older jobs and deletes them when they become too old, according to the user-defined parameters (successfulJobsHistoryLimit & failedJobsHistoryLimit). The jobs created by the VitessBackupSchedule object use the vtctld Docker image and execute a shell command that is generated from the user-defined strategies. The end user can define as many backup strategies per schedule as they want; each strategy mirrors what vtctldclient is able to do: the Backup and BackupShard commands are available, and a map of extra flags lets the user pass as many flags as they want to vtctldclient.
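To make that concrete, here is a minimal sketch of a single schedule entry that forwards extra flags to vtctldclient. The field name extraFlags and the allow-primary flag are assumptions used for illustration, not a verbatim excerpt of the final API; the idea is that each key/value pair gets appended to the generated command as --key=value.

    schedules:
      - name: "hourly-customer"
        schedule: "0 * * * *"
        strategies:
          - name: BackupShard
            keyspaceShard: "customer/80-"
            extraFlags:              # hypothetical field name for the flag map
              allow-primary: "true"  # forwarded to the generated vtctldclient command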

A new end-to-end test is added to our BuildKite pipeline as part of this Pull Request to test the proper behavior of this new CRD.

Related PRs

Demonstration

For this demonstration I have set up a Vitess cluster by following the steps in the getting started guide, up to the very last step where we must apply the 306_down_shard_0.yaml file. My cluster is then composed of 2 keyspaces: customer with 2 shards, and commerce, which is unsharded. I then modify the 306_down_shard_0.yaml file to contain the new backup schedules, as seen in the snippet right below. We want to create two schedules, one for each keyspace. The customer keyspace will have two backup strategies: one for each shard.

apiVersion: planetscale.com/v2
kind: VitessCluster
metadata:
  name: example
spec:
  backup:
    engine: xtrabackup
    locations:
      - volume:
          hostPath:
            path: /backup
            type: Directory
    schedules:
      - name: "every-minute-customer"
        schedule: "* * * * *"
        resources:
          requests:
            cpu: 100m
            memory: 1024Mi
          limits:
            memory: 1024Mi
        successfulJobsHistoryLimit: 2
        failedJobsHistoryLimit: 3
        strategies:
          - name: BackupShard
            keyspaceShard: "customer/-80"
          - name: BackupShard
            keyspaceShard: "customer/80-"
      - name: "every-minute-commerce"
        schedule: "* * * * *"
        resources:
          requests:
            cpu: 100m
            memory: 1024Mi
          limits:
            memory: 1024Mi
        successfulJobsHistoryLimit: 2
        failedJobsHistoryLimit: 3
        strategies:
          - name: BackupShard
            keyspaceShard: "commerce/-"
  images:

Once the cluster is stable and all tablets are serving and ready, I re-apply my YAML file with the backup configuration:

$ kubectl apply -f test/endtoend/operator/306_down_shard_0.yaml 
vitesscluster.planetscale.com/example configured

Immediately, I can check that the new VitessBackupSchedule objects have been created.

$ kubectl get VitessBackupSchedule 
NAME                                          AGE
example-vbsc-every-minute-commerce-ac6ff735   7s
example-vbsc-every-minute-customer-8aaaa771   7s

Now I want to check the pods where the jobs created by VitessBackupSchedule are running. After about 2 minutes, we can see four pods, two for each schedule. The pods are marked as Completed once they have finished their jobs.

$ kubectl get pods
NAME                                                           READY   STATUS             RESTARTS        AGE
...
example-vbsc-every-minute-commerce-ac6ff735-1715897700-nkfzx   0/1     Completed          0              79s
example-vbsc-every-minute-commerce-ac6ff735-1715897760-qr4hp   0/1     Completed          0              19s
example-vbsc-every-minute-customer-8aaaa771-1715897700-rbsmd   0/1     Completed          0              79s
example-vbsc-every-minute-customer-8aaaa771-1715897760-kzn8t   0/1     Completed          0              19s
...

Now let's check our backups:

$ ls -l vtdataroot/backup/example/commerce/- vtdataroot/backup/example/customer/80- vtdataroot/backup/example/customer/-80 

vtdataroot/backup/example/commerce/-:
total 0
drwxr-xr-x  11 florentpoinsard  staff  352 May 16 16:15 2024-05-16.221502.zone1-0790125915
drwxr-xr-x  11 florentpoinsard  staff  352 May 16 16:16 2024-05-16.221602.zone1-0790125915

vtdataroot/backup/example/customer/-80:
total 0
drwxr-xr-x  11 florentpoinsard  staff  352 May 16 16:15 2024-05-16.221502.zone1-2289928654
drwxr-xr-x  11 florentpoinsard  staff  352 May 16 16:16 2024-05-16.221601.zone1-2289928654

vtdataroot/backup/example/customer/80-:
total 0
drwxr-xr-x  11 florentpoinsard  staff  352 May 16 16:15 2024-05-16.221511.zone1-4277914223
drwxr-xr-x  10 florentpoinsard  staff  320 May 16 16:16 2024-05-16.221609.zone1-2298643297

$ kubectl get vtb --no-headers
example-commerce-x-x-20240516-221502-2f185d5b-1854be28    2m7s
example-commerce-x-x-20240516-221602-2f185d5b-0a248174    67s
example-customer-80-x-20240516-221511-fefbca6f-8ede9c7d   2m7s
example-customer-80-x-20240516-221609-89028361-d9d1c1e4   67s
example-customer-x-80-20240516-221502-887d89ce-2fc618f4   2m7s
example-customer-x-80-20240516-221601-887d89ce-5b5b0acb   66s
frouioui commented 5 months ago

In commit bc74ab4, I have applied one of the most important suggestions discussed above, which is to remove the BackupTablet strategy in favor of BackupKeyspace and BackupCluster. The strategies can be used as follows:

# BackupKeyspace
        strategies:
          - name: BackupKeyspace
            cluster: "example"
            keyspace: "customer"
# BackupCluster
        strategies:
          - name: BackupCluster
            cluster: "example"
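Putting it together, a complete schedule entry using BackupCluster might look like the sketch below; it only reuses fields already shown in this Pull Request (schedule, history limits, strategies) and is not an excerpt from the test files:

    schedules:
      - name: "nightly-cluster"
        schedule: "0 3 * * *"
        successfulJobsHistoryLimit: 2
        failedJobsHistoryLimit: 3
        strategies:
          - name: BackupCluster
            cluster: "example"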

Meanwhile, the BackupShard strategy does not change. When run, we can see the following command-line arguments in the job's pod, which get executed upon creation of the container:

# BackupKeyspace
Args:
      /bin/sh
      -c
      /vt/bin/vtctldclient --server=example-vtctld-625ee430:15999 BackupShard customer/-80 && /vt/bin/vtctldclient --server=example-vtctld-625ee430:15999 BackupShard customer/80-
# BackupCluster
Args:
      /bin/sh
      -c
      /vt/bin/vtctldclient --server=example-vtctld-625ee430:15999 BackupShard commerce/- && /vt/bin/vtctldclient --server=example-vtctld-625ee430:15999 BackupShard customer/-80 && /vt/bin/vtctldclient --server=example-vtctld-625ee430:15999 BackupShard customer/80-

cc @maxenglander @mattlord

frouioui commented 5 months ago

another thought, might be nice to give users a way to assign annotations, and one or more affinity selection options, to the backup runner pods. that way they can influence things like scheduling and eviction.

for example, users might not want backup runner pods running on the same nodes as vttablet pods. and they might not want the backup runner pods to get evicted by an unrelated pod after they've been running for a long time.

In e6946fb I have added affinity and annotations in the VitessBackupScheduleTemplate, allowing the user to configure the affinity and annotations they want for their pods that take backups.
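As a rough sketch, such a schedule entry could look like the following under spec.backup.schedules. The exact placement of affinity and annotations inside the entry is assumed from the VitessBackupScheduleTemplate change, and the annotation key and label selector are made up for illustration; the anti-affinity term expresses the "keep backup runners away from vttablet pods" preference mentioned above:

    schedules:
      - name: "every-minute-commerce"
        schedule: "* * * * *"
        annotations:
          example.com/owner: "backup-runner"   # hypothetical annotation
        affinity:
          podAntiAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 100
                podAffinityTerm:
                  topologyKey: kubernetes.io/hostname
                  labelSelector:
                    matchLabels:
                      planetscale.com/component: vttablet   # illustrative label selector
        strategies:
          - name: BackupShard
            keyspaceShard: "commerce/-"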