Open nknoii opened 7 months ago
@nknoii
From the command line you are using, you have specified both --default-volumes-to-fs-backup and --snapshot-move-data. --default-volumes-to-fs-backup activates the fs-backup method, while --snapshot-move-data selects the CSI snapshot data movement backup method. If both are specified, under the current logic, fs-backup takes precedence.
However, fs-backup is not a consistent backup type, so your backup encountered data inconsistency.
Therefore, you need to remove --default-volumes-to-fs-backup so that you get a crash-consistent backup with the CSI snapshot data movement backup method. Of course, CSI snapshot data movement backup requires that your CSI driver supports CSI snapshots.
We probably need to add some checks to prevent these two backup flags from coexisting.
Hey @Lyndon-Li
$ velero backup create my-backup --include-namespaces postgres --snapshot-volumes --snapshot-move-data
I ran this command. Backup and restore completed successfully, but I'm still facing the same issue where tables are not restored.
PS: I'm using an AWS S3 bucket for the backup.
Please collect Velero logs by running velero debug and share the log bundle with us.
From the log bundle, there are no logs for data movement backup/restore activities.
From the restore log, a restore was launched for backup my-backup-20240306101305. It looks like the backup is a data movement backup, but the DataUpload was not generated. Probably the data movement backup for the PVC was not completed.
time="2024-03-06T04:43:07Z" level=warning msg="Got 0 DataUpload result. Expect one." error="dataupload result number is not expected" logSource="pkg/restore/restore.go:2013" restore=velero/my-backup-20240306101305
time="2024-03-06T04:43:07Z" level=info msg="Start DataMover restore." Action=PVCRestoreItemAction PVC=postgres/postgresql-persistent-storage-postgresql-0 Restore=velero/my-backup-20240306101305 cmd=/plugins/velero-plugin-for-csi logSource="/go/src/velero-plugin-for-csi/internal/restore/pvc_action.go:174" pluginName=velero-plugin-for-csi restore=velero/my-backup-20240306101305
time="2024-03-06T04:43:07Z" level=warning msg="PVC doesn't have a DataUpload for data mover. Return." Action=PVCRestoreItemAction PVC=postgres/postgresql-persistent-storage-postgresql-0 Restore=velero/my-backup-20240306101305 cmd=/plugins/velero-plugin-for-csi logSource="/go/src/velero-plugin-for-csi/internal/restore/pvc_action.go:180" pluginName=velero-plugin-for-csi restore=velero/my-backup-20240306101305
@nknoii Please share the log bundle for backup my-backup-20240306101305 so that we can troubleshoot further (if the restore is to a different cluster, the backup log will be on the source cluster). Or you can run a new backup on the source cluster, make sure there is no warning about the PVC being skipped, and then run another restore.
postgres-backup.log: I did a new backup and there were some skipping warnings.
The volume is built on hostPath, which is not supported by data movement backup or fs-backup:
time="2024-03-07T03:28:45Z" level=info msg="Skipping PVC postgres/postgresql-persistent-storage-postgresql-0, associated PV postgresql-pv is not a CSI volume" backup=velero/postgres-backup cmd=/plugins/velero-plugin-for-csi logSource="/go/src/velero-plugin-for-csi/internal/backup/pvc_action.go:99" pluginName=velero-plugin-for-csi
time="2024-03-07T03:28:47Z" level=warning msg="Volume postgresql-persistent-storage in pod postgres/postgresql-0 is a hostPath volume which is not supported for pod volume backup, skipping" backup=velero/postgres-backup logSource="pkg/podvolume/backupper.go:267" name=postgresql-0 namespace=postgres resource=pods
I see. Velero supports local persistent volumes, right? So I changed it to that, but it's still skipping the PVC.
```yaml
apiVersion: v1
kind: Service
metadata:
  name: postgresql
  namespace: postgres
  labels:
    app: postgresql
spec:
  ports:
  - port: 5432
  selector:
    app: postgresql
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: postgresql-pv
spec:
  capacity:
    storage: 1Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /var/lib/postgresql/data
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - kind-control-plane
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgresql-pvc
  namespace: postgres
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: local-storage
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgresql
  namespace: postgres
spec:
  selector:
    matchLabels:
      app: postgresql
  serviceName: postgresql
  replicas: 1
  template:
    metadata:
      labels:
        app: postgresql
    spec:
      containers:
      - name: postgresql
        image: postgres:latest
        ports:
        - containerPort: 5432
        env:
        - name: POSTGRES_PASSWORD
          value: password
        volumeMounts:
        - name: postgresql-persistent-storage
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
  - metadata:
      name: postgresql-persistent-storage
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi
```
I see, velero supports local persistent volumes right?
The data movement backup doesn't support local PVs; fs-backup supports them, but fs-backup is not consistent.
If it doesn't support local PVs, what should I do? Sorry, I didn't get you.
@nknoii, if you can use CSI PVs, snapshot data mover backups can be done. They take a CSI snapshot first and then back up the data. But this method cannot be used for PVs that do not support CSI snapshots. For example, "local" type PVs (https://kubernetes.io/docs/concepts/storage/volumes/#local) do not support snapshots, so you can't use the snapshot data mover method.
You can, however, use the "File system backup" method (enabled by the option "--default-volumes-to-fs-backup"); the only issue there, as @Lyndon-Li pointed out, is that the backup is taken from the live PV. So your Postgres data may be backed up in an inconsistent state, depending on DB activity. Note that you can use Velero hooks (https://velero.io/docs/v1.13/backup-hooks/) to quiesce the data.
Hope this helps.
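For illustration, here is a minimal sketch of such hooks as pod annotations. The container name and mount path are taken from the manifests earlier in this thread; fsfreeze is just one way to quiesce, and it requires the container to run with enough privilege to freeze the filesystem.

```yaml
metadata:
  annotations:
    # Freeze the data filesystem just before Velero backs up the pod's volumes...
    pre.hook.backup.velero.io/container: postgresql
    pre.hook.backup.velero.io/command: '["/sbin/fsfreeze", "--freeze", "/var/lib/postgresql/data"]'
    # ...and thaw it immediately afterwards.
    post.hook.backup.velero.io/container: postgresql
    post.hook.backup.velero.io/command: '["/sbin/fsfreeze", "--unfreeze", "/var/lib/postgresql/data"]'
```

A database-aware alternative is to run a SQL command in the pre hook instead of freezing the filesystem, depending on how strict your consistency requirements are.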
@Lyndon-Li "Probably, we need to add some checks to prevent these two backup flags coexist." Actually, we probably don't want that. default-volumes-to-fs-backup just specifies whether we're using opt-in or opt-out. If you have a bunch of volumes you want to use fs-backup for (not compatible with CSI volumes), but a few CSI volumes you want to use the data mover with, you'd use all three flags and then annotate the data mover volumes to opt in.
The point is that the flags are not contradictory; collectively, using all three means "by default, use fs-backup with individual volume opt-out, and for those opted-out volumes, snapshot the data and move it to object storage via the data mover."
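As a hedged sketch of that combination (the volume name "data" is a placeholder): opt the CSI-backed volume out of fs-backup via a pod annotation so it falls through to the data mover, then pass all three flags when creating the backup.

```yaml
# Pod (or pod template) metadata; "data" is a hypothetical volume name.
metadata:
  annotations:
    backup.velero.io/backup-volumes-excludes: data
```

The backup itself would then be created with something like: velero backup create mixed-backup --default-volumes-to-fs-backup --snapshot-volumes --snapshot-move-data.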
I tested by using '--default-volumes-to-fs-backup', but I'm still encountering the same issue. postgres data is not being restored
@sseago
These logics are correct as Velero's current behavior; that is, default-volumes-to-fs-backup + opt-in/opt-out + CSI plugin availability/feature enablement + snapshot-move-data work together to decide the final backup method.
And the current behavior is also comprehensive: as in the case you mentioned, some volumes support CSI snapshot and some don't.
However, we achieve this comprehensiveness by leaving the complexity to users; they need to work out the complex-to-understand and less user-friendly combination of default-volumes-to-fs-backup + opt-in/opt-out + CSI plugin availability/feature enablement + snapshot-move-data.
About user-friendliness, simply speaking, we need to consider users' true priority. For example, if they specify default-volumes-to-fs-backup + snapshot-move-data, what users want is: run a consistent backup wherever possible, and use fs-backup for the rest. So for the volumes that support CSI snapshot, snapshot-move-data should take preference even though opt-out is not specified. However, at present, Velero just takes fs-backup for them. This is a big topic since it would mean changing the entire behavior, and it is out of the scope of the current issue. So let's continue this thinking and discussion; at present, my immediate idea is that at least we need to block/warn users when these two flags come together.
@nknoii can you make sure PVs can be dynamically provisioned by the local volume provisioner in your cluster? Checking the StatefulSet YAML you provided above, I didn't see a storageClassName field in the volumeClaimTemplates section.
Please also provide us with a log bundle for better troubleshooting.
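For reference, the missing field would go in the volumeClaimTemplates section of the StatefulSet posted earlier, e.g.:

```yaml
volumeClaimTemplates:
- metadata:
    name: postgresql-persistent-storage
  spec:
    accessModes: ["ReadWriteOnce"]
    storageClassName: local-storage  # this field was missing
    resources:
      requests:
        storage: 1Gi
```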
Yes, I missed the storageClassName. BTW, I've now tried installing it with a Helm chart instead.
$ helm install postgres bitnami/postgresql --namespace postgres
$ velero backup create postgres-backup --include-namespaces postgres --default-volumes-to-fs-backup
Name: postgres-backup
Namespace: velero
Labels: velero.io/storage-location=dev-velero
Annotations: velero.io/resource-timeout=10m0s
velero.io/source-cluster-k8s-gitversion=v1.27.3
velero.io/source-cluster-k8s-major-version=1
velero.io/source-cluster-k8s-minor-version=27
Phase: Completed
Warnings:
Velero:
time="2024-03-10T09:24:22Z" level=info msg="Setting up backup temp file" backup=velero/postgres-backup logSource="pkg/controller/backup_controller.go:617"
time="2024-03-10T09:24:22Z" level=info msg="Setting up plugin manager" backup=velero/postgres-backup logSource="pkg/controller/backup_controller.go:624"
time="2024-03-10T09:24:22Z" level=info msg="Getting backup item actions" backup=velero/postgres-backup logSource="pkg/controller/backup_controller.go:628"
time="2024-03-10T09:24:22Z" level=info msg="Setting up backup store to check for backup existence" backup=velero/postgres-backup logSource="pkg/controller/backup_controller.go:633"
time="2024-03-10T09:24:23Z" level=info msg="Writing backup version file" backup=velero/postgres-backup logSource="pkg/backup/backup.go:197"
time="2024-03-10T09:24:23Z" level=info msg="Including namespaces: postgres" backup=velero/postgres-backup logSource="pkg/backup/backup.go:203"
time="2024-03-10T09:24:23Z" level=info msg="Excluding namespaces:
Then I deleted the postgres namespace and restored it.
$ velero restore create --from-backup postgres-backup
Name: postgres-backup-20240310150017
Namespace: velero
Labels:
Once the restore completed, I tried to log in to the database, but it says the password is wrong. And also, there's a 'Defaulted container "postgresql" out of: postgresql, restore-wait (init)' message when I tried to access it.
$ kubectl exec -it -n postgres postgres-postgresql-0 -- psql -U postgres
Defaulted container "postgresql" out of: postgresql, restore-wait (init)
Password for user postgres:
psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: FATAL: password authentication failed for user "postgres"
command terminated with exit code 2
@Lyndon-Li regarding this: "About user-friendly, simply speaking, we need to consider users' true priority. For example, if they specify default-volumes-to-fs-backup + snapshot-move-data, what users want is --- I want to run consistent backup as more as possible, for others, I use fs-backup. So for the volumes that support CSI snapshot, snapshot-move-data should take preference even though opt-out is not specified. However, at present, Velero just takes fs-backup for them."
Note that @shubham-pampattiwar's current work on adding volume policies for fs-backup and snapshot will enable this. One of the primary example use cases there is "use snapshot when available, fs-backup when not".
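As a rough sketch of how that might eventually look (the exact schema of the resource-policies ConfigMap for backup actions is an assumption here, since that work was still in progress at the time): a policy that uses fs-backup for the local-storage class while other volumes take the snapshot path.

```yaml
# Hypothetical resource-policies ConfigMap data:
version: v1
volumePolicies:
- conditions:
    storageClass:
    - local-storage
  action:
    type: fs-backup   # volumes not matched here would use the snapshot path
```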
What steps did you take and what happened:
What did you expect to happen: After restoring the backup into a new cluster, the tables and data should be available for access in the new cluster.
If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)
$ kubectl exec -it -n postgres postgresql-0 -- psql -U postgres -d postgres
Type "help" for help.

postgres=# \d
        List of relations
 Schema |    Name     |   Type   |  Owner
--------+-------------+----------+----------
 public | t_50000     | table    | postgres
 public | test        | table    | postgres
 public | test_id_seq | sequence | postgres
(3 rows)
Backup request "my-backup" submitted successfully.
Run velero backup describe my-backup or velero backup logs my-backup for more details.

Name:         my-backup
Namespace:    velero
Labels:       velero.io/storage-location=velero-dev
Annotations:  velero.io/resource-timeout=10m0s
              velero.io/source-cluster-k8s-gitversion=v1.27.3
              velero.io/source-cluster-k8s-major-version=1
              velero.io/source-cluster-k8s-minor-version=27

Phase:  Completed

Warnings:
  Velero:
  Cluster:
  Namespaces:
    postgres:  resource: /pods name: /postgresql-0

Namespaces:
  Included:  postgres
  Excluded:

Resources:
  Included:        *
  Excluded:
  Cluster-scoped:  auto

Label selector:
Or label selector:

Storage Location:  velero-dev

Velero-Native Snapshot PVs:  true
Snapshot Move Data:          true
Data Mover:                  velero

TTL:  336h0m0s

CSISnapshotTimeout:    10m0s
ItemOperationTimeout:  4h0m0s

Hooks:

Backup Format Version:  1.1.0

Started:     2024-03-05 20:17:33 +0330 +0330
Completed:   2024-03-05 20:17:42 +0330 +0330
Expiration:  2024-03-19 20:17:33 +0330 +0330

Total items to be backed up:  16
Items backed up:              16

Velero-Native Snapshots:
psql (16.2 (Debian 16.2-1.pgdg120+2))
Type "help" for help.

postgres=# /d
postgres=# select * from test;
ERROR:  relation "test" does not exist
LINE 1: select * from test;
                      ^
postgres=#