Initial thoughts:
Some more collectors:
kubectl logs daemonset/restic -n velero
velero get restores
velero describe backups --details
velero describe restores --details
kubectl get podvolumebackups -n velero -oyaml
kubectl get podvolumerestores -n velero -oyaml
Important note: when the local volume provider plugin is configured, it will be an init container in the velero deployment, so we should get its logs as well when possible. It also means you'd have to pass -c velero to the kubectl logs command; not sure if that matters with the client-go package.
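For what it's worth, client-go lets you pick the container explicitly via PodLogOptions, so the -c velero requirement should map over cleanly. A minimal sketch, assuming a kubeconfig on disk; the namespace and pod name are just placeholders, a real collector would look the pod up by label:

package main

import (
	"context"
	"fmt"
	"io"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a clientset from the local kubeconfig (placeholder path).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Rough equivalent of `kubectl logs <velero-pod> -n velero -c velero`.
	// PodLogOptions.Container selects the container, so the extra plugin
	// containers in the deployment are not a problem.
	req := client.CoreV1().Pods("velero").GetLogs("velero-787c5b44b9-8vzth", &corev1.PodLogOptions{
		Container: "velero", // same effect as passing -c velero on the CLI
	})
	stream, err := req.Stream(context.Background())
	if err != nil {
		panic(err)
	}
	defer stream.Close()

	logs, err := io.ReadAll(stream)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(logs))
}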
Ideas for analyzers:
I'll add more if I think of anything else.
I'll start getting into this over the next couple of days, once I've familiarised myself with Velero itself 👍
Running out of memory: detect how many objects are in storage and warn if the count is high enough that restic is likely to run out of memory.
Some sample logs from recent support issues that would have been helped by this analyzer (a rough log-matching sketch follows below):
Permissions issues on the backup location:
open /var/velero-local-volume-provider/velero-lvp-471ddcf356bb/restic/default/index/3a62dce588bba9f315ba1b2fa86f2c73781f42a365ef747e609b27f9ac4c943a: permission denied\n: exit status 1"
Files were removed during the backup (e.g. backing up a live database directly instead of taking an application-level backup):
# kubectl logs velero-5cb7cffdc9-8pllw -n velero -f
time="2023-07-20T09:37:09Z" level=error msg="Error backing up item" backup=velero/instance-278xp error="pod volume backup failed: running Restic backup, stderr={\"message_type\":\"error\",\"error\":{},\"during\":\"scan\",\"item\":\"/host_pods/f2ea9531-ddc7-40ee-a80a-a5a2f8373b0f/volumes/kubernetes.io~csi/pvc-e31a4031-8c5a-4449-b852-3612a3cf22ed/mount/lost+found\"}\n{\"message_type\":\"error\",\"error\":{},\"during\":\"archival\",\"item\":\"lost+found\"}\nWarning: at least one source file could not be read\n: exit status 3" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/restic/backupper.go:199" error.function="github.com/vmware-tanzu/velero/pkg/restic.(*backupper).BackupPodVolumes" logSource="pkg/backup/backup.go:417" name=vault-0
time="2023-07-20T09:37:09Z" level=error msg="Error backing up item" backup=velero/instance-278xp error="pod volume backup failed: running Restic backup, stderr={\"message_type\":\"error\",\"error\":{},\"during\":\"scan\",\"item\":\"/host_pods/f2ea9531-ddc7-40ee-a80a-a5a2f8373b0f/volumes/kubernetes.io~csi/pvc-1fd8dceb-52bf-40a5-b400-12119e69fc0a/mount/lost+found\"}\n{\"message_type\":\"error\",\"error\":{},\"during\":\"scan\",\"item\":\"/host_pods/f2ea9531-ddc7-40ee-a80a-a5a2f8373b0f/volumes/kubernetes.io~csi/pvc-1fd8dceb-52bf-40a5-b400-12119e69fc0a/mount/raft\"}\n{\"message_type\":\"error\",\"error\":{},\"during\":\"archival\",\"item\":\"lost+found\"}\n{\"message_type\":\"error\",\"error\":{\"Op\":\"open\",\"Path\":\"node-id\",\"Err\":13},\"during\":\"archival\",\"item\":\"/host_pods/f2ea9531-ddc7-40ee-a80a-a5a2f8373b0f/volumes/kubernetes.io~csi/pvc-1fd8dceb-52bf-40a5-b400-12119e69fc0a/mount/node-id\"}\n{\"message_type\":\"error\",\"error\":{},\"during\":\"archival\",\"item\":\"raft\"}\n{\"message_type\":\"error\",\"error\":{\"Op\":\"open\",\"Path\":\"vault.db\",\"Err\":13},\"during\":\"archival\",\"item\":\"/host_pods/f2ea9531-ddc7-40ee-a80a-a5a2f8373b0f/volumes/kubernetes.io~csi/pvc-1fd8dceb-52bf-40a5-b400-12119e69fc0a/mount/vault.db\"}\nWarning: at least one source file could not be read\n: exit status 3" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/restic/backupper.go:199" error.function="github.com/vmware-tanzu/velero/pkg/restic.(*backupper).BackupPodVolumes" logSource="pkg/backup/backup.go:417" name=vault-0
Inconsistent state between the BackupStorageLocation and the object store that Velero uses; we resolved this by manually deleting the default restic repository and restarting Velero so that it would be recreated.
running Restic backup, stderr=Fatal: invalid id "7278389d": no matching ID found for prefix "7278389d"
Some other ideas:
error level=info msg="stderr: /bin/bash: line 1: 20109 Killed
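A rough sketch of what a log-signature analyzer could look like, matching the velero pod logs against the failure patterns quoted above. The regexes and messages here are only illustrative starting points drawn from these samples, not the actual analyzer:

package analyzer

import "regexp"

// veleroLogSignatures maps known failure patterns from velero/restic logs to
// a human-readable explanation. The patterns come from the sample support
// issues above; a real analyzer would grow and refine this list.
var veleroLogSignatures = []struct {
	pattern *regexp.Regexp
	message string
}{
	{regexp.MustCompile(`restic/.*: permission denied`), "Permissions issue on the backup location"},
	{regexp.MustCompile(`Warning: at least one source file could not be read`), "Files changed or were removed during backup (e.g. a database without an application-level backup)"},
	{regexp.MustCompile(`no matching ID found for prefix`), "BackupStorageLocation and restic repository are out of sync; the repository may need to be recreated"},
	{regexp.MustCompile(`\d+ Killed`), "Restic process was killed, likely out of memory"},
}

// AnalyzeVeleroLogs returns the explanation for every known signature found
// in the collected velero logs.
func AnalyzeVeleroLogs(logs string) []string {
	var findings []string
	for _, sig := range veleroLogSignatures {
		if sig.pattern.MatchString(logs) {
			findings = append(findings, sig.message)
		}
	}
	return findings
}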
@adamancini Velero has a velero debug command which generates a Velero debug bundle. You might want to check that out and see whether it collects anything we don't collect yet. It's a command run on the host, so I think it can only be run as a host collector, unless we add Velero as a dependency (last option IMO).
The codebase is https://github.com/vmware-tanzu/velero/blob/main/pkg/cmd/cli/debug/debug.go
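If it did end up as a host collector, the simplest sketch is just shelling out to the CLI on the host. This is purely illustrative and assumes velero is on the PATH with access to a kubeconfig:

package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// Rough equivalent of what a host collector would do: run `velero debug`
	// on the host and capture its combined output. The command itself writes
	// a debug bundle tarball (by default into the working directory).
	out, err := exec.Command("velero", "debug").CombinedOutput()
	fmt.Println(string(out))
	if err != nil {
		fmt.Println("velero debug failed:", err)
	}
}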
Recently discovered the following while working a support issue:
Error getting volume snapshotter for volume snapshot location
When this error is thrown, it means that the particular volume won't be backed up, likely due to a plugin issue for a particular storage provider. The full error was:
time="2023-08-23T03:10:38Z" level=error msg="Error getting volume snapshotter for volume snapshot location" backup=velero/my-backup-22-08-6 error="rpc error: code = Unknown desc = faile to get address for maya-apiserver/cvc-server service" error.file="/home/travis/gopath/src/github.com/openebs/velero-plugin/pkg/cstor/cstor.go:233" error.function="github.com/openebs/velero-plugin/pkg/cstor.(*Plugin).Init" logSource="pkg/backup/item_backupper.go:524" name=pvc-3e3ada5e-2361-48c7-bcd6-366b698c6207 namespace= persistentVolume=pvc-3e3ada5e-2361-48c7-bcd6-366b698c6207 resource=persistentvolumes volumeSnapshotLocation=local-default
The OpenEBS plugin, which only has support for cStor and not LocalPV, was being used instead of a filesystem backup.
Example support bundle layout with the velero commands collected:
ada@ada-kurl:~/support-bundle-2023-09-19T17_40_36$ tree -L 4
.
├── analysis.json
├── cluster-resources
│   └── pods
│       └── logs
│           └── velero
├── execution-data
│   └── summary.txt
├── velero
│   ├── backuprepositories
│   │   ├── default-default-restic-6bwck.yaml
│   │   └── kurl-default-restic-2lfp4.yaml
│   ├── backups
│   │   ├── annarchy-mfvpt.yaml
│   │   ├── instance-f2m6f.yaml
│   │   └── instance-g9ccf.yaml
│   ├── backupstoragelocations
│   │   └── default.yaml
│   ├── describe-backups-errors.json
│   ├── describe-backups-stderr.txt
│   ├── describe-restores-errors.json
│   ├── describe-restores-stderr.txt
│   ├── get-backups.yaml
│   ├── get-restores.yaml
│   ├── logs
│   │   ├── node-agent-j4zvz
│   │   │   └── node-agent.log -> ../../../cluster-resources/pods/logs/velero/node-agent-j4zvz/node-agent.log
│   │   └── velero-787c5b44b9-8vzth
│   │       ├── replicated-kurl-util.log -> ../../../cluster-resources/pods/logs/velero/velero-787c5b44b9-8vzth/replicated-kurl-util.log
│   │       ├── replicated-local-volume-provider.log -> ../../../cluster-resources/pods/logs/velero/velero-787c5b44b9-8vzth/replicated-local-volume-provider.log
│   │       ├── velero.log -> ../../../cluster-resources/pods/logs/velero/velero-787c5b44b9-8vzth/velero.log
│   │       ├── velero-velero-plugin-for-aws.log -> ../../../cluster-resources/pods/logs/velero/velero-787c5b44b9-8vzth/velero-velero-plugin-for-aws.log
│   │       ├── velero-velero-plugin-for-gcp.log -> ../../../cluster-resources/pods/logs/velero/velero-787c5b44b9-8vzth/velero-velero-plugin-for-gcp.log
│   │       └── velero-velero-plugin-for-microsoft-azure.log -> ../../../cluster-resources/pods/logs/velero/velero-787c5b44b9-8vzth/velero-velero-plugin-for-microsoft-azure.log
│   ├── podvolumebackups
│   │   ├── annarchy-mfvpt-tclgv.yaml
│   │   ├── instance-f2m6f-2rdkx.yaml
│   │   ├── instance-f2m6f-5tgvb.yaml
│   │   ├── instance-f2m6f-xf6z9.yaml
│   │   ├── instance-f2m6f-xxh6m.yaml
│   │   ├── instance-g9ccf-qpfn2.yaml
│   │   ├── instance-g9ccf-qt9wh.yaml
│   │   └── instance-g9ccf-w6mgw.yaml
│   ├── podvolumerestores
│   │   └── annarchy-mfvpt-h5f2c.yaml
│   └── restores
│       └── annarchy-mfvpt.yaml
└── version.yaml
Installing an older Velero (1.9.x) to check custom resource differences (see the discovery sketch after the output):
ada@ada-velero-collector:~$ kubectl api-resources | grep velero
backups                                 velero.io/v1   true   Backup
backupstoragelocations    bsl           velero.io/v1   true   BackupStorageLocation
deletebackuprequests                    velero.io/v1   true   DeleteBackupRequest
downloadrequests                        velero.io/v1   true   DownloadRequest
podvolumebackups                        velero.io/v1   true   PodVolumeBackup
podvolumerestores                       velero.io/v1   true   PodVolumeRestore
resticrepositories                      velero.io/v1   true   ResticRepository
restores                                velero.io/v1   true   Restore
schedules                               velero.io/v1   true   Schedule
serverstatusrequests      ssr           velero.io/v1   true   ServerStatusRequest
volumesnapshotlocations                 velero.io/v1   true   VolumeSnapshotLocation
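Newer Velero replaces resticrepositories with backuprepositories (the support bundle tree further up has a backuprepositories directory), so the collector probably shouldn't hard-code either name. A rough, purely illustrative sketch using the discovery client to see what the cluster actually serves:

package main

import (
	"fmt"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// List whatever velero.io/v1 resources are registered, so the collector can
	// pick up resticrepositories on 1.9.x and backuprepositories on newer
	// releases without hard-coding either.
	resources, err := dc.ServerResourcesForGroupVersion("velero.io/v1")
	if err != nil {
		panic(err)
	}
	for _, r := range resources.APIResources {
		fmt.Println(r.Name)
	}
}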
Analyzer work: https://github.com/replicatedhq/troubleshoot/pull/1366
Describe the rationale for the suggested feature.
Velero is a toolset for backing up and restoring Kubernetes resources and persistent volumes. Troubleshoot and Velero are commonly used in the same Kubernetes clusters, yet there is often little information collected about, or analysis performed on, the state of Velero in those environments.
Describe the feature
A Velero Collector and Analyzer can be added to Support Bundle and Preflight specs.
The Velero Collector can collect information such as the following (see the sketch after this list):
kubectl logs deploy/velero -n <velero-namespace> -c velero
kubectl get bsl -n <velero-namespace>
kubectl get bsl -n <velero-namespace> -oyaml
kubectl get resticrepositories -n <velero-namespace>
kubectl get resticrepositories -n <velero-namespace> -oyaml
velero get backups
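As a sketch of how the -oyaml parts above could be gathered with client-go's dynamic client; the namespace and resource names are illustrative, not the final implementation:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
	"sigs.k8s.io/yaml"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dyn := dynamic.NewForConfigOrDie(cfg)

	// Rough equivalent of `kubectl get bsl -n velero -oyaml` and
	// `kubectl get resticrepositories -n velero -oyaml`.
	for _, resource := range []string{"backupstoragelocations", "resticrepositories"} {
		gvr := schema.GroupVersionResource{Group: "velero.io", Version: "v1", Resource: resource}
		list, err := dyn.Resource(gvr).Namespace("velero").List(context.Background(), metav1.ListOptions{})
		if err != nil {
			// Resource may not exist on this Velero version; skip rather than fail.
			fmt.Println("skipping", resource, ":", err)
			continue
		}
		out, err := yaml.Marshal(list.UnstructuredContent())
		if err != nil {
			panic(err)
		}
		fmt.Printf("--- %s ---\n%s\n", resource, out)
	}
}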
The Velero Analyzer can provide the following analysis:
Additional context