replicatedhq / troubleshoot

Preflight Checks and Support Bundles Framework for Kubernetes Applications
https://troubleshoot.sh
Apache License 2.0

Velero Collector and Analyzer #806

Closed: diamonwiggins closed this issue 1 year ago

diamonwiggins commented 2 years ago

Describe the rationale for the suggested feature.

Velero is a toolset for backing up and restoring Kubernetes resources and persistent volumes. Troubleshoot and Velero are commonly used in the same Kubernetes clusters, yet there is often little information collected about, or analysis performed on, the state of Velero in those environments.

Describe the feature

A Velero Collector and Analyzer can be added to Support Bundle and Preflight specs.

apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
  name: velero
spec:
  collectors:
    - velero: {}
  analyzers:
    - velero: {}

The Velero Collector can collect information such as:

The Velero Analyzer can provide the following analysis:

Additional context

sgalsaleh commented 2 years ago

initial thoughts:

some more collectors:

important note: when the local-volume-provider plugin is configured, it runs as an init container in the velero deployment, so we should collect its logs as well when possible (see the sketch below). that also means you'd have to pass -c velero to the kubectl logs command to get the main container's logs; not sure if that matters with the client-go package.
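
A minimal sketch of how the existing logs collector could pick up both containers today. The component=velero label and the local-volume-provider container name are assumptions; actual labels and plugin init-container names vary by installation:

apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
  name: velero-logs
spec:
  collectors:
    - logs:
        name: velero/logs
        namespace: velero
        selector:
          - component=velero
        # With no containerNames, the logs collector gathers every container
        # in the matched pods; listing them pins collection to the main
        # container and the (assumed) plugin init container.
        containerNames:
          - velero
          - local-volume-provider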

ideas for analyzers:

i'll add more if i think of anything else.

CpuID commented 1 year ago

I'll start getting into this over the next couple of days, once I familiarise myself with Velero itself first 👍

banjoh commented 1 year ago

Running out of memory: detect how many objects are in storage and warn if the count is high enough that restic is likely to run out of memory.

xavpaice commented 1 year ago

Some sample logs from recent support issues that this analyzer would have helped with:

Permission issues on the backup location:

open /var/velero-local-volume-provider/velero-lvp-471ddcf356bb/restic/default/index/3a62dce588bba9f315ba1b2fa86f2c73781f42a365ef747e609b27f9ac4c943a: permission denied\n: exit status 1"

Files were removed during backup (e.g. backing up a live database without an application-level backup):

# kubectl logs velero-5cb7cffdc9-8pllw -n velero -f

time="2023-07-20T09:37:09Z" level=error msg="Error backing up item" backup=velero/instance-278xp error="pod volume backup failed: running Restic backup, stderr={\"message_type\":\"error\",\"error\":{},\"during\":\"scan\",\"item\":\"/host_pods/f2ea9531-ddc7-40ee-a80a-a5a2f8373b0f/volumes/kubernetes.io~csi/pvc-e31a4031-8c5a-4449-b852-3612a3cf22ed/mount/lost+found\"}\n{\"message_type\":\"error\",\"error\":{},\"during\":\"archival\",\"item\":\"lost+found\"}\nWarning: at least one source file could not be read\n: exit status 3" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/restic/backupper.go:199" error.function="github.com/vmware-tanzu/velero/pkg/restic.(*backupper).BackupPodVolumes" logSource="pkg/backup/backup.go:417" name=vault-0
time="2023-07-20T09:37:09Z" level=error msg="Error backing up item" backup=velero/instance-278xp error="pod volume backup failed: running Restic backup, stderr={\"message_type\":\"error\",\"error\":{},\"during\":\"scan\",\"item\":\"/host_pods/f2ea9531-ddc7-40ee-a80a-a5a2f8373b0f/volumes/kubernetes.io~csi/pvc-1fd8dceb-52bf-40a5-b400-12119e69fc0a/mount/lost+found\"}\n{\"message_type\":\"error\",\"error\":{},\"during\":\"scan\",\"item\":\"/host_pods/f2ea9531-ddc7-40ee-a80a-a5a2f8373b0f/volumes/kubernetes.io~csi/pvc-1fd8dceb-52bf-40a5-b400-12119e69fc0a/mount/raft\"}\n{\"message_type\":\"error\",\"error\":{},\"during\":\"archival\",\"item\":\"lost+found\"}\n{\"message_type\":\"error\",\"error\":{\"Op\":\"open\",\"Path\":\"node-id\",\"Err\":13},\"during\":\"archival\",\"item\":\"/host_pods/f2ea9531-ddc7-40ee-a80a-a5a2f8373b0f/volumes/kubernetes.io~csi/pvc-1fd8dceb-52bf-40a5-b400-12119e69fc0a/mount/node-id\"}\n{\"message_type\":\"error\",\"error\":{},\"during\":\"archival\",\"item\":\"raft\"}\n{\"message_type\":\"error\",\"error\":{\"Op\":\"open\",\"Path\":\"vault.db\",\"Err\":13},\"during\":\"archival\",\"item\":\"/host_pods/f2ea9531-ddc7-40ee-a80a-a5a2f8373b0f/volumes/kubernetes.io~csi/pvc-1fd8dceb-52bf-40a5-b400-12119e69fc0a/mount/vault.db\"}\nWarning: at least one source file could not be read\n: exit status 3" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/restic/backupper.go:199" error.function="github.com/vmware-tanzu/velero/pkg/restic.(*backupper).BackupPodVolumes" logSource="pkg/backup/backup.go:417" name=vault-0

Inconsistent state between the BackupStorageLocation and the object store that Velero uses; we resolved this by manually deleting the default restic repository and restarting Velero so that it would be recreated:

running Restic backup, stderr=Fatal: invalid id "7278389d": no matching ID found for prefix "7278389d"
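
Each of these failure signatures is greppable, so they could be caught today with textAnalyze against the collected Velero pod logs. A sketch that would slot into a SupportBundle spec; the fileName glob and the messages are assumptions, not a final design:

analyzers:
  - textAnalyze:
      checkName: Velero backup location permissions
      fileName: cluster-resources/pods/logs/velero/*/velero.log
      regex: 'permission denied'
      outcomes:
        - fail:
            when: "true"
            message: Velero cannot read or write its restic repository; check filesystem permissions on the backup storage location.
        - pass:
            when: "false"
            message: No backup location permission errors found.
  - textAnalyze:
      checkName: Unreadable source files during backup
      fileName: cluster-resources/pods/logs/velero/*/velero.log
      regex: 'at least one source file could not be read'
      outcomes:
        - fail:
            when: "true"
            message: Files changed or were removed during a pod volume backup; an application-level backup may be needed for this workload.
        - pass:
            when: "false"
            message: No unreadable source files reported during backups.
  - textAnalyze:
      checkName: Restic repository consistency
      fileName: cluster-resources/pods/logs/velero/*/velero.log
      regex: 'no matching ID found for prefix'
      outcomes:
        - fail:
            when: "true"
            message: The restic repository and the object store are out of sync; recreating the default restic repository may be required.
        - pass:
            when: "false"
            message: No restic repository consistency errors found.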

Some other ideas:

banjoh commented 1 year ago

@adamancini velero has a velero debug command which generates a velero debug bundle. You might want to check it out and see whether it collects anything we don't collect yet. It's a command run on the host, so I think it can only be run as a host collector, unless we add velero as a dependency (last option IMO).

The codebase is https://github.com/vmware-tanzu/velero/blob/main/pkg/cmd/cli/debug/debug.go
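
If shelling out is acceptable, that could be wired up today; a sketch assuming the run host collector and a velero binary on the host. Note that velero debug writes its own tarball to disk, so this run would only capture the command's stdout/stderr:

apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
  name: velero-debug
spec:
  hostCollectors:
    - run:
        collectorName: velero-debug
        command: velero
        # the namespace is assumed to be where Velero is installed
        args: ["debug", "--namespace", "velero"]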

diamonwiggins commented 1 year ago

Recently discovered the following while working a support issue:

Error getting volume snapshotter for volume snapshot location

When this error is thrown, it means that the particular volume won't be backed up, likely due to a plugin issue for a particular storage provider. The full error was:

time="2023-08-23T03:10:38Z" level=error msg="Error getting volume snapshotter for volume snapshot location" backup=velero/my-backup-22-08-6 error="rpc error: code = Unknown desc = faile to get address for maya-apiserver/cvc-server service" error.file="/home/travis/gopath/src/github.com/openebs/velero-plugin/pkg/cstor/cstor.go:233" error.function="github.com/openebs/velero-plugin/pkg/cstor.(*Plugin).Init" logSource="pkg/backup/item_backupper.go:524" name=pvc-3e3ada5e-2361-48c7-bcd6-366b698c6207 namespace= persistentVolume=pvc-3e3ada5e-2361-48c7-bcd6-366b698c6207 resource=persistentvolumes volumeSnapshotLocation=local-default

The OpenEBS plugin, which only supports cStor and not LocalPV, was being used instead of a filesystem backup.
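
This signature is also a straightforward grep; a textAnalyze sketch under the same log path assumptions as above:

analyzers:
  - textAnalyze:
      checkName: Volume snapshotter plugin
      fileName: cluster-resources/pods/logs/velero/*/velero.log
      regex: 'Error getting volume snapshotter for volume snapshot location'
      outcomes:
        - fail:
            when: "true"
            message: A volume snapshot location failed to initialize, so the affected volumes are likely not being backed up; verify the storage provider plugin configuration.
        - pass:
            when: "false"
            message: No volume snapshotter initialization errors found.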

adamancini commented 1 year ago

ada@ada-kurl:~/support-bundle-2023-09-19T17_40_36$ tree -L 4
.
├── analysis.json
├── cluster-resources
│   └── pods
│       └── logs
│           └── velero
├── execution-data
│   └── summary.txt
├── velero
│   ├── backuprepositories
│   │   ├── default-default-restic-6bwck.yaml
│   │   └── kurl-default-restic-2lfp4.yaml
│   ├── backups
│   │   ├── annarchy-mfvpt.yaml
│   │   ├── instance-f2m6f.yaml
│   │   └── instance-g9ccf.yaml
│   ├── backupstoragelocations
│   │   └── default.yaml
│   ├── describe-backups-errors.json
│   ├── describe-backups-stderr.txt
│   ├── describe-restores-errors.json
│   ├── describe-restores-stderr.txt
│   ├── get-backups.yaml
│   ├── get-restores.yaml
│   ├── logs
│   │   ├── node-agent-j4zvz
│   │   │   └── node-agent.log -> ../../../cluster-resources/pods/logs/velero/node-agent-j4zvz/node-agent.log
│   │   └── velero-787c5b44b9-8vzth
│   │       ├── replicated-kurl-util.log -> ../../../cluster-resources/pods/logs/velero/velero-787c5b44b9-8vzth/replicated-kurl-util.log
│   │       ├── replicated-local-volume-provider.log -> ../../../cluster-resources/pods/logs/velero/velero-787c5b44b9-8vzth/replicated-local-volume-provider.log
│   │       ├── velero.log -> ../../../cluster-resources/pods/logs/velero/velero-787c5b44b9-8vzth/velero.log
│   │       ├── velero-velero-plugin-for-aws.log -> ../../../cluster-resources/pods/logs/velero/velero-787c5b44b9-8vzth/velero-velero-plugin-for-aws.log
│   │       ├── velero-velero-plugin-for-gcp.log -> ../../../cluster-resources/pods/logs/velero/velero-787c5b44b9-8vzth/velero-velero-plugin-for-gcp.log
│   │       └── velero-velero-plugin-for-microsoft-azure.log -> ../../../cluster-resources/pods/logs/velero/velero-787c5b44b9-8vzth/velero-velero-plugin-for-microsoft-azure.log
│   ├── podvolumebackups
│   │   ├── annarchy-mfvpt-tclgv.yaml
│   │   ├── instance-f2m6f-2rdkx.yaml
│   │   ├── instance-f2m6f-5tgvb.yaml
│   │   ├── instance-f2m6f-xf6z9.yaml
│   │   ├── instance-f2m6f-xxh6m.yaml
│   │   ├── instance-g9ccf-qpfn2.yaml
│   │   ├── instance-g9ccf-qt9wh.yaml
│   │   └── instance-g9ccf-w6mgw.yaml
│   ├── podvolumerestores
│   │   └── annarchy-mfvpt-h5f2c.yaml
│   └── restores
│       └── annarchy-mfvpt.yaml
└── version.yaml
adamancini commented 1 year ago

Installing an older Velero (1.9.x) to check custom resource differences; note that 1.9.x still ships resticrepositories, which newer Velero releases renamed to backuprepositories:

ada@ada-velero-collector:~$ kubectl api-resources | grep velero
backups                                        velero.io/v1                           true         Backup
backupstoragelocations            bsl          velero.io/v1                           true         BackupStorageLocation
deletebackuprequests                           velero.io/v1                           true         DeleteBackupRequest
downloadrequests                               velero.io/v1                           true         DownloadRequest
podvolumebackups                               velero.io/v1                           true         PodVolumeBackup
podvolumerestores                              velero.io/v1                           true         PodVolumeRestore
resticrepositories                             velero.io/v1                           true         ResticRepository
restores                                       velero.io/v1                           true         Restore
schedules                                      velero.io/v1                           true         Schedule
serverstatusrequests              ssr          velero.io/v1                           true         ServerStatusRequest
volumesnapshotlocations                        velero.io/v1                           true         VolumeSnapshotLocation
arcolife commented 1 year ago

Analyzer work: https://github.com/replicatedhq/troubleshoot/pull/1366