replicatedhq / troubleshoot

Preflight Checks and Support Bundles Framework for Kubernetes Applications
https://troubleshoot.sh
Apache License 2.0
544 stars, 93 forks

Possible improvements to Velero collector/analyzer #1500

Closed: xavpaice closed this issue 6 months ago

xavpaice commented 7 months ago

Describe the rationale for the suggested feature.

A number of the support issues raised with the Replicated team involve Velero. The information produced by the Velero analyzer is useful, but not quite complete.

I would like to review the collector/analyzer for Velero to identify the improvements that would most help us resolve support issues faster.

See https://github.com/vmware-tanzu/velero/issues/new?assignees=&labels=&projects=&template=bug_report.md for the kind of information that the Velero maintainers themselves ask for in bug reports.

If we are able to produce a useful support bundle and analysis, there's also an opportunity to discuss adding this to the Velero project as a diagnostic tool to help the maintainers.

First step:

Second step:

xavpaice commented 7 months ago

Velero already has a velero debug command, which collects a range of diagnostic information.

The definition of done here is to:

nvanthao commented 7 months ago

Information currently collected by the velero debug bundle:

[gerard@gerard-kurl ~]$ velero debug --backup instance-ggs98
2024/03/12 01:11:39 Collecting velero resources in namespace: velero
2024/03/12 01:11:40 Collecting velero deployment logs in namespace: velero
2024/03/12 01:11:40 Collecting log and information for backup: instance-ggs98
2024/03/12 01:11:41 Generated debug information bundle: /home/gerard/bundle-2024-03-12-01-11-39.tar.gz
[gerard@gerard-kurl ~]$ tar -tzf /home/gerard/bundle-2024-03-12-01-11-39.tar.gz
velero-bundle
velero-bundle/backup_describe_instance-ggs98.txt
velero-bundle/backup_instance-ggs98.log
velero-bundle/kubecapture
velero-bundle/kubecapture/core_v1
velero-bundle/kubecapture/core_v1/velero
velero-bundle/kubecapture/core_v1/velero/node-agent-5k98b
velero-bundle/kubecapture/core_v1/velero/node-agent-5k98b/node-agent
velero-bundle/kubecapture/core_v1/velero/node-agent-5k98b/node-agent/node-agent.log
velero-bundle/kubecapture/core_v1/velero/pods-202403120111.6465.json
velero-bundle/kubecapture/core_v1/velero/velero-854f967b7f-btw9q
velero-bundle/kubecapture/core_v1/velero/velero-854f967b7f-btw9q/replicated-kurl-util
velero-bundle/kubecapture/core_v1/velero/velero-854f967b7f-btw9q/replicated-kurl-util/replicated-kurl-util.log
velero-bundle/kubecapture/core_v1/velero/velero-854f967b7f-btw9q/replicated-local-volume-provider
velero-bundle/kubecapture/core_v1/velero/velero-854f967b7f-btw9q/replicated-local-volume-provider/replicated-local-volume-provider.log
velero-bundle/kubecapture/core_v1/velero/velero-854f967b7f-btw9q/velero
velero-bundle/kubecapture/core_v1/velero/velero-854f967b7f-btw9q/velero/velero.log
velero-bundle/kubecapture/core_v1/velero/velero-854f967b7f-btw9q/velero-velero-plugin-for-aws
velero-bundle/kubecapture/core_v1/velero/velero-854f967b7f-btw9q/velero-velero-plugin-for-aws/velero-velero-plugin-for-aws.log
velero-bundle/kubecapture/core_v1/velero/velero-854f967b7f-btw9q/velero-velero-plugin-for-gcp
velero-bundle/kubecapture/core_v1/velero/velero-854f967b7f-btw9q/velero-velero-plugin-for-gcp/velero-velero-plugin-for-gcp.log
velero-bundle/kubecapture/core_v1/velero/velero-854f967b7f-btw9q/velero-velero-plugin-for-microsoft-azure
velero-bundle/kubecapture/core_v1/velero/velero-854f967b7f-btw9q/velero-velero-plugin-for-microsoft-azure/velero-velero-plugin-for-microsoft-azure.log
velero-bundle/kubecapture/velero.io_v1
velero-bundle/kubecapture/velero.io_v1/velero
velero-bundle/kubecapture/velero.io_v1/velero/backuprepositories-202403120111.2620.json
velero-bundle/kubecapture/velero.io_v1/velero/backups-202403120111.2612.json
velero-bundle/kubecapture/velero.io_v1/velero/backupstoragelocations-202403120111.2617.json
velero-bundle/kubecapture/velero.io_v1/velero/podvolumebackups-202403120111.2621.json
velero-bundle/kubecapture/velero.io_v1/velero/serverstatusrequests-202403120111.2623.json
velero-bundle/version.txt

Data collected are:

This data is sufficient to troubleshoot issues related to Velero backup and restore of snapshots.
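
To make the tarball listing above easier to scan, the paths can be grouped by what they appear to contain. The Python sketch below is illustrative only: the classify helper and the category names are assumptions made for this example, not part of Velero or Troubleshoot, and the rules are inferred solely from the file names in the listing.

```python
from collections import defaultdict

def classify(path: str) -> str:
    """Map a path from the `velero debug` tarball to a coarse category.

    Categories are hypothetical labels chosen for this sketch.
    """
    name = path.rsplit("/", 1)[-1]
    if name.startswith("backup_describe_"):
        return "backup description"
    if name.startswith("backup_") and name.endswith(".log"):
        return "backup log"
    if name == "version.txt":
        return "version"
    if "/kubecapture/velero.io_v1/" in path and name.endswith(".json"):
        return "velero custom resources"
    if "/kubecapture/core_v1/" in path and name.endswith(".json"):
        return "core resources"
    if "/kubecapture/core_v1/" in path and name.endswith(".log"):
        return "container log"
    return "other"  # directories and anything not recognized

def summarize(paths):
    """Group paths by category, preserving input order within each group."""
    groups = defaultdict(list)
    for p in paths:
        groups[classify(p)].append(p)
    return dict(groups)

# A few paths copied from the listing above.
SAMPLE = [
    "velero-bundle/backup_describe_instance-ggs98.txt",
    "velero-bundle/backup_instance-ggs98.log",
    "velero-bundle/kubecapture/core_v1/velero/node-agent-5k98b/node-agent/node-agent.log",
    "velero-bundle/kubecapture/core_v1/velero/pods-202403120111.6465.json",
    "velero-bundle/kubecapture/velero.io_v1/velero/backups-202403120111.2612.json",
    "velero-bundle/version.txt",
]

if __name__ == "__main__":
    for category, members in summarize(SAMPLE).items():
        print(f"{category}: {len(members)} file(s)")
```

In practice the input could be the output of tar -tzf, one path per line, passed to summarize.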

nvanthao commented 7 months ago

Notes on the current Velero analyzer in Troubleshoot

Check PASS
Title: At least 1 Backup Repository configured
Message: Found 1 backup repositories configured and 1 Ready

------------
Check PASS
Title: Velero Logs analysis for kind [node-agent*]
Message: Found 1 log files

------------
Check WARN
Title: Velero logs for pod [/tmp/supportbundle3307708783/support-bundle-2024-03-12T05_13_27/velero/logs/velero-854f967b7f-btw9q/velero.log]
Message: Found error|panic|fatal in velero* pod log file(s)

------------
Check PASS
Title: Velero Logs analysis for kind [velero*]
Message: Found 6 log files

------------
Check PASS
Title: Velero Backups
Message: Found 2 backups

------------
Check PASS
Title: At least 1 Backup Storage Location configured
Message: Found 1 backup storage locations configured and 1 Available

------------
Check PASS
Title: Pod Volume Backups
Message: Found 1 pod volume backups

------------
Check PASS
Title: Velero Status
Message: Velero setup is healthy

------------
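
The WARN result above reports a match for error|panic|fatal in a velero pod log. The Python sketch below shows what a check of that shape might look like. It is not the Troubleshoot implementation: the analyze_log function, the result dictionary, the PASS message, and the sample log lines are all assumptions invented for illustration, and the pattern is applied case-sensitively only because that is the simplest reading of the message.

```python
import re

# Pattern taken from the analyzer message shown above.
PATTERN = re.compile(r"error|panic|fatal")

def analyze_log(name, lines):
    """Return a PASS/WARN result for one log file (hypothetical result shape)."""
    matches = [line for line in lines if PATTERN.search(line)]
    if matches:
        return {
            "check": "WARN",
            "title": f"Velero logs for pod [{name}]",
            "message": "Found error|panic|fatal in velero* pod log file(s)",
            "matches": matches,
        }
    return {
        "check": "PASS",
        "title": f"Velero logs for pod [{name}]",
        # Assumed wording; the real analyzer's PASS message is not shown in this thread.
        "message": "No error|panic|fatal found",
        "matches": [],
    }

if __name__ == "__main__":
    # Made-up sample lines, not taken from a real Velero log.
    sample = [
        'level=info msg="example: backup completed"',
        'level=error msg="example: something failed"',
    ]
    result = analyze_log("velero.log", sample)
    print(result["check"], "-", result["message"])
    for line in result["matches"]:
        print("  ", line)
```

Because a plain pattern match also flags lines that merely contain the word "error" in a benign message, a WARN from a check like this may still need manual review. Returning the matching lines alongside the outcome is one way to make that review quicker.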
nvanthao commented 7 months ago

Velero troubleshooting doc https://github.com/vmware-tanzu/velero/blob/main/site/content/docs/main/troubleshooting.md

Replicated troubleshooting doc https://docs.replicated.com/enterprise/snapshots-troubleshooting-backup-restore