vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

When a backup or restore fails, provide the user information or instructions to find out root cause #305

Closed · rdodev closed this issue 3 years ago

rdodev commented 6 years ago

Presently a backup can fail for a number of reasons. If the user runs ark backup describe, they will see that the backup failed, but no reason, logs, or anything else that would help them understand the problem and how to fix it:


[centos@ip ark]$ ./ark backup describe nginx-001
Name:         nginx-001
Namespace:    heptio-ark
Labels:       <none>
Annotations:  <none>

Namespaces:
  Included:  *
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  auto

Label selector:  <none>

Snapshot PVs:  auto

TTL:  720h0m0s

Hooks:  <none>

Phase:  Failed

Backup Format Version:  1

Expiration:  2018-03-09 19:50:47 +0000 UTC

Validation errors:  <none>

Persistent Volumes: <none included>

Having a Logs section there, or instructions for finding out what happened, would be very helpful.

ncdc commented 6 years ago

Example error only seen in ark server pod log:

time="2018-02-14T00:50:28Z" level=error msg="backup failed" error="rpc error: code = Unknown desc = error putting object mybackup/ark-backup.json: AccessDenied: Access Denied\n\tstatus code: 403, request id: 096E12AAC407F5B7, host id: ..../....=" key=heptio-ark/mybackup logSource="pkg/controller/backup_controller.go:258"
donbecker commented 6 years ago

I have just set up Ark and am getting this, using AWS and the nginx example (non-PV). Have I overlooked any troubleshooting steps?

donbecker commented 6 years ago

ark backup logs <backup name>

ncdc commented 6 years ago

@donbecker if you wouldn't mind, please join us in #ark-dr on the Kubernetes slack for real-time troubleshooting, or create a new issue. This issue is an RFE to provide more details to the user why a backup failed. Thanks!

rosskukulinski commented 6 years ago

This is super important from a usability perspective. While the ark server log can provide debugging information, it might be hard to hunt down, especially because users creating backups may not have access to the Ark server logs.

One possibility would be to leverage the status field in the Backup CRD to reflect error/failure details (may require k8s 1.11 - https://github.com/heptio/ark/issues/529), or alternatively leverage the Kubernetes Events API to track backup or restore failure events. This is also likely related to backup progress: https://github.com/heptio/ark/issues/20
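For illustration, a minimal Go sketch of what surfacing that on the status could look like. Phase and ValidationErrors already exist on BackupStatus; FailureReason is hypothetical:

type BackupPhase string

const BackupPhaseFailed BackupPhase = "Failed"

type BackupStatus struct {
	// Phase is the current state of the Backup (already exists today).
	Phase BackupPhase `json:"phase,omitempty"`
	// ValidationErrors already exists and is shown by describe.
	ValidationErrors []string `json:"validationErrors,omitempty"`
	// FailureReason (hypothetical) would carry a human-readable cause
	// when Phase is Failed, e.g. the AccessDenied message quoted above,
	// so describe can show it without access to the server log.
	FailureReason string `json:"failureReason,omitempty"`
}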

ncdc commented 6 years ago

@rosskukulinski the status field is currently supported; with k8s 1.11 we gain the ability to use the /status subresource in the http request.
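To make that concrete, a sketch of how a controller could record a failure once the /status subresource is enabled. Import paths follow the heptio/ark layout of that era, and FailureReason is the hypothetical field from the sketch above:

package controller

import (
	arkv1 "github.com/heptio/ark/pkg/apis/ark/v1"
	clientset "github.com/heptio/ark/pkg/generated/clientset/versioned"
)

// markBackupFailed records the failure cause on the Backup status via
// the /status subresource (requires Kubernetes 1.11+ with the
// subresource enabled on the CRD).
func markBackupFailed(client clientset.Interface, backup *arkv1.Backup, cause error) error {
	backup.Status.Phase = arkv1.BackupPhaseFailed
	backup.Status.FailureReason = cause.Error() // hypothetical field, see sketch above
	_, err := client.ArkV1().Backups(backup.Namespace).UpdateStatus(backup)
	return err
}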

rosskukulinski commented 6 years ago

Product question: what are the common errors/error states that we want to be able to resolve?

Sources that can help piece together what happened:

Restores (Related: #286)

carlisia commented 5 years ago

When we are helping users debug Velero, we often ask for the output from describe as well as the logs for the backup/restore. The logs are usually not completely helpful unless the user can reproduce the failure after setting the log level to debug, which increases the amount of logging to sort through. And for backups stuck in "InProgress", or for more complicated cases, we have to dig through the output of the entire Velero log.
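For reference, one way to turn on that debug logging, assuming the default velero namespace and a single server container (a sketch, verify against your install):

kubectl -n velero patch deployment velero --type json \
  -p '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--log-level=debug"}]'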

One alternative to make debugging easier and faster, and to potentially address this request:

We know at every step of the way what activity the backup/restore is performing. We could keep a running list of these "events" and add them to the describe output, the way Kubernetes does. Knowing where in the process the failure occurred could itself be a hint for how to fix the issue, but otherwise it is a great starting point for where in the logs to start looking.
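A minimal Go sketch of what that running list could look like; every name here is hypothetical:

package backup

import (
	"fmt"
	"sync"
	"time"
)

// step is one entry in the running list of backup/restore activities.
type step struct {
	at  time.Time
	msg string
}

// stepRecorder collects steps as the controller performs them so that
// describe can later print them in order.
type stepRecorder struct {
	mu    sync.Mutex
	steps []step
}

func (r *stepRecorder) record(format string, args ...interface{}) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.steps = append(r.steps, step{at: time.Now(), msg: fmt.Sprintf(format, args...)})
}

// list renders the recorded steps for the describe output.
func (r *stepRecorder) list() []string {
	r.mu.Lock()
	defer r.mu.Unlock()
	out := make([]string, 0, len(r.steps))
	for _, s := range r.steps {
		out = append(out, s.at.UTC().Format(time.RFC3339)+"  "+s.msg)
	}
	return out
}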

ncdc commented 5 years ago

FYI, events have a TTL and are automatically deleted after they expire. It's usually a pretty short amount of time: 1 to 2 hours by default, iirc. Also, each event is its own resource in etcd, so you would probably want to avoid having thousands of events for each backup/restore. Finally, the default client-side event broadcasting code in client-go has a "spam filter" to make sure that a single component + object target isn't overloading the system. The defaults are fairly low; I think it's something like 10 or 25 events in a minute. If you exceed the threshold, the events you generated are silently dropped, and it's really confusing trying to figure out why they're disappearing into thin air.
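For reference, this is the client-go plumbing being described; a sketch of the standard setup (the component name is illustrative):

package controller

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	typedcorev1 "k8s.io/client-go/kubernetes/typed/core/v1"
	"k8s.io/client-go/tools/record"
)

// newEventRecorder wires up the default broadcaster, which applies the
// spam filter mentioned above before events reach the API server.
func newEventRecorder(client kubernetes.Interface) record.EventRecorder {
	b := record.NewBroadcaster()
	b.StartRecordingToSink(&typedcorev1.EventSinkImpl{Interface: client.CoreV1().Events("")})
	return b.NewRecorder(scheme.Scheme, corev1.EventSource{Component: "ark-backup-controller"})
}

// Usage: recorder.Event(backup, corev1.EventTypeWarning, "BackupFailed", msg)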

carlisia commented 5 years ago

Thanks for the explanation. Yes, I have noticed that events go away, but didn't know why; this is helpful.

Maybe events are not what we need; it sounds like overkill. Trying again, without using vocabulary that has any k8s meaning: we have a limited number of "steps" that happen from running "create" until the backup reaches its final phase of "Complete". I submit that it would be helpful to list these steps in the describe output, and I think there are not so many that it would be unreasonable. So, instead of "Events", we would have our own "Steps" section.
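For example, the tail of the describe output might gain something like this (a mock, not actual output; the last line reuses the AccessDenied error quoted earlier):

Phase:  Failed

Steps:
  2018-02-14 00:50:01 +0000 UTC  resolving included/excluded resources
  2018-02-14 00:50:05 +0000 UTC  backing up resources in namespace nginx-example
  2018-02-14 00:50:27 +0000 UTC  uploading backup to object storage
  2018-02-14 00:50:28 +0000 UTC  failed: error putting object mybackup/ark-backup.json: AccessDenied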

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 3 years ago

Closing the stale issue.