Closed rdodev closed 3 years ago
Example error only seen in ark server pod log:
time="2018-02-14T00:50:28Z" level=error msg="backup failed" error="rpc error: code = Unknown desc = error putting object mybackup/ark-backup.json: AccessDenied: Access Denied\n\tstatus code: 403, request id: 096E12AAC407F5B7, host id: ..../....=" key=heptio-ark/mybackup logSource="pkg/controller/backup_controller.go:258"
Have just set up ark and am getting this, using AWS and the nginx example (non PV). Have I overlooked troubleshooting steps?
ark backup logs <backup name>
@donbecker if you wouldn't mind, please join us in #ark-dr on the Kubernetes slack for real-time troubleshooting, or create a new issue. This issue is an RFE to provide more details to the user why a backup failed. Thanks!
This is super important from a usability perspective. While the ark server log can provide debugging information, it might be hard to hunt down, especially because users creating backups may not have access to the Ark server logs.
One possibility would be to leverage the status field in the Backup CRD to reflect error/failure details (may require k8s 1.11 - https://github.com/heptio/ark/issues/529), or alternatively to leverage the Kubernetes Events API to track backup or restore failure events. This is also likely related to backup progress: https://github.com/heptio/ark/issues/20
@rosskukulinski the status field is currently supported; with k8s 1.11 we gain the ability to use the /status subresource in the HTTP request.
Product question: What are the common errors/error states that we want to be able to resolve?
Sources that can help piece together what happened:
Restores (Related: #286)
When we are helping users debug Velero, we often ask for the output from describe as well as the logs for the backup/restore. The logs are usually not completely helpful unless the user can reproduce the failure after setting the log level to debug, which increases the amount of logging to sort through. And for backups stuck in "InProgress", or for more complicated cases, we have to dig through the output of the entire Velero log.
One alternative to make debugging easier and faster, and to potentially address this request:
We know at every step of the way what activity the backup/restore is performing. We could keep a running list of these "events" and add them to the describe output, the way Kubernetes does. Knowing where in the process the failure occurred could itself be a hint for how to fix the issue, but otherwise it is a great starting point for where in the logs to start looking.
FYI, events have a TTL and are automatically deleted after they expire. It's usually a pretty short amount of time - 1 to 2 hours by default, iirc. Also, each event is its own resource in etcd, so you would probably want to avoid having thousands of events for each backup/restore. Finally, the default client-side event broadcasting code that's in client-go has a "spam filter" to make sure that a single component + object target isn't overloading the system. The defaults are fairly low - I think it's something like 10 or 25 events in a minute. If you exceed the threshold, the events that you generated are silently dropped, and it's really confusing trying to figure out why they're disappearing into thin air.
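To illustrate why events can "disappear into thin air", here is a minimal sketch of a token-bucket spam filter of the kind described above. It is not client-go's code: the class name, the `burst` and `refill_seconds` parameters, and their values are assumptions chosen only for illustration.

```python
import time


class EventSpamFilter:
    """Hypothetical token bucket: allow `burst` events up front,
    then one more event every `refill_seconds` (illustrative values only)."""

    def __init__(self, burst=25, refill_seconds=300.0, clock=time.monotonic):
        self.burst = burst
        self.refill_seconds = refill_seconds
        self.clock = clock
        self.tokens = float(burst)
        self.last = clock()

    def allow(self):
        # Refill tokens for the time elapsed since the last call, capped at burst.
        now = self.clock()
        refill = (now - self.last) / self.refill_seconds
        self.tokens = min(float(self.burst), self.tokens + refill)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        # No token left: the caller gets no error, the event is simply not recorded.
        return False


def emit_events(count, spam_filter):
    """Try to emit `count` events; return (recorded, dropped) counts."""
    recorded = 0
    dropped = 0
    for _ in range(count):
        if spam_filter.allow():
            recorded += 1
        else:
            dropped += 1
    return recorded, dropped
```

With a filter like this, a backup that emits one event per resource would keep only the first `burst` events and lose the rest without any error, which matches the confusing behavior described above.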
Thanks for the explanation. Yes, I have noticed that events go away, but didn't know why, this is helpful.
Maybe events are not what we need; they sound like overkill. Trying again, w/o using a vocabulary that has any k8s meaning: we have a limited number of "steps" that happen from running "create" until the backup reaches its final phase of "Complete". I submit that it would be helpful to list these steps in the describe output. And I think they are not so many that it would be unreasonable. So, instead of "Events", we would have our own "Steps" section.
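A rough sketch of what such a "Steps" section could look like, assuming a small in-memory log that is rendered as part of the describe output. The `StepLog` class, the step names, and the output layout are all hypothetical and are not part of Ark/Velero. Only the error text is taken from the log line quoted earlier in this thread.

```python
class StepLog:
    """Hypothetical running list of backup steps and their outcomes."""

    def __init__(self):
        self.steps = []  # list of (step name, error message or None)

    def run(self, name, action):
        # Run one step and record whether it succeeded or failed.
        try:
            action()
        except Exception as exc:
            self.steps.append((name, str(exc)))
            raise
        self.steps.append((name, None))

    def failed_step(self):
        # Return the name of the first failed step, or None if all succeeded.
        for name, error in self.steps:
            if error is not None:
                return name
        return None

    def describe(self):
        # Render the steps the way a describe command might print them.
        lines = ["Steps:"]
        for name, error in self.steps:
            if error is None:
                lines.append(f"  [ok]     {name}")
            else:
                lines.append(f"  [failed] {name}: {error}")
        return "\n".join(lines)


def upload_backup():
    # Stand-in for the upload that failed in the example log above.
    raise RuntimeError(
        "error putting object mybackup/ark-backup.json: AccessDenied: Access Denied"
    )


def run_backup():
    """Run three made-up steps; stop at the first failure and return the log."""
    log = StepLog()
    try:
        log.run("validate backup spec", lambda: None)
        log.run("collect resources", lambda: None)
        log.run("upload backup to object storage", upload_backup)
    except RuntimeError:
        pass  # the failure is already recorded in the log
    return log


if __name__ == "__main__":
    print(run_backup().describe())
```

Because the list is bounded by the number of steps rather than the number of resources, it would avoid the event TTL and spam-filter concerns, and the failed step tells the user where in the logs to start looking.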
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Closing the stale issue.
Presently a backup can fail for a number of reasons. If the user runs ark backup describe they will see that the backup failed, but no reason, logs, or anything else that would help them understand the problem and how to fix it. Having a Logs section there, or else instructions for finding out what happened, would be greatly helpful.