vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0
8.63k stars 1.39k forks source link

Random restic failures #2963

Open foosinn opened 4 years ago

foosinn commented 4 years ago

What steps did you take and what happened:

We have setup Velero 1.4 in our Openshift Cluster. The majority of our Backups work fine. Sometimes the restic backup fails taking almost no time. This happens randomly over all our PVCs.

What did you expect to happen:

Consistent backup results

The output of the following commands will help us better understand what's going on:

Velero Backup Logs https://paste.kube.f2o.io/raw/F6ko42DoVWLMo6LjTqUog

Anything else you would like to add:

Is there a way to see the full output of the restic command?

Environment:

Vote on this issue!

This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

nrb commented 4 years ago

More detailed output for the restic logs would be in the pods on the node that the restic job ran on.

Velero's implementation of restic support is in a DaemonSet, with pods on each node so that the restic pods can access all workload volumes on that node. Right now, the logs for each of those pods is on the node, and not aggregated. So finding the node is a matter of looking for the failed PodVolumeBackups and the relevant pod, and inspecting the pod for it's node name, then finding the restic pod on that node and looking at the kubectl logs <restic pod name> -n velero.

foosinn commented 4 years ago

I was unable to find any hints why the backup failed in the output. Any idears?

Thanks a lot for your help!

Restic pod log:

time="2020-09-26T22:08:25Z" level=error msg="Error running command=restic backup --repo=***HIDDEN*** --password-file=/tmp/velero-restic-credentials-ot-richcoon666888759 --cache-dir=/scratch/.cache/restic . --tag=ns=ot-richcoon --tag=pod=ot-richcoon-mariadb-0 --tag=pod-uid=6c04f2a8-fc15-11ea-a0e6-9cdc71c0dd8e --tag=pvc-uid=8e0fd405-adf2-11e9-bacf-9cdc71c0df4c --tag=volume=data --tag=backup=daily-20200926220020 --tag=backup-uid=ab6e46ab-0043-11eb-a0e6-9cdc71c0dd8e --host=velero --json --parent=66d571c2, stdout={\"message_type\":\"status\",\"percent_done\":0,\"total_files\":1,\"total_bytes\":2752740}\n{\"message_type\":\"status\",\"percent_done\":0.029610298414193936,\"total_files\":101,\"files_done\":4,\"total_bytes\":129750229,\"bytes_done\":3841943,\"current_files\":[\"/data/ib_logfile0\",\"/data/ib_logfile1\"]}\n{\"message_type\":\"status\",\"percent_done\":0.07830972577420471,\"total_files\":129,\"files_done\":4,\"total_bytes\":153017443,\"bytes_done\":11982754,\"current_files\":[\"/data/ib_logfile0\",\"/data/ib_logfile1\"]}\n{\"message_type\":\"status\",\"seconds_elapsed\":1,\"seconds_remaining\":8,\"percent_done\":0.1062338167551264,\"total_files\":129,\"files_done\":4,\"total_bytes\":153017443,\"bytes_done\":16255627,\"current_files\":[\"/data/ib_logfile0\",\"/data/ib_logfile1\"]}\n{\"message_type\":\"status\",\"seconds_elapsed\":1,\"seconds_remaining\":8,\"percent_done\":0.15110612585520725,\"total_files\":129,\"files_done\":4,\"total_bytes\":153017443,\"bytes_done\":23121873,\"current_files\":[\"/data/ib_logfile0\",\"/data/ib_logfile1\"]}\n{\"message_type\":\"status\",\"seconds_elapsed\":1,\"seconds_remaining\":8,\"percent_done\":0.172560523050957,\"total_files\":129,\"files_done\":4,\"total_bytes\":153017443,\"bytes_done\":26404770,\"current_files\":[\"/data/ib_logfile0\",\"/data/ib_logfile1\"]}\n{\"message_type\":\"status\",\"seconds_elapsed\":1,\"seconds_remaining\":8,\"percent_done\":0.2800502031653999,\"total_files\":129,\"files_done\":4,\"total_bytes\":153017443,\"bytes_done\":42852566,\"current_files\":[\"/data/ib_logfile0\",\"/data/ib_logfile1\"]}\n{\"message_type\":\"status\",\"seconds_elapsed\":1,\"seconds_remaining\":8,\"percent_done\":0.30185872338750297,\"total_files\":129,\"files_done\":4,\"total_bytes\":153017443,\"bytes_done\":46189650,\"current_files\":[\"/data/ib_logfile0\",\"/data/ib_logfile1\"]}\n{\"message_type\":\"status\",\"seconds_elapsed\":1,\"seconds_remaining\":8,\"percent_done\":0.3775959581287736,\"total_files\":129,\"files_done\":4,\"total_bytes\":153017443,\"bytes_done\":57778768,\"current_files\":[\"/data/ib_logfile0\",\"/data/ib_logfile1\"]}\n{\"message_type\":\"status\",\"seconds_elapsed\":1,\"seconds_remaining\":8,\"percent_done\":0.41558080407865655,\"total_files\":129,\"files_done\":4,\"total_bytes\":153017443,\"bytes_done\":63591112,\"current_files\":[\"/data/ib_logfile0\",\"/data/ib_logfile1\"]}\n{\"message_type\":\"status\",\"seconds_elapsed\":1,\"seconds_remaining\":8,\"percent_done\":0.4479426897755702,\"total_files\":129,\"files_done\":4,\"total_bytes\":153017443,\"bytes_done\":68543045,\"current_files\":[\"/data/ib_logfile0\",\"/data/ib_logfile1\"]}\n{\"message_type\":\"status\",\"seconds_elapsed\":1,\"seconds_remaining\":8,\"percent_done\":0.5027193860506478,\"total_files\":129,\"files_done\":4,\"total_bytes\":153017443,\"bytes_done\":76924835,\"current_files\":[\"/data/ib_logfile0\",\"/data/ib_logfile1\"]}\n{\"message_type\":\"status\",\"seconds_elapsed\":1,\"seconds_remaining\":8,\"percent_done\":0.5354016469873961,\"total_files\":129,\"files_done\":4,\"total_bytes\":153017443,\"bytes_done\":81925791,\"current_files\":[\"/data/ib_logfile0\",\"/data/ib_logfile1\"]}\n{\"message_type\":\"status\",\"seconds_elapsed\":1,\"seconds_remaining\":8,\"percent_done\":0.5838848581465317,\"total_files\":129,\"files_done\":4,\"total_bytes\":153017443,\"bytes_done\":89344568,\"current_files\":[\"/data/ib_logfile0\",\"/data/ib_logfile1\"]}\n{\"message_type\":\"status\",\"seconds_elapsed\":1,\"seconds_remaining\":8,\"percent_done\":0.6193810139671462,\"total_files\":129,\"files_done\":4,\"total_bytes\":153017443,\"bytes_done\":94776099,\"current_files\":[\"/data/ib_logfile0\",\"/data/ib_logfile1\"]}\n{\"message_type\":\"status\",\"seconds_elapsed\":1,\"seconds_remaining\":8,\"percent_done\":0.673913692310229,\"total_files\":129,\"files_done\":4,\"total_bytes\":153017443,\"bytes_done\":103120550,\"current_files\":[\"/data/ib_logfile0\",\"/data/ib_logfile1\"]}\n{\"message_type\":\"status\",\"seconds_elapsed\":1,\"seconds_remaining\":8,\"percent_done\":0.6759880963374875,\"total_files\":129,\"files_done\":5,\"total_bytes\":153017443,\"bytes_done\":103437970,\"current_files\":[\"/data/ib_logfile0\",\"/data/ib_logfile1\"]}\n{\"message_type\":\"status\",\"seconds_elapsed\":1,\"seconds_remaining\":8,\"percent_done\":0.8421503553683092,\"total_files\":129,\"files_done\":36,\"total_bytes\":153017443,\"bytes_done\":128863694,\"current_files\":[\"/data/ib_logfile0\",\"/data/ibdata1\"]}\n{\"message_type\":\"status\",\"seconds_elapsed\":2,\"percent_done\":0.8451642862702914,\"total_files\":129,\"files_done\":47,\"total_bytes\":153017443,\"bytes_done\":129324878,\"current_files\":[\"/data/ib_logfile0\",\"/data/ibdata1\"]}\n{\"message_type\":\"status\",\"seconds_elapsed\":2,\"percent_done\":0.8451642862702914,\"total_files\":129,\"files_done\":48,\"total_bytes\":153017443,\"bytes_done\":129324878,\"current_files\":[\"/data/ib_logfile0\",\"/data/ibdata1\"]}\n{\"message_type\":\"status\",\"seconds_elapsed\":2,\"percent_done\":0.8489080947457736,\"total_files\":129,\"files_done\":103,\"total_bytes\":153017443,\"bytes_done\":129897746,\"current_files\":[\"/data/ibdata1\"]}\n{\"message_type\":\"status\",\"seconds_elapsed\":3,\"percent_done\":0.9927619166920728,\"total_files\":129,\"files_done\":125,\"total_bytes\":153017443,\"bytes_done\":151909890,\"current_files\":[\"/data/ibdata1\",\"/data/richcoon/sessions.ibd\"]}\n{\"message_type\":\"status\",\"seconds_elapsed\":3,\"percent_done\":0.9927619166920728,\"total_files\":129,\"files_done\":126,\"total_bytes\":153017443,\"bytes_done\":151909890,\"current_files\":[\"/data/ibdata1\",\"/data/richcoon/sessions.ibd\"]}\n{\"message_type\":\"status\",\"seconds_elapsed\":3,\"percent_done\":1,\"total_files\":129,\"files_done\":129,\"total_bytes\":153017443,\"bytes_done\":153017443,\"current_files\":[\"/full_dump.gz\"]}\n{\"message_type\":\"status\",\"seconds_elapsed\":4,\"percent_done\":1,\"total_files\":129,\"files_done\":129,\"total_bytes\":153017443,\"bytes_done\":153017443}\n{\"message_type\":\"status\",\"seconds_elapsed\":5,\"percent_done\":1,\"total_files\":129,\"files_done\":129,\"total_bytes\":153017443,\"bytes_done\":153017443}\n{\"message_type\":\"status\",\"second_new\":0,\"files_changed\":7,\"filesotal_files\":129,\"files_done\":129,\"total_bytes\":153017443,\"bytes_done\":153017443}\ncurrent_files\":[\"/full_dump.gz\"]}\n{\"message_type\":\"status\",\"second_new\":0,\"files_changed\":7,\"filesotal_files\":129,\"files_done\":129,\"total_bytes\":153017443,\"bytes_done\":153017443}\ncurrent_files\":[\"/full_dump.gz\"]}\n{\"message_type\":\"status\",\"seconds_elapsed\":4,\"percent_done\":1,\"total_files\":129,\"files_done\":129,\"total_bytes\":153017443,\"{\"message_type\":\"status\",\"seconds_elapsed\":5,\"percent_done\":1,\"total_files\":129,\"files_done\":129,\"total_bytes\":153017443,\"bytes_done\":153017443}\n{\"message_type\":\"status\",\"seconds_elapsed\":4,\"percent_done\":1,\"total_files\":129,\"files_done\":129,\"total_bytes\":153017443,\"{\"message_type\":\"status\",\"seconds_elapsed\":5,\"percent_done\":1,\"total_files\":129,\"files_done\":129,\"total_bytes\":153017443,\"bytes_done\":153017443}\n, stderr=" backup=velero/daily-20200926220020 controller=pod-volume-backup error="unable to find summary in restic backup command output" error.file="/github.com/vmware-tanzu/velero/pkg/restic/exec_commands.go:175" error.function=github.com/vmware-tanzu/velero/pkg/restic.getSummaryLine logSource="pkg/controller/pod_volume_backup_controller.go:283" name=daily-20200926220020-nwgvb namespace=velero

Pretty stdout:

{
  "message_type": "status",
  "percent_done": 0,
  "total_files": 1,
  "total_bytes": 2752740
}
{
  "message_type": "status",
  "percent_done": 0.029610298414193936,
  "total_files": 101,
  "files_done": 4,
  "total_bytes": 129750229,
  "bytes_done": 3841943,
  "current_files": [
    "/data/ib_logfile0",
    "/data/ib_logfile1"
  ]
}
{
  "message_type": "status",
  "percent_done": 0.07830972577420471,
  "total_files": 129,
  "files_done": 4,
  "total_bytes": 153017443,
  "bytes_done": 11982754,
  "current_files": [
    "/data/ib_logfile0",
    "/data/ib_logfile1"
  ]
}
{
  "message_type": "status",
  "seconds_elapsed": 1,
  "seconds_remaining": 8,
  "percent_done": 0.1062338167551264,
  "total_files": 129,
  "files_done": 4,
  "total_bytes": 153017443,
  "bytes_done": 16255627,
  "current_files": [
    "/data/ib_logfile0",
    "/data/ib_logfile1"
  ]
}
{
  "message_type": "status",
  "seconds_elapsed": 1,
  "seconds_remaining": 8,
  "percent_done": 0.15110612585520725,
  "total_files": 129,
  "files_done": 4,
  "total_bytes": 153017443,
  "bytes_done": 23121873,
  "current_files": [
    "/data/ib_logfile0",
    "/data/ib_logfile1"
  ]
}
{
  "message_type": "status",
  "seconds_elapsed": 1,
  "seconds_remaining": 8,
  "percent_done": 0.172560523050957,
  "total_files": 129,
  "files_done": 4,
  "total_bytes": 153017443,
  "bytes_done": 26404770,
  "current_files": [
    "/data/ib_logfile0",
    "/data/ib_logfile1"
  ]
}
{
  "message_type": "status",
  "seconds_elapsed": 1,
  "seconds_remaining": 8,
  "percent_done": 0.2800502031653999,
  "total_files": 129,
  "files_done": 4,
  "total_bytes": 153017443,
  "bytes_done": 42852566,
  "current_files": [
    "/data/ib_logfile0",
    "/data/ib_logfile1"
  ]
}
{
  "message_type": "status",
  "seconds_elapsed": 1,
  "seconds_remaining": 8,
  "percent_done": 0.30185872338750297,
  "total_files": 129,
  "files_done": 4,
  "total_bytes": 153017443,
  "bytes_done": 46189650,
  "current_files": [
    "/data/ib_logfile0",
    "/data/ib_logfile1"
  ]
}
{
  "message_type": "status",
  "seconds_elapsed": 1,
  "seconds_remaining": 8,
  "percent_done": 0.3775959581287736,
  "total_files": 129,
  "files_done": 4,
  "total_bytes": 153017443,
  "bytes_done": 57778768,
  "current_files": [
    "/data/ib_logfile0",
    "/data/ib_logfile1"
  ]
}
{
  "message_type": "status",
  "seconds_elapsed": 1,
  "seconds_remaining": 8,
  "percent_done": 0.41558080407865655,
  "total_files": 129,
  "files_done": 4,
  "total_bytes": 153017443,
  "bytes_done": 63591112,
  "current_files": [
    "/data/ib_logfile0",
    "/data/ib_logfile1"
  ]
}
{
  "message_type": "status",
  "seconds_elapsed": 1,
  "seconds_remaining": 8,
  "percent_done": 0.4479426897755702,
  "total_files": 129,
  "files_done": 4,
  "total_bytes": 153017443,
  "bytes_done": 68543045,
  "current_files": [
    "/data/ib_logfile0",
    "/data/ib_logfile1"
  ]
}
{
  "message_type": "status",
  "seconds_elapsed": 1,
  "seconds_remaining": 8,
  "percent_done": 0.5027193860506478,
  "total_files": 129,
  "files_done": 4,
  "total_bytes": 153017443,
  "bytes_done": 76924835,
  "current_files": [
    "/data/ib_logfile0",
    "/data/ib_logfile1"
  ]
}
{
  "message_type": "status",
  "seconds_elapsed": 1,
  "seconds_remaining": 8,
  "percent_done": 0.5354016469873961,
  "total_files": 129,
  "files_done": 4,
  "total_bytes": 153017443,
  "bytes_done": 81925791,
  "current_files": [
    "/data/ib_logfile0",
    "/data/ib_logfile1"
  ]
}
{
  "message_type": "status",
  "seconds_elapsed": 1,
  "seconds_remaining": 8,
  "percent_done": 0.5838848581465317,
  "total_files": 129,
  "files_done": 4,
  "total_bytes": 153017443,
  "bytes_done": 89344568,
  "current_files": [
    "/data/ib_logfile0",
    "/data/ib_logfile1"
  ]
}
{
  "message_type": "status",
  "seconds_elapsed": 1,
  "seconds_remaining": 8,
  "percent_done": 0.6193810139671462,
  "total_files": 129,
  "files_done": 4,
  "total_bytes": 153017443,
  "bytes_done": 94776099,
  "current_files": [
    "/data/ib_logfile0",
    "/data/ib_logfile1"
  ]
}
{
  "message_type": "status",
  "seconds_elapsedparse error: Invalid numeric literal at line 28, column 14
": 1,
  "seconds_remaining": 8,
  "percent_done": 0.673913692310229,
  "total_files": 129,
  "files_done": 4,
  "total_bytes": 153017443,
  "bytes_done": 103120550,
  "current_files": [
    "/data/ib_logfile0",
    "/data/ib_logfile1"
  ]
}
{
  "message_type": "status",
  "seconds_elapsed": 1,
  "seconds_remaining": 8,
  "percent_done": 0.6759880963374875,
  "total_files": 129,
  "files_done": 5,
  "total_bytes": 153017443,
  "bytes_done": 103437970,
  "current_files": [
    "/data/ib_logfile0",
    "/data/ib_logfile1"
  ]
}
{
  "message_type": "status",
  "seconds_elapsed": 1,
  "seconds_remaining": 8,
  "percent_done": 0.8421503553683092,
  "total_files": 129,
  "files_done": 36,
  "total_bytes": 153017443,
  "bytes_done": 128863694,
  "current_files": [
    "/data/ib_logfile0",
    "/data/ibdata1"
  ]
}
{
  "message_type": "status",
  "seconds_elapsed": 2,
  "percent_done": 0.8451642862702914,
  "total_files": 129,
  "files_done": 47,
  "total_bytes": 153017443,
  "bytes_done": 129324878,
  "current_files": [
    "/data/ib_logfile0",
    "/data/ibdata1"
  ]
}
{
  "message_type": "status",
  "seconds_elapsed": 2,
  "percent_done": 0.8451642862702914,
  "total_files": 129,
  "files_done": 48,
  "total_bytes": 153017443,
  "bytes_done": 129324878,
  "current_files": [
    "/data/ib_logfile0",
    "/data/ibdata1"
  ]
}
{
  "message_type": "status",
  "seconds_elapsed": 2,
  "percent_done": 0.8489080947457736,
  "total_files": 129,
  "files_done": 103,
  "total_bytes": 153017443,
  "bytes_done": 129897746,
  "current_files": [
    "/data/ibdata1"
  ]
}
{
  "message_type": "status",
  "seconds_elapsed": 3,
  "percent_done": 0.9927619166920728,
  "total_files": 129,
  "files_done": 125,
  "total_bytes": 153017443,
  "bytes_done": 151909890,
  "current_files": [
    "/data/ibdata1",
    "/data/richcoon/sessions.ibd"
  ]
}
{
  "message_type": "status",
  "seconds_elapsed": 3,
  "percent_done": 0.9927619166920728,
  "total_files": 129,
  "files_done": 126,
  "total_bytes": 153017443,
  "bytes_done": 151909890,
  "current_files": [
    "/data/ibdata1",
    "/data/richcoon/sessions.ibd"
  ]
}
{
  "message_type": "status",
  "seconds_elapsed": 3,
  "percent_done": 1,
  "total_files": 129,
  "files_done": 129,
  "total_bytes": 153017443,
  "bytes_done": 153017443,
  "current_files": [
    "/full_dump.gz"
  ]
}
{
  "message_type": "status",
  "seconds_elapsed": 4,
  "percent_done": 1,
  "total_files": 129,
  "files_done": 129,
  "total_bytes": 153017443,
  "bytes_done": 153017443
}
{
  "message_type": "status",
  "seconds_elapsed": 5,
  "percent_done": 1,
  "total_files": 129,
  "files_done": 129,
  "total_bytes": 153017443,
  "bytes_done": 153017443
}
{
  "message_type": "status",
  "second_new": 0,
  "files_changed": 7,
  "filesotal_files": 129,
  "files_done": 129,
  "total_bytes": 153017443,
  "bytes_done": 153017443
}