adawolfs closed this issue 4 years ago
Hi,
Reading through the v1.2.0 changelog I found that you added more verbose logs from restic, so after following the guide to update and running the backup again, I noticed it ends with `unable to find summary in restic backup command output`, and indeed the `message_type` summary is missing from the output (it doesn't even seem to finish).
Here are the last output messages:
{\"message_type\":\"status\",\"seconds_elapsed\":121,\"seconds_remaining\":253,\"percent_done\":0.32345830900368333,\"total_files\":17011,\"files_done\":7182,\"total_bytes\":56567060801,\"bytes_done\":18297085832,\"current_files\":[\"/liferay-data/document_library/20157/11701345/68/92/68928.afsh/68928_1.0.afsh\",\"/liferay-data/document_library/20157/11701345/68/92/68929.afsh/68929_1.0.afsh\"]}\n{\"message_type\":\"status\",\"seconds_elapsed\":121,\"seconds_remaining\":253,\"percent_done\":0.3235266029179383,\"total_files\":17011,\"files_done\":7182,\"total_bytes\":56567060801,\"bytes_done\":18300949018,\"current_files\":[\"/liferay-data/document_library/20157/11701345/68/92/68928.afsh/68928_1.0.afsh\",\"/liferay-data/document_library/20157/11701345/68/92/68929.afsh/68929_1.0.afsh\"]}\n, stderr=" backup=velero/bpo1801-tst-20191126035227 controller=pod-volume-backup error="unable to find summary in restic backup command output" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/restic/exec_commands.go:169" error.function=github.com/vmware-tanzu/velero/pkg/restic.getSummaryLine logSource="pkg/controller/pod_volume_backup_controller.go:261" name=bpo1801-tst-20191126035227-ltrt7 namespace=velero
@adawolfs apologies for the delay, we've been kind of swamped with KubeCon and the US holidays.
With the `signal: killed` error you posted originally, I'm wondering if you're hitting an OOM (out-of-memory) issue in the velero restic daemonset pod. Do you have CPU/mem requests/limits on the daemonset? You might start by increasing those (at a minimum, the mem limit).
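For reference, a quick way to inspect the current values (a sketch; it assumes the daemonset is named restic in the velero namespace, which is the default install layout):
## print the resources stanza of the restic daemonset's first container
# kubectl -n velero get daemonset restic -o jsonpath='{.spec.template.spec.containers[0].resources}'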
@adawolfs just checking to see if you had a chance to look at my previous comment.
Closing this out as inactive -- feel free to reach out again as needed.
> With the `signal: killed` error you posted originally, I'm wondering if you're hitting an OOM (out-of-memory) issue in the velero restic daemonset pod. Do you have CPU/mem requests/limits on the daemonset? You might start by increasing those (at a minimum, the mem limit).
Not the original author here, but I have hit the same issue and can confirm that the restic process got killed due to OOM (Velero 1.9.0 & 1.9.1). Fixed by lifting the memory resource limit from 1Gi to 2Gi on the restic DaemonSet, as shown below.
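This is roughly the patch I used (a sketch; it assumes the daemonset and its container are both named restic in the velero namespace):
## raise only the memory limit of the restic container to 2Gi
# kubectl -n velero patch daemonset restic -p '{"spec":{"template":{"spec":{"containers":[{"name":"restic","resources":{"limits":{"memory":"2Gi"}}}]}}}}'
kubectl patch defaults to a strategic merge patch, so the containers list is merged by name and only the memory limit changes; the daemonset then rolls its pods automatically.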
@skriss Since the restic process did consume a little more than 1Gi for at least 2 users, you might want to consider increasing the default in the Velero helm-chart.
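If you're deploying via the helm chart, something like this should achieve the same thing (hedged sketch; the restic.resources.limits.memory key path and the vmware-tanzu repo alias are assumptions, verify them against your chart version's values.yaml):
## bump the restic daemonset memory limit through the chart values
# helm -n velero upgrade velero vmware-tanzu/velero --reuse-values --set restic.resources.limits.memory=2Gi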
FWIW, mem stats for restic when it consumes 1.1Gi (see RssAnon):
# cat /proc/2550974/status
Name: restic
Umask: 0022
State: S (sleeping)
Tgid: 2550974
Ngid: 2551098
Pid: 2550974
PPid: 2549993
TracerPid: 0
Uid: 0 0 0 0
Gid: 0 0 0 0
FDSize: 128
Groups:
NStgid: 2550974 44
NSpid: 2550974 44
NSpgid: 2549993 1
NSsid: 2549993 1
VmPeak: 1882924 kB
VmSize: 1882924 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 1137284 kB
VmRSS: 1137284 kB
RssAnon: 1122232 kB
RssFile: 15052 kB
RssShmem: 0 kB
VmData: 1249844 kB
VmStk: 132 kB
VmExe: 8604 kB
VmLib: 4 kB
VmPTE: 2468 kB
VmSwap: 0 kB
HugetlbPages: 0 kB
CoreDumping: 0
THP_enabled: 1
Threads: 69
SigQ: 1/3090235
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: fffffffd7fc1feff
CapInh: 0000000000000000
CapPrm: 00000000a80425fb
CapEff: 00000000a80425fb
CapBnd: 00000000a80425fb
CapAmb: 0000000000000000
NoNewPrivs: 0
Seccomp: 0
Speculation_Store_Bypass: thread vulnerable
Cpus_allowed: ffffffff,ffffffff
Cpus_allowed_list: 0-63
Mems_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
Mems_allowed_list: 0-1
voluntary_ctxt_switches: 106658
nonvoluntary_ctxt_switches: 3234
## restic process had read 26Gi at this point
# cat /proc/2550974/io | numfmt --to=iec-i --field=2
rchar: 60Gi
wchar: 54Gi
syscr: 1.2Mi
syscw: 3.5Mi
read_bytes: 26Gi
write_bytes: 27Gi
cancelled_write_bytes: 15Gi
## the PVC restic is backing up holds 48Gi of data
# df -Ph |grep pvc
/dev/rbd1 98G 48G 51G 49% /var/lib/kubelet/pods/REDACTED/volumes/kubernetes.io~csi/pvc-REDACTED/mount
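If anyone wants to reproduce the sampling above on the node, a rough sketch (finding the PID via pgrep is my assumption; match on whatever your restic command line looks like):
## grab the newest restic backup process and dump its memory and io stats
# pid=$(pgrep -n -f 'restic backup')
# grep -E 'VmPeak|VmRSS|RssAnon' /proc/$pid/status
# cat /proc/$pid/io | numfmt --to=iec-i --field=2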
Hello!
I have an issue when I try to create backups of some OpenShift projects (namespaces). It seems like restic is unable to back up some volumes. I'm using velero + GCP to back up database and application files from GlusterFS volumes in OpenShift 3.11; the pods were created by DeploymentConfigs, Deployments and StatefulSets.
The annotations were added using something like this:
oc patch deployment/mysql -p '{"spec":{"template":{"metadata":{"annotations":{"backup.velero.io/backup-volumes": "mysql-data"}}}}}'
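To double-check that the annotation actually landed on the running pod, something like this works (a sketch; the -l app=mysql selector is hypothetical, substitute whatever labels your pod carries):
## print the backup-volumes annotation of the first matching pod
# oc get pod -l app=mysql -o jsonpath='{.items[0].metadata.annotations.backup\.velero\.io/backup-volumes}'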
After using `velero backup create backup-name-$(date -u +%Y%m%d%H%M%S) --include-namespaces namespace-name` to create a backup, some of them end in a PartiallyFailed (X errors) status, but this happens only in some projects. In the restic logs I found this error:
`controller=pod-volume-backup error="signal: killed" error.file="/go/src/github.com/heptio/velero/pkg/controller/pod_volume_backup_controller.go:245"`
Could you please help me figure out what is actually making the backups fail?
I'm adding all the useful logs here: https://gist.github.com/adawolfs/aff5b78d068bfbcf5396c36c34a65559
Environment:
- velero version:
- velero client config get features:
- oc version: Server **** openshift v3.11.117 kubernetes v1.11.0+d4cacc0