k8up-io / k8up

Kubernetes and OpenShift Backup Operator
https://k8up.io/
Apache License 2.0

Failed backup stated as succeeded #910

Open xoxys opened 7 months ago

xoxys commented 7 months ago

Description

Backups are reported as Succeeded even when the backup command failed. The log below is just an example; the cause of that particular failure has already been fixed. It is still a problem that a defective backup is reported as successful.

Additional Context

No response

Logs

❯ kubectl get backups.k8up.io -n authelia-public 
NAME                    SCHEDULE REF   COMPLETION   PREBACKUP   AGE
postgres                               Succeeded    Finished    12h
postgres-backup-qbq6q   postgres       Succeeded    Finished    12h
2023-11-18T23:02:31Z    INFO    k8up    Starting k8up…  {"version": "2.7.2", "date": "2023-10-09T10:13:29Z", "commit": "45d99dd90dbb2a080e6832c34e96b371216a3e0b", "go_os": "linux", "go_arch": "amd64", "go_version": "go1.19.13", "uid": 65532, "gid": 0}
2023-11-18T23:02:31Z    INFO    k8up.restic initializing
2023-11-18T23:02:31Z    INFO    k8up.restic setting up a signal handler
2023-11-18T23:02:31Z    INFO    k8up.restic.restic  using the following restic options  {"options": [""]}
2023-11-18T23:02:31Z    INFO    k8up.restic.restic.RepoInit.command restic command  {"path": "/usr/local/bin/restic", "args": ["init", "--option", ""]}
2023-11-18T23:02:31Z    INFO    k8up.restic.restic.RepoInit.command Defining RESTIC_PROGRESS_FPS    {"frequency": 0.016666666666666666}
2023-11-18T23:02:32Z    INFO    k8up.restic.restic.unlock   unlocking repository    {"all": false}
2023-11-18T23:02:32Z    INFO    k8up.restic.restic.unlock.command   restic command  {"path": "/usr/local/bin/restic", "args": ["unlock", "--option", ""]}
2023-11-18T23:02:32Z    INFO    k8up.restic.restic.unlock.command   Defining RESTIC_PROGRESS_FPS    {"frequency": 0.016666666666666666}
2023-11-18T23:02:36Z    INFO    k8up.restic.restic.snapshots    getting list of snapshots
2023-11-18T23:02:36Z    INFO    k8up.restic.restic.snapshots.command    restic command  {"path": "/usr/local/bin/restic", "args": ["snapshots", "--option", "", "--json"]}
2023-11-18T23:02:36Z    INFO    k8up.restic.restic.snapshots.command    Defining RESTIC_PROGRESS_FPS    {"frequency": 0.016666666666666666}
2023-11-18T23:02:43Z    INFO    k8up.restic.k8sClient   listing all pods    {"annotation": "k8up.io/backupcommand", "namespace": "authelia-public"}
2023-11-18T23:02:43Z    INFO    k8up.restic.k8sClient   adding to backup list   {"namespace": "authelia-public", "pod": "pgdump-77788b7db9-n4tp6"}
2023-11-18T23:02:43Z    INFO    k8up.restic.k8sExec executing command   {"command": "sh, -c, chmod 600 /var/lib/postgresql/.pgpass && pg_dump --clean", "namespace": "authelia-public", "pod": "pgdump-77788b7db9-n4tp6"}
2023-11-18T23:02:43Z    INFO    k8up.restic.restic.stdinBackup  starting stdin backup   {"filename": "/authelia-public-pgdump", "extension": ".sql"}
2023-11-18T23:02:43Z    INFO    k8up.restic.restic.stdinBackup.command  restic command  {"path": "/usr/local/bin/restic", "args": ["backup", "--option", "", "--stdin-filename", "/authelia-public-pgdump.sql", "--host", "authelia-public", "--json", "--stdin"]}
2023-11-18T23:02:43Z    INFO    k8up.restic.restic.stdinBackup.command  Defining RESTIC_PROGRESS_FPS    {"frequency": 0.016666666666666666}
2023-11-18T23:02:43Z    INFO    k8up.restic.pgdump-77788b7db9-n4tp6.stderr  chmod: changing permissions of '/var/lib/postgresql/.pgpass': Read-only file system
2023-11-18T23:02:43Z    ERROR   k8up.restic.k8sExec streaming data failed   {"namespace": "authelia-public", "pod": "pgdump-77788b7db9-n4tp6", "error": "command terminated with exit code 1"}
github.com/k8up-io/k8up/v2/restic/kubernetes.PodExec.func1
    /home/runner/work/k8up/k8up/restic/kubernetes/pod_exec.go:74
2023-11-18T23:02:48Z    INFO    k8up.restic.restic.stdinBackup.progress restic output   {"msg": "{\"message_type\":\"error\",\"error\":{\"Op\":\"read\",\"Path\":\"/authelia-public-pgdump.sql\",\"Err\":{}},\"during\":\"archival\",\"item\":\"/authelia-public-pgdump.sql\"}"}
2023-11-18T23:02:48Z    ERROR   k8up.restic.restic.stdinBackup.progress /authelia-public-pgdump.sql during archival read    {"error": "error occurred during backup"}
github.com/k8up-io/k8up/v2/restic/logging.(*BackupOutputParser).out
    /home/runner/work/k8up/k8up/restic/logging/logging.go:156
github.com/k8up-io/k8up/v2/restic/logging.writer.Write
    /home/runner/work/k8up/k8up/restic/logging/logging.go:103
io.copyBuffer
    /opt/hostedtoolcache/go/1.19.13/x64/src/io/io.go:429
io.Copy
    /opt/hostedtoolcache/go/1.19.13/x64/src/io/io.go:386
os/exec.(*Cmd).writerDescriptor.func1
    /opt/hostedtoolcache/go/1.19.13/x64/src/os/exec/exec.go:407
os/exec.(*Cmd).Start.func1
    /opt/hostedtoolcache/go/1.19.13/x64/src/os/exec/exec.go:544
2023-11-18T23:02:48Z    INFO    k8up.restic.restic.stdinBackup.progress backup finished {"new files": 0, "changed files": 0, "errors": 1}
2023-11-18T23:02:48Z    INFO    k8up.restic.restic.stdinBackup.progress stats   {"time": 2.627027521, "bytes added": 0, "bytes processed": 0}
2023-11-18T23:02:48Z    INFO    k8up.restic.restic.MountCollector   stats mount dir doesn't exist, skipping stats   {"dir": "/data"}
2023-11-18T23:02:49Z    INFO    k8up.restic.restic.stdinBackup.progress restic output   {"msg": "Warning: at least one source file could not be read"}
2023-11-18T23:02:49Z    INFO    k8up.restic backups of annotated jobs have finished successfully
2023-11-18T23:02:49Z    INFO    k8up.restic.restic.backup   starting backup
2023-11-18T23:02:49Z    INFO    k8up.restic.restic.backup   backupdir does not exist, skipping. Sending snapshot list   {"dirname": "/data"}
2023-11-18T23:02:49Z    INFO    k8up.restic.restic.snapshots    getting list of snapshots
2023-11-18T23:02:49Z    INFO    k8up.restic.restic.snapshots.command    restic command  {"path": "/usr/local/bin/restic", "args": ["snapshots", "--option", "", "--json"]}
2023-11-18T23:02:49Z    INFO    k8up.restic.restic.snapshots.command    Defining RESTIC_PROGRESS_FPS    {"frequency": 0.016666666666666666}


Expected Behavior

Failed backups should be reported as failed instead.

Steps To Reproduce

No response

Version of K8up

v2.7.2

Version of Kubernetes

v1.27.7+k3s1

Distribution of Kubernetes

K3s

poyaz commented 7 months ago

Hi

I looked into this problem, and after testing I found that it is caused by how the restic command behaves.

When the backupcommand annotation is executed, the output of the command is piped into the stdin of the restic command, and restic stores the streamed data in the snapshot.

Unfortunately, we can't fix this directly because of restic. However, K8up produces a backup summary that can be used to detect the status of a backup. You can use a webhook or Prometheus to get that status.

I also have a proposal for fixing this problem: we could add an annotation to handle errors in the backup command and delete the snapshot when the backup command fails. This would be backward compatible.
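
To illustrate why a failing backup command can go unnoticed when its output is streamed into another process, here is a minimal Python sketch. It is not K8up's implementation (K8up is written in Go and execs into the pod); `run_pipeline`, `pipeline_succeeded`, `PRODUCER` and `CONSUMER` are hypothetical names, and the consumer only stands in for a process that reads from stdin the way restic does here.

```python
import subprocess
import sys

# Hypothetical stand-in for a backup command that writes some data and then fails.
PRODUCER = [sys.executable, "-c",
            "import sys; sys.stdout.write('partial dump'); sys.exit(1)"]
# Hypothetical stand-in for the process that stores whatever arrives on stdin.
CONSUMER = [sys.executable, "-c",
            "import sys; sys.stdout.write(sys.stdin.read())"]


def run_pipeline(producer_args, consumer_args):
    """Pipe the producer's stdout into the consumer's stdin.

    Returns (producer exit code, consumer exit code, bytes the consumer wrote).
    """
    producer = subprocess.Popen(producer_args, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(consumer_args, stdin=producer.stdout,
                                stdout=subprocess.PIPE)
    # Close our copy of the read end so the producer receives SIGPIPE
    # if the consumer exits early.
    producer.stdout.close()
    stored, _ = consumer.communicate()
    producer_rc = producer.wait()
    return producer_rc, consumer.returncode, stored


def pipeline_succeeded(producer_rc, consumer_rc):
    """Treat the run as successful only if both sides exited with 0."""
    return producer_rc == 0 and consumer_rc == 0
```

In this sketch the consumer exits with 0 and happily stores the partial data, so looking at the consumer alone would report success. Only the producer's exit code reveals the failure, which is the kind of check the proposed annotation would need to act on.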

@xoxys @Kidswiss

roobre commented 1 month ago

Just wanted to +1 this. I ran into this problem with two different workloads, for different reasons (xz not being available, and a wrong env syntax for postgres).

If I hadn't checked manually with restic, I wouldn't have noticed this! I think having failed backups marked as such would be a great UX improvement.

johbo commented 1 month ago

I also bumped into this problem, in my case probably due to file permission issues. I spotted the following in the logs:

INFO    k8up.restic.restic.backup.progress    restic output    {"msg": "Warning: at least one source file could not be read"}   

It would be great if there were a way to make the issue more visible. I only noticed it while testing the restore procedure.
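
Until the status is reported correctly, one way to make such failures more visible is to scan the backup pod's log for the signals that appear in this thread. The following Python sketch is only a heuristic under that assumption: `find_backup_problems` and `WARNING_TEXT` are hypothetical names, and the strings it matches are taken from the log excerpts above, not from any documented K8up log format.

```python
import json

# Warning text as it appears in the restic output quoted in this thread.
WARNING_TEXT = "at least one source file could not be read"


def find_backup_problems(log_text):
    """Return a list of (line number, reason) for suspicious log lines."""
    problems = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        tokens = line.split()
        # The level is the first or second field, depending on whether
        # the line still carries its timestamp.
        if "ERROR" in tokens[:2]:
            problems.append((number, "error-level log line"))
        if WARNING_TEXT in line:
            problems.append((number, "restic could not read a source"))
        if "backup finished" in line and "{" in line:
            # The summary line ends with a JSON object holding an "errors" count.
            try:
                summary = json.loads(line[line.find("{"):])
            except ValueError:
                continue
            if summary.get("errors", 0) > 0:
                problems.append((number, "backup finished with errors"))
    return problems
```

A non-empty result could then be used to fail a CI check or raise an alert, for example by feeding it the output of `kubectl logs` for the backup pod.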

damsien commented 2 weeks ago

Concerning the backup procedure, I solved the issue by specifying the following under the spec of the Backup object (and of the Schedule object as well).

  podSecurityContext:
    fsGroup: 0
    runAsUser: 0

I wanted to back up my Nextcloud, where the data directory is only accessible to www-data:www-data, so I ran my Backup as user 0 (root) on purpose. I think this only works if you are allowed to run the k8up jobs as root on the cluster. Otherwise, you should try another user ID and group ID.