Script leaves zombie processes

tgruenert commented 4 months ago

After the container has been running for some days (huge dataset to back up), zombie processes accumulate.

ps faxu in the container:

PID   USER     TIME  COMMAND
    1 root      0:02 python3 /scripts/backup_client.py schedule 30 22 * * * 
  277 root      0:00 [timeout]
  663 root      0:00 [timeout]
 1047 root      0:00 [timeout]
 1430 root      0:00 [timeout]
 1815 root      0:00 [timeout]
 1918 root      0:00 sh -c clear; (bash || ash || sh)
 1925 root      0:00 bash
 1928 root      0:00 ps faxu

ps faxu on the host:

2165066 ?        Sl     3:33 /var/lib/rancher/rke2/data/v1.28.3-rke2r2-0599290799e6/bin/containerd-shim-runc-v2 -namespace k8s.io -id c336a4d80416e4319028c86da773bb851e234c3e0dd9fde90fd61b
2165085 ?        Ss     0:00  \_ /pause
2165136 ?        Ss     0:02  \_ python3 /scripts/backup_client.py schedule 30 22 * * *
3759910 ?        Zs     0:00  |   \_ [timeout] <defunct>
3012112 ?        Zs     0:00  |   \_ [timeout] <defunct>
2275457 ?        Zs     0:00  |   \_ [timeout] <defunct>
1519177 ?        Zs     0:00  |   \_ [timeout] <defunct>
 753122 ?        Zs     0:00  |   \_ [timeout] <defunct>
2165169 ?        Ss     6:56  \_ python3 /restic_mon.py
2976457 pts/0    Ss     0:00  \_ sh -c clear; (bash || ash || sh)
2976464 pts/0    S+     0:00      \_ bash

I have absolutely no idea what is happening there or how to solve it. Has anybody else seen this?

tgruenert commented 4 months ago

Test in a shell inside the container:

timeout 2 sleep 3

This also leaves a zombie.
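
For anyone who wants to script this check instead of reading ps output by eye, here is a minimal sketch. It assumes a Linux /proc filesystem, and `list_zombies` is a hypothetical helper name, not something from backup_client.py. It lists every process in state Z together with its parent PID, which shows which process is failing to reap it.

```python
import os
import time


def list_zombies():
    """Return (pid, ppid, command) for every process in state Z (Linux only)."""
    zombies = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(f"/proc/{entry}/stat") as f:
                stat = f.read()
        except OSError:
            continue  # process vanished while we were scanning
        # Format: "pid (comm) state ppid ...". comm may contain spaces or
        # parentheses, so split around the LAST closing parenthesis.
        comm_end = stat.rfind(")")
        comm = stat[stat.find("(") + 1:comm_end]
        fields = stat[comm_end + 1:].split()
        state, ppid = fields[0], int(fields[1])
        if state == "Z":
            zombies.append((int(entry), ppid, comm))
    return zombies


if __name__ == "__main__":
    # Demo: create a short-lived child and deliberately do not wait for it yet.
    child = os.fork()
    if child == 0:
        os._exit(0)
    time.sleep(0.5)
    print(list_zombies())  # the child should be listed with our PID as parent
    os.waitpid(child, 0)   # reaping it removes the zombie entry
```

Run inside the container, a parent PID of 1 in the output would match the ps listing above, where the defunct timeout processes hang under the Python script.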

micw commented 4 months ago

https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/
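
The linked article describes the underlying problem: orphaned processes are re-parented to PID 1, and if PID 1 never waits on them they stay around as zombies. In this container PID 1 is the Python script. A minimal sketch of what reaping could look like from Python follows; `reap_children` is a hypothetical helper, and this is not necessarily how the actual fix was implemented.

```python
import os
import subprocess
import sys
import time


def reap_children():
    """Collect all already-terminated children without blocking.

    Returns a list of (pid, raw_wait_status) tuples.
    """
    reaped = []
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append((pid, status))
    return reaped


if __name__ == "__main__":
    # Demo: start a child, never call wait() on the Popen object,
    # then collect it with the generic reaper instead.
    proc = subprocess.Popen([sys.executable, "-c", "pass"])
    time.sleep(1)
    print(proc.pid, reap_children())
```

One caveat with this approach: `os.waitpid(-1, ...)` collects any child, so calling it while another thread is inside `subprocess.run` could take that child's exit status first. Calling it between backup jobs, for example right before scheduling the next run, avoids that overlap.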

tgruenert commented 4 months ago

Just as a further observation: this does not only happen with large datasets. A question that came up: which part of the process runs into a timeout? Is our backup still complete?

micw commented 4 months ago

Timeout is used for pruning: https://github.com/evermind/docker-restic-backupclient/blob/master/backup_client.py#L366

Can you check if the latest build (master) at https://github.com/micw/docker-restic-backupclient/pkgs/container/restic-backupclient solves the issue for you?
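
As an illustration of a prune timeout that does not rely on the external timeout binary, here is a minimal sketch using only the Python standard library. `parse_duration` and `run_with_timeout` are hypothetical names, and this is not taken from backup_client.py. When the timeout expires, `subprocess.run` kills the child and waits for it, so the child itself is not left behind as a zombie.

```python
import subprocess
import sys

# Suffixes as accepted by the coreutils timeout command.
_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_duration(text):
    """Convert a value such as '12h', '30m' or '45' into seconds."""
    text = text.strip()
    if text and text[-1] in _UNITS:
        return float(text[:-1]) * _UNITS[text[-1]]
    return float(text)  # bare number: seconds


def run_with_timeout(cmd, timeout):
    """Run cmd; return its exit code, or None if the timeout was hit."""
    try:
        return subprocess.run(cmd, timeout=parse_duration(timeout)).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed and waited for the child here.
        return None


if __name__ == "__main__":
    # Demo with a stand-in command that sleeps longer than the limit.
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_timeout(slow, "1s"))  # None, because the limit was hit
```

With a value like RESTIC_PRUNE_TIMEOUT=12h, `parse_duration` yields 43200 seconds. Note that only the direct child is killed on timeout; anything that child spawned itself would need separate handling.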

tgruenert commented 4 months ago

Independent of the zombies, a timeout should not occur there:


2024-05-13 22:31:55,752    INFO: Cleanup finished.
2024-05-13 22:31:55,752    INFO: Using extra config from /config/backup.yaml
2024-05-13 22:31:55,753    INFO: Initializing repository
2024-05-13 22:31:56,001    INFO: Repository was already initialized.
2024-05-13 22:31:56,001    INFO: Unlocking repository
2024-05-13 22:31:56,626    INFO: Pruning repository (timeout 12h)
loading indexes...
loading all snapshots...
finding data that is still in use for 16 snapshots
[0:00] 100.00%  16 / 16 snapshots

searching used packs...
collecting packs for deletion and repacking
[0:00] 100.00%  1350 / 1350 packs processed

to repack:           769 blobs / 572.866 MiB
this removes:        694 blobs / 514.378 MiB
to delete:           799 blobs / 558.760 MiB
total prune:        1493 blobs / 1.048 GiB
remaining:         28525 blobs / 21.095 GiB
unused size after prune: 1.049 GiB (4.97% of remaining size)

repacking packs
[0:04] 100.00%  35 / 35 packs repacked

rebuilding index
[0:01] 100.00%  1287 / 1287 packs processed

deleting obsolete index files
[0:00] 100.00%  3 / 3 files deleted

removing 68 old packs
[0:03] 100.00%  68 / 68 files deleted

done
2024-05-13 22:32:09,407    INFO: Prune finished.
2024-05-13 22:32:09,407    INFO: Scheduling next backup at 2024-05-14 22:00:00

and

/usr/bin# printenv | grep TIMEOUT
RESTIC_PRUNE_TIMEOUT=12h

micw commented 4 months ago

Correct. If a timeout occurs, you'll see "Terminated" in the logs.

tgruenert commented 4 months ago

Testing your solution was successful: no more zombies after the backup. Would you open a PR for me, please?

tgruenert commented 4 months ago

Thank you! The PR is merged.

realestatepilot / docker-restic-backupclient

Script leaves zombie processes #12