oetiker / znapzend

zfs backup with remote capabilities and mbuffer integration.
www.znapzend.org
GNU General Public License v3.0
609 stars 137 forks source link

Backups seem to be running twice?? or dst_0_precmd/dst_0_postcmd scripts being run twice?? #562

Closed kevdogg closed 2 years ago

kevdogg commented 2 years ago

I recently started having an issue that it seems when my backups are being run -- the process seems to repeat itself twice -- or I'm getting notified twice about the matter.

Here is further explanation. Host ArchLinux. Znapsend Version 0.21.0-1

My backup schedule:

#sudo znapzendzetup list --recursive --inherited
*** backup plan: zroot/data/timemachine ***
           dst_0 = root@10.0.1.197:tank/backups/zfs_backup/arch-TM
      dst_0_plan = 7days=>6hours,30days=>12hours,90days=>1day,1year=>1week,10years=>1month
    dst_0_precmd = /root/bin/start_zfs_transfer.sh
    dst_0_pstcmd = /root/bin/end_zfs_transfer.sh
         enabled = on
         mbuffer = /usr/local/bin/mbuffer
    mbuffer_size = 200M
   post_znap_cmd = off
    pre_znap_cmd = off
       recursive = on
             src = zroot/data/timemachine
        src_plan = 7days=>6hours,30days=>12hours
        tsformat = %m-%d-%Y-%H:%M:%S
      zend_delay = 0

NOTE: if you have modified your configuration, send a HUP signal
(pkill -HUP znapzend) to your znapzend daemon for it to notice the change.

So I have the following run as pre/post commands to be run:

    dst_0_precmd = /root/bin/start_zfs_transfer.sh
    dst_0_pstcmd = /root/bin/end_zfs_transfer.sh

These scripts basically send an email when the process starts indicating the time and then send another when the transfer ends showing the end time and Elapsed Time (ET).

I'm trying to figure out why each of these pre/post command scripts are being run twice:

For example I'm receiving two post transfer emails (with different elapsed times):

Email #1

ET: 0d, 0h, 0m, 44s

ZFS backup from ARCH-TM-(10.0.1.107) to FreeNAS concluded @ Wednesday Oct 20 2021  6:00:44 AM CDT

Summary...
Elapsed Time: 0 days, 0 hours(s), 0 minutes(s), 44 second(s)
Start Time: Wednesday Oct 20 2021  6:00:00 AM CDT
End Time: Wednesday Oct 20 2021  6:00:44 AM CDT

Email #2

ET: 0d, 0h, 0m, 3s

ZFS backup from ARCH-TM-(10.0.1.107) to FreeNAS concluded @ Wednesday Oct 20 2021  6:00:03 AM CDT

Summary...
Elapsed Time: 0 days, 0 hours(s), 0 minutes(s), 3 second(s)
Start Time: Wednesday Oct 20 2021  6:00:00 AM CDT
End Time: Wednesday Oct 20 2021  6:00:03 AM CDT

Both jobs begin at same time and end at different times.

So I'm not exactly sure if this means the backup job is running twice (I'm doubting it since the start times are the same), or for some reason the dst_0_precmd and dst_0_pstcmd scripts are being called twice. I'm not sure how to debug this issue since for the longest time, I received only one "pair" of emails. Now I'm receiving two "pairs" of emails.

jochendemuth commented 2 years ago

I find systemd logging helpful in understanding/troubleshooting znapzend behavior. Try "journalctl -e -u znapzend"

kevdogg commented 2 years ago

Thanks for your help

So here is my journalctl lines from the last znap run -- to my eyes nothing seems out of the ordinary:

-- Journal begins at Fri 2019-11-08 22:32:38 CST, ends at Fri 2021-10-29 14:31:58 CDT. --
Oct 29 12:00:00 arch-TM znapzend[85745]: snapshot worker for zroot/data/timemachine spawned (526858)
Oct 29 12:00:00 arch-TM znapzend[526858]: creating recursive snapshot on zroot/data/timemachine
Oct 29 12:00:00 arch-TM znapzend[526858]: checking for explicitly excluded ZFS dependent datasets under 'zroot/data/timemachine'
Oct 29 12:00:00 arch-TM znapzend[85745]: snapshot worker for zroot/data/timemachine done (526858)
Oct 29 12:00:00 arch-TM znapzend[85745]: send/receive worker for zroot/data/timemachine spawned (526864)
Oct 29 12:00:00 arch-TM znapzend[526864]: starting work on backupSet zroot/data/timemachine
Oct 29 12:00:00 arch-TM znapzend[526864]: running pre-send-command for root@10.0.1.197:tank/backups/zfs_backup/arch-TM
Oct 29 12:00:00 arch-TM znapzend[526864]: sending snapshots from zroot/data/timemachine to root@10.0.1.197:tank/backups/zfs_backup/arch-TM
Oct 29 12:00:03 arch-TM znapzend[526864]: cleaning up snapshots recursively under destination root@10.0.1.197:tank/backups/zfs_backup/arch-TM
Oct 29 12:00:04 arch-TM znapzend[526864]: now will look if there is anything to clean in children of destination root@10.0.1.197:tank/backups/zfs_backup/arch-TM
Oct 29 12:00:04 arch-TM znapzend[526864]: running post-send-command for root@10.0.1.197:tank/backups/zfs_backup/arch-TM
Oct 29 12:00:04 arch-TM znapzend[526864]: checking to clean up snapshots recursively from source zroot/data/timemachine
Oct 29 12:00:05 arch-TM znapzend[526864]: cleaning up 1 source snapshots recursively under zroot/data/timemachine
Oct 29 12:00:05 arch-TM znapzend[526864]: now will look if there is anything to clean in children of source zroot/data/timemachine
Oct 29 12:00:05 arch-TM znapzend[526864]: done with backupset zroot/data/timemachine in 5 seconds
Oct 29 12:00:05 arch-TM znapzend[85745]: send/receive worker for zroot/data/timemachine done (526864)

Examining the source above I see the pre-send-command and post-send-command firing one time a piece.

The actual start command script is below -- It basically boils down to one mail command:

#!/bin/bash

OUTDIR="/root/tmp"
OUTFILE="${OUTDIR}/epoch_start_time_zfs_transfer"

local_IP=$(/sbin/ip -o -4 addres list eth0 | awk '{print $4}' | cut -d/ -f1)
destination_IP=10.0.1.197

mkdir -p ${OUTDIR}

date=$(date)
date_epoch=$(date +%s)
mail_body(){
        echo "Beginning ZFS backup from ARCH-TM-(${local_IP}) to FreeNAS @ ${date}"
}

printf "START_DATE=%d\n" "${date_epoch}" > ${OUTFILE}
mail_body | mail -s "Beginning ZFS Backup (${local_IP}->${destination_IP})@ ${date}" recipient@gmail.com

I use postfix as my MTA. So I consulted my postfix journalctl logs. What I can see is a total of 4 messages being sent. I'm aware the postfix logs don't report the contents, however they are the two "pair" of messages I'm talking about:

-- Journal begins at Fri 2019-11-08 22:32:38 CST, ends at Fri 2021-10-29 14:46:49 CDT. --
Oct 29 12:00:00 arch-TM postfix/pickup[525873]: 693405B1B3: uid=0 from=<root>
Oct 29 12:00:00 arch-TM postfix/cleanup[526878]: 693405B1B3: message-id=<20211029170000.693405B1B3@arch-TM.domain.com>
Oct 29 12:00:00 arch-TM postfix/qmgr[86090]: 693405B1B3: from=<root@arch-TM.domain.com>, size=481, nrcpt=1 (queue active)
Oct 29 12:00:00 arch-TM postfix/pickup[525873]: 8B37D5B1B5: uid=0 from=<root>
Oct 29 12:00:00 arch-TM postfix/cleanup[526878]: 8B37D5B1B5: message-id=<20211029170000.8B37D5B1B5@arch-TM.domain.com>
Oct 29 12:00:00 arch-TM postfix/qmgr[86090]: 8B37D5B1B5: from=<root@arch-TM.domain.com>, size=487, nrcpt=1 (queue active)
Oct 29 12:00:01 arch-TM postfix/pickup[525873]: 205575B1B7: uid=0 from=<root>
Oct 29 12:00:01 arch-TM postfix/cleanup[526878]: 205575B1B7: message-id=<20211029170001.205575B1B7@arch-TM.domain.com>
Oct 29 12:00:01 arch-TM postfix/qmgr[86090]: 205575B1B7: from=<root@arch-TM.domain.com>, size=703, nrcpt=1 (queue active)
Oct 29 12:00:01 arch-TM postfix/smtp[526883]: 693405B1B3: to=<recipient@gmail.com>, relay=smtp.gmail.com[74.125.202.108]:465, delay=0.89, delays=0.06/0.06/0.34>
Oct 29 12:00:01 arch-TM postfix/qmgr[86090]: 693405B1B3: removed
Oct 29 12:00:01 arch-TM postfix/smtp[526900]: 8B37D5B1B5: to=<recipient@gmail.com>, relay=smtp.gmail.com[74.125.202.108]:465, delay=1.1, delays=0/0.05/0.25/0.7>
Oct 29 12:00:01 arch-TM postfix/qmgr[86090]: 8B37D5B1B5: removed
Oct 29 12:00:01 arch-TM postfix/smtp[526938]: 205575B1B7: to=<recipient@gmail.com>, relay=smtp.gmail.com[74.125.202.108]:465, delay=0.83, delays=0.02/0.08/0.28>
Oct 29 12:00:01 arch-TM postfix/qmgr[86090]: 205575B1B7: removed
Oct 29 12:00:04 arch-TM postfix/pickup[525873]: E73855B1B9: uid=0 from=<root>
Oct 29 12:00:04 arch-TM postfix/cleanup[526878]: E73855B1B9: message-id=<20211029170004.E73855B1B9@arch-TM.domain.com>
Oct 29 12:00:04 arch-TM postfix/qmgr[86090]: E73855B1B9: from=<root@arch-TM.domain.com>, size=703, nrcpt=1 (queue active)
Oct 29 12:00:05 arch-TM postfix/smtp[526883]: E73855B1B9: to=<recipient@gmail.com>, relay=smtp.gmail.com[74.125.202.108]:465, delay=0.69, delays=0/0/0.24/0.45,>
Oct 29 12:00:05 arch-TM postfix/qmgr[86090]: E73855B1B9: removed

So I'm really confused -- I have four emails being sent - but only two commands - precmd and pstcmd which seem to be running per logs once each.

jochendemuth commented 2 years ago

I assume (but please confirm) that you don't receive duplicate messages running the script from the command line.

Looking at the posted postfix log I see 4 messages being picked up (693405B1B3, 8B37D5B1B5, 205575B1B7, E73855B1B9) and each sent exactly once.

If your bash file writes to stdout the text will be added by to the znapzend journal - a behavior I find highly useful.

As a next step I would add the line "mail_body" to the script. The expected behavior is to add a line to the journal for each pre/post command execution (each will execute in a different process). This should clarify if the scripts indeed "seem to be running per logs once each".

Hopefully that adds some insight.

kevdogg commented 2 years ago

So perhaps it's not the program but I'm really stumped to see what other process is calling this script file.

If I run the script manually from the command line - only one email is sent. This is confirmed. So I have two scripts (start, end script), so each script is being triggered twice.

I added the line "mail body" to the end of the script and the journalctl output was as follows:

-- Journal begins at Fri 2019-11-08 22:32:38 CST, ends at Fri 2021-10-29 18:18:02 CDT. --
Oct 29 18:00:00 arch-TM znapzend[85745]: snapshot worker for zroot/data/timemachine spawned (537196)
Oct 29 18:00:00 arch-TM znapzend[537196]: creating recursive snapshot on zroot/data/timemachine
Oct 29 18:00:00 arch-TM znapzend[537196]: checking for explicitly excluded ZFS dependent datasets under 'zroot/data/timemachine'
Oct 29 18:00:00 arch-TM znapzend[85745]: snapshot worker for zroot/data/timemachine done (537196)
Oct 29 18:00:00 arch-TM znapzend[85745]: send/receive worker for zroot/data/timemachine spawned (537202)
Oct 29 18:00:00 arch-TM znapzend[537202]: starting work on backupSet zroot/data/timemachine
Oct 29 18:00:00 arch-TM znapzend[537202]: running pre-send-command for root@10.0.1.197:tank/backups/zfs_backup/arch-TM
Oct 29 18:00:00 arch-TM znapzend[537204]: Beginning ZFS backup from ARCH-TM-(10.0.1.107) to FreeNAS @ Fri Oct 29 18:00:00 CDT 2021
Oct 29 18:00:00 arch-TM znapzend[537202]: sending snapshots from zroot/data/timemachine to root@10.0.1.197:tank/backups/zfs_backup/arch-TM
Oct 29 18:00:03 arch-TM znapzend[537202]: cleaning up snapshots recursively under destination root@10.0.1.197:tank/backups/zfs_backup/arc>
Oct 29 18:00:04 arch-TM znapzend[537202]: now will look if there is anything to clean in children of destination root@10.0.1.197:tank/bac>
Oct 29 18:00:04 arch-TM znapzend[537202]: running post-send-command for root@10.0.1.197:tank/backups/zfs_backup/arch-TM
Oct 29 18:00:04 arch-TM znapzend[537300]: Elapsed Time: 0d, 0h, 0m, 4s
Oct 29 18:00:04 arch-TM znapzend[537300]: ZFS backup from arch-TM-(10.0.1.107) to FreeNAS-(10.0.1.197) concluded @ Friday Oct 29 2021  6:>
Oct 29 18:00:04 arch-TM znapzend[537300]: Summary...
Oct 29 18:00:04 arch-TM znapzend[537300]: Elapsed Time: 0 days, 0 hours(s), 0 minutes(s), 4 second(s)
Oct 29 18:00:04 arch-TM znapzend[537300]: Start Time: Friday Oct 29 2021  6:00:00 PM CDT
Oct 29 18:00:04 arch-TM znapzend[537300]: End Time: Friday Oct 29 2021  6:00:04 PM CDT
Oct 29 18:00:04 arch-TM znapzend[537202]: checking to clean up snapshots recursively from source zroot/data/timemachine
Oct 29 18:00:04 arch-TM znapzend[537202]: cleaning up 1 source snapshots recursively under zroot/data/timemachine
Oct 29 18:00:04 arch-TM znapzend[537202]: now will look if there is anything to clean in children of source zroot/data/timemachine
Oct 29 18:00:04 arch-TM znapzend[537202]: done with backupset zroot/data/timemachine in 4 seconds
Oct 29 18:00:04 arch-TM znapzend[85745]: send/receive worker for zroot/data/timemachine done (537202)

So some other process is calling these two script files -- at the exact same time as znapzend. systemd timers doesn't have anything, and I don't have cron/fcron installed -- so I'm kind of stumped. It's triggered every 6 hours like the znapzend service. Just strange for sure.

I further did some testing since things aren't making sense. I figured if there were two processes calling these script files -- I'll change the name of the script files. I changed the name of the script files. My original plan contained the following:

    dst_0_precmd = /root/bin/start_zfs_transfer.sh
    dst_0_pstcmd = /root/bin/end_zfs_transfer.sh

The new command is the following:

    dst_0_precmd = /root/bin/begin_zfs_transfer.sh
    dst_0_pstcmd = /root/bin/finish_zfs_transfer.sh

I reloaded the daemon via:

pkill -HUP znapzend

Immediately I was sent four emails although the znapzend service lists only two commands being sent:

-- Journal begins at Fri 2019-11-08 22:32:38 CST, ends at Fri 2021-10-29 19:25:35 CDT. --
Oct 29 19:15:59 arch-TM znapzend[85745]: SIGHUP received.
Oct 29 19:15:59 arch-TM znapzend[85745]: refreshing backup plans ...
Oct 29 19:16:00 arch-TM znapzend[85745]: running pre-send-command for root@10.0.1.197:tank/backups/zfs_backup/arch-TM
Oct 29 19:16:00 arch-TM znapzend[542053]: Beginning ZFS backup from ARCH-TM-(10.0.1.107) to FreeNAS @ Fri Oct 29 19:16:00 CDT 2021
Oct 29 19:16:00 arch-TM znapzend[85745]: found a valid backup plan for zroot/data/timemachine...
Oct 29 19:16:00 arch-TM znapzend[85745]: running post-send-command for root@10.0.1.197:tank/backups/zfs_backup/arch-TM
Oct 29 19:16:00 arch-TM znapzend[542076]: Elapsed Time: 0d, 0h, 0m, 0s
Oct 29 19:16:00 arch-TM znapzend[542076]: ZFS backup from arch-TM-(10.0.1.107) to FreeNAS-(10.0.1.197) concluded @ Friday Oct 29 2021  7:16:00>
Oct 29 19:16:00 arch-TM znapzend[542076]: Summary...
Oct 29 19:16:00 arch-TM znapzend[542076]: Elapsed Time: 0 days, 0 hours(s), 0 minutes(s), 0 second(s)
Oct 29 19:16:00 arch-TM znapzend[542076]: Start Time: Friday Oct 29 2021  7:16:00 PM CDT
Oct 29 19:16:00 arch-TM znapzend[542076]: End Time: Friday Oct 29 2021  7:16:00 PM CDT

postfix logs:

-- Journal begins at Fri 2019-11-08 22:32:38 CST, ends at Fri 2021-10-29 19:27:25 CDT. --
Oct 29 19:16:00 arch-TM postfix/pickup[541506]: 4D01F5B2C4: uid=0 from=<root>
Oct 29 19:16:00 arch-TM postfix/cleanup[542075]: 4D01F5B2C4: message-id=<20211030001600.4D01F5B2C4@arch-TM.domain.com>
Oct 29 19:16:00 arch-TM postfix/qmgr[86090]: 4D01F5B2C4: from=<root@arch-TM.domain.com>, size=481, nrcpt=1 (queue active)
Oct 29 19:16:00 arch-TM postfix/pickup[541506]: 539A15B2C6: uid=0 from=<root>
Oct 29 19:16:00 arch-TM postfix/cleanup[542075]: 539A15B2C6: message-id=<20211030001600.539A15B2C6@arch-TM.domain.com>
Oct 29 19:16:00 arch-TM postfix/qmgr[86090]: 539A15B2C6: from=<root@arch-TM.domain.com>, size=487, nrcpt=1 (queue active)
Oct 29 19:16:00 arch-TM postfix/pickup[541506]: 6C5755B2C8: uid=0 from=<root>
Oct 29 19:16:00 arch-TM postfix/cleanup[542075]: 6C5755B2C8: message-id=<20211030001600.6C5755B2C8@arch-TM.domain.com>
Oct 29 19:16:00 arch-TM postfix/qmgr[86090]: 6C5755B2C8: from=<root@arch-TM.domain.com>, size=703, nrcpt=1 (queue active)
Oct 29 19:16:00 arch-TM postfix/pickup[541506]: 758375B2CA: uid=0 from=<root>
Oct 29 19:16:00 arch-TM postfix/cleanup[542075]: 758375B2CA: message-id=<20211030001600.758375B2CA@arch-TM.domain.com>
Oct 29 19:16:00 arch-TM postfix/qmgr[86090]: 758375B2CA: from=<root@arch-TM.domain.com>, size=703, nrcpt=1 (queue active)
Oct 29 19:16:01 arch-TM postfix/smtp[542091]: 4D01F5B2C4: to=<recipient@gmail.com>, relay=smtp.gmail.com[142.250.128.108]:465, delay=1.2, dela>
Oct 29 19:16:01 arch-TM postfix/qmgr[86090]: 4D01F5B2C4: removed
Oct 29 19:16:01 arch-TM postfix/smtp[542124]: 6C5755B2C8: to=<recipient@gmail.com>, relay=smtp.gmail.com[142.250.128.108]:465, delay=1.3, dela>
Oct 29 19:16:01 arch-TM postfix/qmgr[86090]: 6C5755B2C8: removed
Oct 29 19:16:01 arch-TM postfix/smtp[542111]: 539A15B2C6: to=<recipient@gmail.com>, relay=smtp.gmail.com[142.250.128.108]:465, delay=1.5, dela>
Oct 29 19:16:01 arch-TM postfix/qmgr[86090]: 539A15B2C6: removed
Oct 29 19:16:02 arch-TM postfix/smtp[542131]: 758375B2CA: to=<recipient@gmail.com>, relay=smtp.gmail.com[142.250.128.108]:465, delay=1.9, dela>
Oct 29 19:16:02 arch-TM postfix/qmgr[86090]: 758375B2CA: removed

I kind of stumped on this one.

jochendemuth commented 2 years ago

do you happen to accidentally have two instances of znapzend running?

pgrep -a znapzend

kevdogg commented 2 years ago

I'm going to close this issue.

Yes it seemed I had two issues of znapzend running. Not sure how that happened. Thanks for suggestion.