oetiker / znapzend

zfs backup with remote capabilities and mbuffer integration.
www.znapzend.org
GNU General Public License v3.0
609 stars 137 forks source link

fix race condition in snap/send workers with temp pid value (and new line of log) #536

Closed olidal closed 3 years ago

olidal commented 3 years ago

As shown on the following log excerpt, I had duplicate snapshoting and sending actions. Those seem to be due to a race-condition: a process could decide that a job was still to be done because it found the snapPid/sendPid value to be still equal to zero while a fork was already initiated for the corresponding tasks. My fix is quite simple: assign a non-zero value to these variables before starting the fork. Since at this time the new process id is not yet known, I just assign a value that is not available for a pid (I believe) : ~0 (binary bitwise opposite of zero); I havent extensively tested potential side effects, but at least this seems to fix the duplicate issue I had.

Log excerpt showing the snapshot duplicates:

-- Logs begin at Sun 2020-12-13 12:00:00 CET, end at Sat 2020-12-19 08:51:01 CET. --
Dec 19 08:00:00 safer-backend znapzend[8894]: snapshot worker for zpool/priv/wordpress spawned (7559)
Dec 19 08:00:00 safer-backend znapzend[7559]: creating snapshot on zpool/priv/wordpress
Dec 19 08:00:00 safer-backend znapzend[8894]: snapshot worker for zpool/shared/meta spawned (7560)
Dec 19 08:00:00 safer-backend znapzend[8894]: snapshot worker for zpool/priv/openldap spawned (7561)
Dec 19 08:00:00 safer-backend znapzend[8894]: snapshot worker for zpool/shared/continuous spawned (7563)
Dec 19 08:00:00 safer-backend znapzend[8894]: snapshot worker for zpool/priv/nextcloud spawned (7564)
Dec 19 08:00:00 safer-backend znapzend[7561]: creating snapshot on zpool/priv/openldap
Dec 19 08:00:00 safer-backend znapzend[8894]: snapshot worker for zpool/priv/postgres spawned (7566)
Dec 19 08:00:00 safer-backend znapzend[7560]: creating snapshot on zpool/shared/meta
Dec 19 08:00:00 safer-backend znapzend[7563]: creating snapshot on zpool/shared/continuous
Dec 19 08:00:00 safer-backend znapzend[7564]: creating snapshot on zpool/priv/nextcloud
Dec 19 08:00:00 safer-backend znapzend[7566]: creating snapshot on zpool/priv/postgres
Dec 19 08:00:00 safer-backend znapzend[8894]: snapshot worker for zpool/priv/postgres spawned (7577)
Dec 19 08:00:00 safer-backend znapzend[8894]: snapshot worker for zpool/priv/openldap spawned (7578)
Dec 19 08:00:00 safer-backend znapzend[8894]: snapshot worker for zpool/priv/nextcloud spawned (7579)
Dec 19 08:00:00 safer-backend znapzend[7577]: creating snapshot on zpool/priv/postgres
Dec 19 08:00:00 safer-backend znapzend[8894]: snapshot worker for zpool/priv/wordpress spawned (7580)
Dec 19 08:00:00 safer-backend znapzend[7579]: creating snapshot on zpool/priv/nextcloud
Dec 19 08:00:00 safer-backend znapzend[8894]: snapshot worker for zpool/shared/continuous spawned (7582)
Dec 19 08:00:00 safer-backend znapzend[7578]: creating snapshot on zpool/priv/openldap
Dec 19 08:00:00 safer-backend znapzend[8894]: snapshot worker for zpool/shared/meta spawned (7585)
Dec 19 08:00:00 safer-backend znapzend[7580]: creating snapshot on zpool/priv/wordpress
Dec 19 08:00:00 safer-backend znapzend[7582]: creating snapshot on zpool/shared/continuous
Dec 19 08:00:00 safer-backend znapzend[7585]: creating snapshot on zpool/shared/meta
Dec 19 08:00:00 safer-backend znapzend[8894]: snapshot worker for zpool/priv/wordpress done (7559)
Dec 19 08:00:00 safer-backend znapzend[8894]: send/receive worker for zpool/priv/wordpress spawned (7595)
Dec 19 08:00:00 safer-backend znapzend[7595]: waiting 34 seconds before sending snaps on backupSet zpool/priv/wordpress...
Dec 19 08:00:01 safer-backend znapzend[8894]: cannot create snapshot 'zpool/priv/wordpress@2020-12-19-080000': dataset already exists
Dec 19 08:00:01 safer-backend znapzend[8894]: snapshot worker for zpool/priv/openldap done (7561)
Dec 19 08:00:01 safer-backend znapzend[8894]: snapshot worker for zpool/priv/postgres done (7566)
Dec 19 08:00:01 safer-backend znapzend[8894]: snapshot worker for zpool/shared/meta done (7560)
Dec 19 08:00:01 safer-backend znapzend[8894]: snapshot worker for zpool/priv/nextcloud done (7564)
Dec 19 08:00:01 safer-backend znapzend[8894]: snapshot worker for zpool/shared/continuous done (7563)
Dec 19 08:00:01 safer-backend znapzend[8894]: send/receive worker for zpool/priv/openldap spawned (7597)
Dec 19 08:00:01 safer-backend znapzend[8894]: send/receive worker for zpool/priv/postgres spawned (7598)
Dec 19 08:00:01 safer-backend znapzend[8894]: send/receive worker for zpool/shared/meta spawned (7599)
Dec 19 08:00:01 safer-backend znapzend[8894]: send/receive worker for zpool/priv/nextcloud spawned (7600)
Dec 19 08:00:01 safer-backend znapzend[8894]: send/receive worker for zpool/shared/continuous spawned (7601)
Dec 19 08:00:01 safer-backend znapzend[7599]: waiting 34 seconds before sending snaps on backupSet zpool/shared/meta...
Dec 19 08:00:01 safer-backend znapzend[7600]: waiting 34 seconds before sending snaps on backupSet zpool/priv/nextcloud...
Dec 19 08:00:01 safer-backend znapzend[7597]: waiting 34 seconds before sending snaps on backupSet zpool/priv/openldap...
Dec 19 08:00:01 safer-backend znapzend[7598]: waiting 34 seconds before sending snaps on backupSet zpool/priv/postgres...
Dec 19 08:00:01 safer-backend znapzend[7601]: waiting 34 seconds before sending snaps on backupSet zpool/shared/continuous...
Dec 19 08:00:01 safer-backend znapzend[7580]: snapshot 'zpool/priv/wordpress@2020-12-19-080000' does already exist. skipping one round...
Dec 19 08:00:01 safer-backend znapzend[8894]: snapshot worker for zpool/priv/wordpress done (7580)
Dec 19 08:00:01 safer-backend znapzend[8894]: send/receive worker for zpool/priv/wordpress spawned (7603)
Dec 19 08:00:01 safer-backend znapzend[7603]: waiting 34 seconds before sending snaps on backupSet zpool/priv/wordpress...
Dec 19 08:00:02 safer-backend znapzend[8894]: cannot create snapshot 'zpool/shared/continuous@2020-12-19-080000': dataset already exists
Dec 19 08:00:02 safer-backend znapzend[7582]: snapshot 'zpool/shared/continuous@2020-12-19-080000' does already exist. skipping one round...
Dec 19 08:00:02 safer-backend znapzend[8894]: snapshot worker for zpool/shared/continuous done (7582)
Dec 19 08:00:02 safer-backend znapzend[8894]: send/receive worker for zpool/shared/continuous spawned (7606)
Dec 19 08:00:02 safer-backend znapzend[7606]: waiting 34 seconds before sending snaps on backupSet zpool/shared/continuous...
Dec 19 08:00:03 safer-backend znapzend[8894]: cannot create snapshot 'zpool/shared/meta@2020-12-19-080000': dataset already exists
Dec 19 08:00:03 safer-backend znapzend[8894]: cannot create snapshot 'zpool/priv/postgres@2020-12-19-080000': dataset already exists
Dec 19 08:00:03 safer-backend znapzend[8894]: cannot create snapshot 'zpool/priv/nextcloud@2020-12-19-080000': dataset already exists
Dec 19 08:00:03 safer-backend znapzend[8894]: cannot create snapshot 'zpool/priv/openldap@2020-12-19-080000': dataset already exists
Dec 19 08:00:03 safer-backend znapzend[7585]: snapshot 'zpool/shared/meta@2020-12-19-080000' does already exist. skipping one round...
Dec 19 08:00:03 safer-backend znapzend[8894]: snapshot worker for zpool/shared/meta done (7585)

The following shows similar issue observed with send duplicate worker:

Dec 19 08:00:38 safer-backend znapzend[7603]: ERROR: cannot send snapshots to zpool/user-7CF1/zpool/wordpress on user-7CF1@hostname.removed
Dec 19 08:00:38 safer-backend znapzend[7603]: ERROR: suspending cleanup source dataset zpool/priv/wordpress because 1 send task(s) failed:
Dec 19 08:00:38 safer-backend znapzend[7603]:  +-->   ERROR: cannot send snapshots to zpool/user-7CF1/zpool/wordpress on user-7CF1@hostname.removed
Dec 19 08:00:38 safer-backend znapzend[7603]: done with backupset zpool/priv/wordpress in 3 seconds
olidal commented 3 years ago

Arg. I messed up and inadvertantly pushed another change I made in this PR branch. It is pretty harmless: It adds generation of a new log line that shows the names of the snapshots being sent (and not only the dataset). this is actually useful for computing the actual size of the data being sent by simply parsing the logs and running running zfs send -Pvn on corresponding log lines.

olidal commented 3 years ago

I still observe duplicates, the proposed change does not seem to be fixing the issue.