Fix race condition again

olidal commented 3 years ago

(this is the second attempt at fixing this issue today. Previous PR for this same issue was ineffective.)

As shown on the following log excerpt, I had duplicate snapshoting and sending actions. Those seem to be due to a race-condition: a process could decide that a job was still to be done because it found the snapPid/sendPid value to be still equal to zero while a fork was already initiated for the corresponding tasks. My fix is quite simple: assign a non-zero value to these variables at the time the work is scheluled. Since at this time the new process id is not yet known, I just assign a value that is not available for a pid (I believe) : ~0 (binary bitwise opposite of zero); I havent extensively tested potential side effects, but at least this seems to fix the duplicate issue I had.

Log excerpt showing the snapshot duplicates:

-- Logs begin at Sun 2020-12-13 12:00:00 CET, end at Sat 2020-12-19 08:51:01 CET. --
Dec 19 08:00:00 safer-backend znapzend[8894]: snapshot worker for zpool/priv/wordpress spawned (7559)
Dec 19 08:00:00 safer-backend znapzend[7559]: creating snapshot on zpool/priv/wordpress
Dec 19 08:00:00 safer-backend znapzend[8894]: snapshot worker for zpool/shared/meta spawned (7560)
Dec 19 08:00:00 safer-backend znapzend[8894]: snapshot worker for zpool/priv/openldap spawned (7561)
Dec 19 08:00:00 safer-backend znapzend[8894]: snapshot worker for zpool/shared/continuous spawned (7563)
Dec 19 08:00:00 safer-backend znapzend[8894]: snapshot worker for zpool/priv/nextcloud spawned (7564)
Dec 19 08:00:00 safer-backend znapzend[7561]: creating snapshot on zpool/priv/openldap
Dec 19 08:00:00 safer-backend znapzend[8894]: snapshot worker for zpool/priv/postgres spawned (7566)
Dec 19 08:00:00 safer-backend znapzend[7560]: creating snapshot on zpool/shared/meta
Dec 19 08:00:00 safer-backend znapzend[7563]: creating snapshot on zpool/shared/continuous
Dec 19 08:00:00 safer-backend znapzend[7564]: creating snapshot on zpool/priv/nextcloud
Dec 19 08:00:00 safer-backend znapzend[7566]: creating snapshot on zpool/priv/postgres
Dec 19 08:00:00 safer-backend znapzend[8894]: snapshot worker for zpool/priv/postgres spawned (7577)
Dec 19 08:00:00 safer-backend znapzend[8894]: snapshot worker for zpool/priv/openldap spawned (7578)
Dec 19 08:00:00 safer-backend znapzend[8894]: snapshot worker for zpool/priv/nextcloud spawned (7579)
Dec 19 08:00:00 safer-backend znapzend[7577]: creating snapshot on zpool/priv/postgres
Dec 19 08:00:00 safer-backend znapzend[8894]: snapshot worker for zpool/priv/wordpress spawned (7580)
Dec 19 08:00:00 safer-backend znapzend[7579]: creating snapshot on zpool/priv/nextcloud
Dec 19 08:00:00 safer-backend znapzend[8894]: snapshot worker for zpool/shared/continuous spawned (7582)
Dec 19 08:00:00 safer-backend znapzend[7578]: creating snapshot on zpool/priv/openldap
Dec 19 08:00:00 safer-backend znapzend[8894]: snapshot worker for zpool/shared/meta spawned (7585)
Dec 19 08:00:00 safer-backend znapzend[7580]: creating snapshot on zpool/priv/wordpress
Dec 19 08:00:00 safer-backend znapzend[7582]: creating snapshot on zpool/shared/continuous
Dec 19 08:00:00 safer-backend znapzend[7585]: creating snapshot on zpool/shared/meta
Dec 19 08:00:00 safer-backend znapzend[8894]: snapshot worker for zpool/priv/wordpress done (7559)
Dec 19 08:00:00 safer-backend znapzend[8894]: send/receive worker for zpool/priv/wordpress spawned (7595)
Dec 19 08:00:00 safer-backend znapzend[7595]: waiting 34 seconds before sending snaps on backupSet zpool/priv/wordpress...
Dec 19 08:00:01 safer-backend znapzend[8894]: cannot create snapshot 'zpool/priv/wordpress@2020-12-19-080000': dataset already exists
Dec 19 08:00:01 safer-backend znapzend[8894]: snapshot worker for zpool/priv/openldap done (7561)
Dec 19 08:00:01 safer-backend znapzend[8894]: snapshot worker for zpool/priv/postgres done (7566)
Dec 19 08:00:01 safer-backend znapzend[8894]: snapshot worker for zpool/shared/meta done (7560)
Dec 19 08:00:01 safer-backend znapzend[8894]: snapshot worker for zpool/priv/nextcloud done (7564)
Dec 19 08:00:01 safer-backend znapzend[8894]: snapshot worker for zpool/shared/continuous done (7563)
Dec 19 08:00:01 safer-backend znapzend[8894]: send/receive worker for zpool/priv/openldap spawned (7597)
Dec 19 08:00:01 safer-backend znapzend[8894]: send/receive worker for zpool/priv/postgres spawned (7598)
Dec 19 08:00:01 safer-backend znapzend[8894]: send/receive worker for zpool/shared/meta spawned (7599)
Dec 19 08:00:01 safer-backend znapzend[8894]: send/receive worker for zpool/priv/nextcloud spawned (7600)
Dec 19 08:00:01 safer-backend znapzend[8894]: send/receive worker for zpool/shared/continuous spawned (7601)
Dec 19 08:00:01 safer-backend znapzend[7599]: waiting 34 seconds before sending snaps on backupSet zpool/shared/meta...
Dec 19 08:00:01 safer-backend znapzend[7600]: waiting 34 seconds before sending snaps on backupSet zpool/priv/nextcloud...
Dec 19 08:00:01 safer-backend znapzend[7597]: waiting 34 seconds before sending snaps on backupSet zpool/priv/openldap...
Dec 19 08:00:01 safer-backend znapzend[7598]: waiting 34 seconds before sending snaps on backupSet zpool/priv/postgres...
Dec 19 08:00:01 safer-backend znapzend[7601]: waiting 34 seconds before sending snaps on backupSet zpool/shared/continuous...
Dec 19 08:00:01 safer-backend znapzend[7580]: snapshot 'zpool/priv/wordpress@2020-12-19-080000' does already exist. skipping one round...
Dec 19 08:00:01 safer-backend znapzend[8894]: snapshot worker for zpool/priv/wordpress done (7580)
Dec 19 08:00:01 safer-backend znapzend[8894]: send/receive worker for zpool/priv/wordpress spawned (7603)
Dec 19 08:00:01 safer-backend znapzend[7603]: waiting 34 seconds before sending snaps on backupSet zpool/priv/wordpress...
Dec 19 08:00:02 safer-backend znapzend[8894]: cannot create snapshot 'zpool/shared/continuous@2020-12-19-080000': dataset already exists
Dec 19 08:00:02 safer-backend znapzend[7582]: snapshot 'zpool/shared/continuous@2020-12-19-080000' does already exist. skipping one round...
Dec 19 08:00:02 safer-backend znapzend[8894]: snapshot worker for zpool/shared/continuous done (7582)
Dec 19 08:00:02 safer-backend znapzend[8894]: send/receive worker for zpool/shared/continuous spawned (7606)
Dec 19 08:00:02 safer-backend znapzend[7606]: waiting 34 seconds before sending snaps on backupSet zpool/shared/continuous...
Dec 19 08:00:03 safer-backend znapzend[8894]: cannot create snapshot 'zpool/shared/meta@2020-12-19-080000': dataset already exists
Dec 19 08:00:03 safer-backend znapzend[8894]: cannot create snapshot 'zpool/priv/postgres@2020-12-19-080000': dataset already exists
Dec 19 08:00:03 safer-backend znapzend[8894]: cannot create snapshot 'zpool/priv/nextcloud@2020-12-19-080000': dataset already exists
Dec 19 08:00:03 safer-backend znapzend[8894]: cannot create snapshot 'zpool/priv/openldap@2020-12-19-080000': dataset already exists
Dec 19 08:00:03 safer-backend znapzend[7585]: snapshot 'zpool/shared/meta@2020-12-19-080000' does already exist. skipping one round...
Dec 19 08:00:03 safer-backend znapzend[8894]: snapshot worker for zpool/shared/meta done (7585)

The following shows similar issue observed with send duplicate worker:

Dec 19 08:00:38 safer-backend znapzend[7603]: ERROR: cannot send snapshots to zpool/user-7CF1/zpool/wordpress on user-7CF1@hostname.removed
Dec 19 08:00:38 safer-backend znapzend[7603]: ERROR: suspending cleanup source dataset zpool/priv/wordpress because 1 send task(s) failed:
Dec 19 08:00:38 safer-backend znapzend[7603]:  +-->   ERROR: cannot send snapshots to zpool/user-7CF1/zpool/wordpress on user-7CF1@hostname.removed
Dec 19 08:00:38 safer-backend znapzend[7603]: done with backupset zpool/priv/wordpress in 3 seconds

olidal commented 3 years ago

After quite a few try and error I finally found the real cause of the race conditions: znapzend daemon was receiving a SIGHUP right after starting up (possibly part of my systemd spawn process) which ended up with reentrance in the refresh section. The fix for that was simply adding an exmut flag. However I kept in this PR my previous additions inserted while attempting to fix the race as I believe they are no harm and may even prevent future race conditions (eg. outputing a log prior to setting a lock is dangerous if logging triggers an OS call that ends up preempting the current process). Even though it is unrelated, I also added a new log message that prints the names of the snapshots being sent, which is helpful to compute the amount of data being sent (using zfs send -Pvn)

oetiker commented 3 years ago

Thanks for all the work! But in order to keep the source as uncluttered as possible I would prefer if you could make the PR just fix the problem you actually observe ... could you provide an update?

oetiker commented 3 years ago

ping

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

oetiker / znapzend

Fix race condition again #537