Closed olidal closed 2 years ago
After quite a bit of trial and error I finally found the real cause of the race conditions: the znapzend daemon was receiving a SIGHUP right after starting up (possibly as part of my systemd spawn process), which led to re-entrance in the refresh section. The fix for that was simply adding an exmut flag. However, I kept in this PR the earlier additions I made while attempting to fix the race, as I believe they do no harm and may even prevent future race conditions (e.g. outputting a log message before setting a lock is dangerous if logging triggers an OS call that ends up preempting the current process). Even though it is unrelated, I also added a new log message that prints the names of the snapshots being sent, which is helpful for computing the amount of data being sent (using zfs send -Pvn).
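To illustrate the re-entrance guard described above, here is a minimal sketch in Python (znapzend itself is Perl, so this is not the actual patch). The `Daemon` class, the `refreshing` flag and the `on_reload` callback are hypothetical names; the callback only stands in for a second SIGHUP arriving while the first refresh is still running.

```python
class Daemon:
    """Hypothetical stand-in for the daemon; not znapzend's real code."""

    def __init__(self):
        self.refreshing = False  # guard flag: True while a refresh is running
        self.refresh_count = 0   # refreshes that actually ran
        self.skipped = 0         # refreshes rejected by the guard

    def refresh(self, on_reload=None):
        # Check and set the guard before doing anything else, logging included.
        if self.refreshing:
            self.skipped += 1
            return False
        self.refreshing = True
        try:
            self.refresh_count += 1
            if on_reload is not None:
                on_reload()  # simulates a SIGHUP landing mid-refresh
        finally:
            self.refreshing = False  # always release the guard
        return True


daemon = Daemon()
# The nested call plays the role of the second, re-entrant refresh.
outer_result = daemon.refresh(on_reload=daemon.refresh)
```

In this sketch the nested call is rejected, so the refresh body runs only once. Setting the flag as the very first step mirrors the point above about not logging before the lock is taken.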
Thanks for all the work! But in order to keep the source as uncluttered as possible I would prefer if you could make the PR just fix the problem you actually observe ... could you provide an update?
ping
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
(this is the second attempt at fixing this issue today. Previous PR for this same issue was ineffective.)
As shown in the following log excerpt, I had duplicate snapshotting and sending actions. Those seem to be due to a race condition: a process could decide that a job was still to be done because it found the snapPid/sendPid value still equal to zero while a fork had already been initiated for the corresponding task. My fix is quite simple: assign a non-zero value to these variables at the time the work is scheduled. Since the new process id is not yet known at that point, I just assign a value that (I believe) cannot be a valid pid: ~0 (the bitwise complement of zero). I haven't extensively tested potential side effects, but at least this seems to fix the duplicate issue I had.
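The idea of claiming the job with a placeholder before the fork can be sketched roughly as follows. This is an illustrative Python model, not the Perl change itself: `Job`, `schedule_snapshot`, `fake_spawn`, the `PENDING` constant and the pid 4242 are all made up for the example, and `PENDING` is simply assumed to be a value no real child pid can take (the PR uses ~0 for that purpose).

```python
PENDING = -1  # assumed sentinel: never a valid child pid (hypothetical)


class Job:
    def __init__(self, name):
        self.name = name
        self.snap_pid = 0  # 0 means "no snapshot worker scheduled yet"


def schedule_snapshot(job, spawn):
    """Start a worker unless one is already running or pending."""
    if job.snap_pid != 0:
        return False          # someone already claimed this job
    job.snap_pid = PENDING    # claim the job before the fork happens
    job.snap_pid = spawn(job) # replace the sentinel with the real pid
    return True


spawned = []    # names of jobs for which a worker was actually started
reentered = []  # results of the racing scheduling attempts


def fake_spawn(job):
    # Simulate a second scheduling pass that runs before the pid is stored.
    reentered.append(schedule_snapshot(job, fake_spawn))
    spawned.append(job.name)
    return 4242  # made-up pid for illustration


job = Job("tank/data")
first = schedule_snapshot(job, fake_spawn)
```

Here the racing attempt sees the sentinel rather than zero and backs off, so only one worker is started. Without the `PENDING` assignment, the nested call would find `snap_pid == 0` and schedule the same job again.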
Log excerpt showing the snapshot duplicates:
The following shows a similar issue observed with a duplicate send worker: