autoCreation broken in 0.22.0.

stuckj commented 1 month ago

Howdie,

I've been using znapzend for several years. Thanks BTW, it's been a big help!

TL;DR:

Autocreation is broken in 0.22.0.
New enable-dst-autocreation command for znapzendzetup is also broken UNLESS the autocreation property already exists on the dataset.
New autocreation property (set manually) doesn't fix auto-creation...still doesn't work between datasets.
Script demonstrating problem is below.
Outputs for the script from several versions below.

Details:

While setting up a new desktop I noticed that the latest version from git (and 0.22.0) appears to have broken autoCreation. Specifically, if you have a simple setup with a recursive backup definition at the root and propagated to all sub-datasets, the initial backup to a destination will not auto-create those sub-datasets. Also, if you make new datasets in the source they won't get sent to the destination.

Furthermore, the new enable-dst-autocreation feature on znapzendzetup only works if the autocreation property already exists on the dataset. It won't set te property if it's not there already.

My laptop has 0.21.1 installed and it properly auto-creates sub-dataset to my server. The new desktop is running proxmox and I'm using my znapzend docker wrapper for proxmox (https://hub.docker.com/r/stuckj/znapzend-proxmox) which currently uses the latest from git (going to change that to 0.21.2 since discovering this). That's how I noticed the discrepancy since it works fine on my laptop, but not on the new desktop.

I made a super-simple script to demonstrate the problem (autocreate-test.sh.gz). You must run it as root since it creates test pools (and destroys them first). Do look it over before running it just in case you have existing pools named test1 or test2. :-P

Running that script on 0.21.1 or 0.21.2 successfully replicates the pool structure between test1 -> test2. On 0.22.0+ (checked out directly and built locally) it fails to create the sub-datasets. Only the root dataset is replicated.

I noticed in the diff that there were changes adding in autoCreation zfs properties for the destination. I tried setting those with znapzendzetup enable-dst-autocreation test1 nas, but it did nothing. This appears to be a separate bug. So, I MANUALLY set the property (based on what the code looked like it wanted) with zfs set org.znapzend:dst_nas_autocreation=on test1. That still doesn't work though...same error about the sub-destination not existing.

Below are the log outputs to versions 0.21.1, 0.21.2, 0.22.0, and the latest in git. It only works properly in versions 0.21.1 and 0.21.2 and appears to have broken in version 0.22.0.

Log Output

0.21.1 output: autocreate-results-0.21.1.log 0.21.2 output: autocreate-results-0.21.2.log 0.22.0 output: autocreate-results-0.22.0.log git latest output: autocreate-results-latest.log

Test Script

#!/bin/bash

TEST_FILE1=./test1.img
TEST_FILE1_FULL=$(realpath -s "${TEST_FILE1}")
TEST_FILE2=./test2.img
TEST_FILE2_FULL=$(realpath -s "${TEST_FILE2}")
TEST_SIZE=300M

POOL1=test1
SUB1=${POOL1}/sub1
SUB1_2=${SUB1}/sub2
SUB2=${POOL1}/sub2
POOL2=test2

set -x

# Pool creation...
zpool destroy ${POOL1}
rm -f "${TEST_FILE1_FULL}"
truncate -s ${TEST_SIZE} "${TEST_FILE1_FULL}"
zpool create ${POOL1} "${TEST_FILE1_FULL}"
zfs create ${SUB1}
zfs create ${SUB1_2}
zfs create ${SUB2}

zpool destroy ${POOL2}
rm -f "${TEST_FILE2_FULL}"
truncate -s ${TEST_SIZE} "${TEST_FILE2_FULL}"
zpool create ${POOL2} "${TEST_FILE2_FULL}"

# Znapzend setup...
znapzendzetup create --donotask --tsformat='%Y%m%dT%H%M%S' --recursive SRC '1h=>15min,2d=>1h,14d=>1d,3m=>1w' test1 DST:nas '1h=>15min,2d=>1h,14d=>1d,3m=>1w,1y=>1m' test2

# Try the new autoCreation property...
znapzendzetup enable-dst-autocreation ${POOL1} nas

# Output zfs properties...
zfs get all ${POOL1} | grep org.znapzend

# Manually set the property since it doesn't work from znapzendzetup.
zfs set org.znapzend:dst_nas_autocreation=on ${POOL1}

# Output zfs properties...
zfs get all ${POOL1} | grep org.znapzend

# Znapzend test...
znapzend --debug --logto=/dev/stdout --runonce=${POOL1} --autoCreation

# Output datasets...
zfs list -r ${POOL1} ${POOL2}

oetiker commented 1 month ago

@jimklimov does this ring a bell ?

oetiker commented 1 month ago

@stuckj thank you for the report. If you could add to our test suite we could make sure this remains fixed

jimklimov commented 1 month ago

No, sorry - did not misbehave like that for me :\ Maybe I tried with somewhat different use-cases. Will try to investigate in the coming days

stuckj commented 1 month ago

@stuckj thank you for the report. If you could add to our test suite we could make sure this remains fixed

Do you mean add it as a perl test? I can take a stab at that. If I'm understanding the current test design, it looks like all of the external tools are mocked out to store env vars based on how they're called. And then the tests just verify that the tools were called with the expected parameters, right? I can figure out how the working instances from 0.21.2 were working vs 0.22.0 and see if I can modify the tests to handle the case.

I'll try to find some evening time to work on it (always a challenge with work and kids, but I'll give it a go).

oetiker commented 1 month ago

@stuckj thank you for the report. If you could add to our test suite we could make sure this remains fixed

Do you mean add it as a perl test? I can take a stab at that. If I'm understanding the current test design, it looks like all of the external tools are mocked out to store env vars based on how they're called. And then the tests just verify that the tools were called with the expected parameters, right? I can figure out how the working instances from 0.21.2 were working vs 0.22.0 and see if I can modify the tests to handle the case.

yes exactly, the tests mock everyting so the mocks may require enhancement

jimklimov commented 1 month ago

So, dug a bit more into this. About the znapzendzetup enable-dst-autocreation ${POOL1} nas not doing much - gotta be a copy-paste error, fixed.

Regarding the main issue, in sendRecvCleanup() we check if (!$backupSet->{"dst_$key" . '_valid'}) {...} (e.g. target exists) and handle my $autoCreation there -- skipping the target or creating it.

This part checks the backupSet with the directly defined retention schedule, e.g. the test1 pool in this example.

Then we get into the "sending loop through all subdatasets", where we also juggle (another evaluation of) autoCreation as well as sendRaw. Maybe this mix of checks is what misifres, some and or not mis-placed?

At least, in this block:

            # Time to check if the target sub-dataset exists
            # at all (unless we would auto-create one anyway).
            if ((!$autoCreation || !$self->sendRaw) && !($self->zZfs->dataSetExists($dstDataSet))) {
                my $errmsg = "sub-destination '" . $dstDataSet
                    . "' does not exist or is offline; ignoring it for this round... Consider "
                    . ( $autoCreation || $self->sendRaw ? "" : "running znapzend --autoCreation or " )
                    . "disabling this dataset from znapzend handling.";

changing || to &&

            if ((!$autoCreation && !$self->sendRaw) && !($self->zZfs->dataSetExists($dstDataSet))) {

... does help the test script succeed:

...
[2024-06-02 19:52:24.06096] [130998] [debug] === getSnapshotProperties(): Stopping recursion after test1/sub1@20240602T195223, we have all the properties we needed
[2024-06-02 19:52:24.06110] [130998] [debug] not cleaning up source test1/sub1@20240602T195223 because it is needed by test2/sub1
[2024-06-02 19:52:24.06119] [130998] [debug] got nothing to clean in source test1/sub1
[2024-06-02 19:52:24.06127] [130998] [info] done with backupset test1 in 1 seconds
[2024-06-02 19:52:24.06217] [130958] [debug] send/receive worker for test1 done (130998)
znapzend (PID=130958) is done.
+ zfs list -r test1 test2
NAME              USED  AVAIL     REFER  MOUNTPOINT
test1             277K   160M       25K  /test1
test1/sub1         48K   160M       24K  /test1/sub1
test1/sub1/sub2    24K   160M       24K  /test1/sub1/sub2
test1/sub2         24K   160M       24K  /test1/sub2
test2             276K   160M       25K  /test2
test2/sub1         62K   160M       24K  /test2/sub1
test2/sub1/sub2    24K   160M       24K  /test2/sub1/sub2
test2/sub2         24K   160M       24K  /test2/sub2

that line probably originated as (or meant to be) a different boolean combo, but was mis-adapted:

        if ( !($autoCreation || $self->sendRaw) && !($self->zZfs->dataSetExists($dstDataSet))) {

Gotta sit closely and rectify when and how we want the impact from sendRaw wherever it is mentioned.

@oetiker : IIRC it is about encrypted datasets; do we want those always sent (effectively auto-created by virtue of being encrypted)?

jimklimov commented 1 month ago

Looking at history in CHANGES.old about sendRaw mentions, the ideas were:

Add sendRaw as a feature to enable sending encrypted datasets without decrypting to untrusted backup hosts
Do not use the -F receive option when sendRaw feature is used, zfs receive -F cannot be used to destroy an encrypted filesystem or overwrite an unencrypted one with an encrypted one
Do not automatically create destination when using feature sendRaw, receiving a raw encrypted stream is not supported on unencrypted datasets
Do not send snapshot to destination when sendRaw feature is used and destination dataset doesn't exist (causing zfs to automatically create the destination dataset), when the autoCreation option isn't used

The latter entry should have been about my recent efforts in the area; earlier ones come IIRC from different authors.

jimklimov commented 1 month ago

I guess in this clause, we have no impact from sendRaw being set or not. If not told to auto-create and destination does not exist - skip. Right?

Or maybe not. If autocreating, but also a dataset is sendRaw, still avoid creating it (per third changelog line above).

jimklimov commented 1 month ago

I wonder if 0f6639e5 in PR #491 was right in this clause:

https://github.com/oetiker/znapzend/pull/496/commits/0f6639e5933d76e450b6ea83a639716a9c5e57f4#diff-4c777e431e6c30149fe108e6414fcc5e7f750026583799b3861ff40555e2d07eR336

Not in direct scope of this issue at hand, and short on time to think more of it, but I am not sure now if that would do what the change description intends.

jimklimov commented 1 month ago

PR #657 posted.

And I tender my apologies about probably botching this up in the first place :\

jimklimov commented 1 month ago

By the way, @stuckj : would you mind posting a PR to add your repro script into https://github.com/oetiker/znapzend/tree/master/contrib - even "as is" it can be a helpful foundation for developer testing iterations. Did help me while fixing the issue :)

stuckj commented 1 month ago

@jimklimov, absolutely. Here's the PR: https://github.com/oetiker/znapzend/pull/658.

Sorry, I have not had a chance to integrate this into the test suite. We have end of school year madness with my kids right now along with sickness with the same kids. Looks like winter-time bugs are running into summer this year. :-P

oetiker / znapzend