oetiker / znapzend

zfs backup with remote capabilities and mbuffer integration.
www.znapzend.org
GNU General Public License v3.0

zfs dataset disappearing on backup when znapzend runs #435

Closed: maniac0s closed this issue 3 years ago

maniac0s commented 4 years ago

My storage server on Ubuntu 18.04.3 LTS has now been running for half a year, quite stably. It uses ZFS (raidz) for the storage partition.

The server acts more or less as a backup mirror for another server, which uses znapzend to send snapshots over frequently. (I have the feeling the dataset disappears when znapzend starts sending, but why it affects only the storage dataset and not the database one, I don't know.) Both servers should have the same ZFS pool setup, if I remember correctly, including snapshots and quotas. Both servers recently received a complete system update. The "main" server shows no issues with its datasets; they are not disappearing there.

The datasets on the backup have refreservation and refquota set to keep enough space for snapshots:

pool                          quota           none       default
pool                          refquota        none       default
pool                          refreservation  none       default
pool/db                       quota           none       default
pool/db                       refquota        100G       local
pool/db                       refreservation  100G       local
pool/storage                  quota           none       default
pool/storage                  refquota        2T         local
pool/storage                  refreservation  2T         local
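For reference, a listing like the one above comes from zfs get; the exact invocation below is a reconstruction:

root@backup:~# zfs get -r -t filesystem quota,refquota,refreservation pool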

After the latest update of the whole Ubuntu system, pool/storage keeps disappearing from the filesystem, but it is still listed in zfs.

root@server:~# mount | grep pool
pool on /pool type zfs (rw,xattr,noacl)
pool/db on /pool/db type zfs (rw,xattr,noacl)
root@server:~# zfs list
NAME           USED  AVAIL  REFER  MOUNTPOINT
pool          2.11T  1.40T   104K  /pool
pool/db        109G  90.6G  9.40G  /pool/db
pool/storage  2.01T  1.39T   629G  /pool/storage

A few times a day, my monitoring reports pool/storage returning to the filesystem and then disappearing again:

Report from 5:21, dataset came back:

Host:     backup
Alias:    backup
Address:  192.168.1.11
Service:  Filesystem /pool/storage
Event:    UNKN -> OK
Output:   OK - 52.9% used (1.06 of 2.00 TB), trend: +342.55 GB / 24 hours
Perfdata: /pool/storage=1109740.5;1677721.6;1887436.8;0;2097152 fs_size=2097152;;;; growth=4962844.988964;;;; trend=350775.45978;;;0;87381.333333

Disappears again at 6:07:

Host:     backup
Alias:    backup
Address:  192.168.1.11
Service:  Filesystem /pool/storage
Event:    OK -> UNKN
Output:   UNKN - filesystem not found
Perfdata: 

I don't see anything wrong in zpool status either:

root@server:~# zpool status
  pool: pool
 state: ONLINE
  scan: scrub repaired 0B in 4h43m with 0 errors on Tue Aug 27 15:01:28 2019
config:

                NAME        STATE     READ WRITE CKSUM
                pool        ONLINE       0     0     0
                  raidz1-0  ONLINE       0     0     0
                    sda     ONLINE       0     0     0
                    sdb     ONLINE       0     0     0

What's going on here? What can I do to investigate this behavior?

moetiker commented 4 years ago

Do you think this has anything to do with znapzend? The best way to find out is to run the commands that znapzend uses manually in a terminal and check whether you can reproduce the behavior. With znapzend -d --runonce=/pool/storage you can check the commands.

maniac0s commented 4 years ago

PS: Are you sure it's "znapzend -d --runonce=/pool/storage"? That gives "ERROR: filesystem /pool/storage does not exist". "--runonce=pool/storage" (without the leading /) works, however.
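Manually, that boils down to roughly the snapshot and send/receive steps below (the snapshot name here is made up for the test):

root@server:~# zfs snapshot -r pool/storage@manual-test
root@server:~# zfs send pool/storage@manual-test | ssh root@192.168.1.11 'zfs recv -F pool/storage'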

maniac0s commented 4 years ago

Well, that was quicker than expected. I created the pool and the datasets from scratch, and the storage dataset just vanished again:

root@backup:~# zfs list
NAME           USED  AVAIL  REFER  MOUNTPOINT
pool          2.10T  1.41T   104K  /pool
pool/db        100G   100G    96K  /pool/db
pool/storage     2T  2.00T    96K  /pool/storage
root@backup:~# mount | grep pool
pool on /pool type zfs (rw,xattr,noacl)
pool/db on /pool/db type zfs (rw,xattr,noacl)
pool/storage on /pool/storage type zfs (rw,xattr,noacl)

After running znapzend --runonce, it disappeared again.
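To confirm whether the dataset is merely unmounted rather than destroyed, something along these lines could be checked (a sketch, not output I captured):

root@backup:~# zfs get mounted,canmount,mountpoint pool/storage
root@backup:~# zfs mount pool/storage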

root@backup:~# mount | grep pool
pool on /pool type zfs (rw,xattr,noacl)
pool/db on /pool/db type zfs (rw,xattr,noacl)

zpool history shows nothing unusual either...

root@backup:~# zpool history pool
History for 'pool':
2019-09-16.12:21:50 zpool create pool raidz /dev/sda /dev/sdb
2019-09-16.12:22:43 zfs create pool/db
2019-09-16.12:22:48 zfs create pool/storage
2019-09-16.12:23:11 zfs set refreservation=100G pool/db
2019-09-16.12:23:11 zfs set refreservation=2.0T pool/storage
2019-09-16.12:23:11 zfs set refquota=2.0T pool/storage
2019-09-16.12:23:16 zfs set refquota=100G pool/db

Edit: Output of znapzend:

root@server:~# znapzend -d --runonce=pool/storage
[Mon Sep 16 12:25:18 2019] [info] znapzend (PID=15732) starting up ...
[Mon Sep 16 12:25:18 2019] [info] refreshing backup plans...
[Mon Sep 16 12:25:20 2019] [info] found a valid backup plan for pool/storage...
[Mon Sep 16 12:25:20 2019] [info] znapzend (PID=15732) initialized -- resuming normal operations.
[Mon Sep 16 12:25:20 2019] [debug] snapshot worker for pool/storage spawned (15884)
[Mon Sep 16 12:25:20 2019] [info] creating recursive snapshot on pool/storage
# zfs snapshot -r pool/storage@2019-09-16-122520
cannot create snapshot 'pool/storage@2019-09-16-122520': out of space
no snapshots were created
# zfs list -H -o name -t snapshot pool/storage@2019-09-16-122520
cannot open 'pool/storage@2019-09-16-122520': dataset does not exist
[Mon Sep 16 12:25:31 2019] [warn] taking snapshot on pool/storage failed: ERROR: cannot create snapshot pool/storage@2019-09-16-122520
[Mon Sep 16 12:25:31 2019] [debug] snapshot worker for pool/storage done (15884)
[Mon Sep 16 12:25:31 2019] [debug] send/receive worker for pool/storage spawned (17838)
[Mon Sep 16 12:25:31 2019] [info] starting work on backupSet pool/storage
# zfs list -H -r -o name -t filesystem,volume pool/storage
[Mon Sep 16 12:25:31 2019] [debug] sending snapshots from pool/storage to root@192.168.1.11:pool/storage
# zfs list -H -o name -t snapshot -s creation -d 1 pool/storage
# ssh -o batchMode=yes -o ConnectTimeout=30 root@192.168.1.11 zfs list -H -o name -t snapshot -s creation -d 1 pool/storage
# zfs send pool/storage@2019-08-29-210000|ssh -o batchMode=yes -o ConnectTimeout=30 'root@192.168.1.11' 'zfs recv -F pool/storage'

It says "out of space", but I don't know whether that happens before or after the dataset disappears.
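To break down where the space goes (e.g. how much the refreservation itself consumes), the space accounting could be inspected roughly like this:

root@server:~# zfs list -o space -r pool
root@server:~# zfs get -r used,available,referenced,refreservation pool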

maniac0s commented 4 years ago

I just recreated the whole pool once again and ran rsync on /pool/storage to the backup server instead of znapzend, and there have been no issues. So I guess the issue is either in znapzend or in the underlying zfs send mechanism?

jimklimov commented 4 years ago

I didn't notice this one before. In most of the posts and "screenshots" you mention that it vanishes, but seemingly in the context of being mounted or not. It does not help that your source and backup pools seem to be named the same and are used interchangeably in the messages above ;)

If the dataset remains in the ZFS dataset tree (per zfs list -r pool and the disk space usage, as seems to be the case in the original post for the original server), then it is being unmounted, not deleted. On the backup server this may be due to the zfs recv -u flag being used to not mount destination datasets after receiving. It may also be that ZFS has to roll the destination back to the latest snapshot to drop local changes (e.g. a small mess due to atime=on by default) before it can receive an incremental snapshot; perhaps ZoL unmounts the dataset to do so and never remounts it because of -u.
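For illustration, that -u path would look roughly like this (hypothetical snapshot names; -u tells zfs recv to leave the received filesystem unmounted):

zfs send -i pool/storage@old pool/storage@new | ssh root@192.168.1.11 'zfs recv -u -F pool/storage'
# the dataset then stays unmounted on the destination until someone runs:
zfs mount pool/storage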

Unmounting on the source server is not something I've seen on Solaris/illumos systems, even when a pool or (quota'd) dataset runs out of space and can neither take snapshots nor delete files from a "live" dataset (when older snapshots reference those files, so ZFS tries to reassign blocks and cannot write that). But maybe ZoL is different in this regard and can unmount things on error.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.