openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

ZFS Pool on top of ZVOL on Proxmox VE - EXTREME Overhead and Used Space reported by Proxmox VE Host #16075

Open luckylinux opened 4 months ago

luckylinux commented 4 months ago

System information

Host (Proxmox VE):

Type                  Version/Name
Distribution Name     Proxmox VE / Debian GNU/Linux
Distribution Version  Bookworm (12) with Proxmox VE Packages
Kernel Version        Linux pve16 6.5.13-3-pve #1 SMP PREEMPT_DYNAMIC PMX 6.5.13-3 (2024-03-20T10:45Z) x86_64 GNU/Linux
Architecture          x86_64 / amd64
OpenZFS Version       zfs-2.2.3-pve1 / zfs-kmod-2.2.3-pve1

Guest VM (Debian GNU/Linux KVM):

Type                  Version/Name
Distribution Name     Debian GNU/Linux
Distribution Version  Bookworm (12) with Bookworm-Backports for ZFS/Kernel/Podman
Kernel Version        Linux GUEST 6.6.13+bpo-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.6.13-1~bpo12+1 (2024-02-15) x86_64 GNU/Linux
Architecture          x86_64 / amd64
OpenZFS Version       zfs-2.2.3-1~bpo12+1 / zfs-kmod-2.2.3-1~bpo12+1

Describe the problem you're observing

I am facing some EXTREME overhead when creating a ZFS zpool in a Proxmox VE VM Guest on top of a ZVOL on the Host.

The ZFS Pool on the Host System also sits on top of LUKS / Cryptsetup full-disk encryption, although I do NOT think this is relevant for the issue described here (since the Host ZFS Pool sits directly on top of the DM-Crypt / LUKS device).
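
For clarity, the layering on the Host can be inspected roughly like this (a minimal sketch; the zd* device numbering and the LUKS mapping name are system-specific):

# On the Host: the ZVOL backing the guest data disk appears under /dev/zvol/
ls -l /dev/zvol/rpool/data/vm-103-disk-1

# Show the block device layering (LUKS mapping -> host pool vdev -> ZVOLs)
lsblk -o NAME,TYPE,SIZE,FSTYPE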

zfs list on the host:

NAME                       USED  AVAIL  REFER  MOUNTPOINT
...
rpool/data/vm-103-disk-0  8.27G   276G  8.23G  -
rpool/data/vm-103-disk-1  67.6G   276G  67.6G  -
...

Host Pool Properties (zpool get all rpool):

root@pve16:/tools_nfs# zpool get all rpool
NAME   PROPERTY                       VALUE                          SOURCE
rpool  size                           920G                           -
rpool  capacity                       66%                            -
rpool  altroot                        -                              default
rpool  health                         ONLINE                         -
rpool  guid                           4485595745105166796            -
rpool  version                        -                              default
rpool  bootfs                         -                              default
rpool  delegation                     on                             default
rpool  autoreplace                    off                            default
rpool  cachefile                      -                              default
rpool  failmode                       wait                           default
rpool  listsnapshots                  off                            default
rpool  autoexpand                     off                            default
rpool  dedupratio                     1.00x                          -
rpool  free                           307G                           -
rpool  allocated                      613G                           -
rpool  readonly                       off                            -
rpool  ashift                         12                             local
rpool  comment                        -                              default
rpool  expandsize                     -                              -
rpool  freeing                        0                              -
rpool  fragmentation                  68%                            -
rpool  leaked                         0                              -
rpool  multihost                      off                            default
rpool  checkpoint                     -                              -
rpool  load_guid                      3306564752570665611            -
rpool  autotrim                       off                            default
rpool  compatibility                  off                            default
rpool  bcloneused                     0                              -
rpool  bclonesaved                    0                              -
rpool  bcloneratio                    1.00x                          -
rpool  feature@async_destroy          enabled                        local
rpool  feature@empty_bpobj            active                         local
rpool  feature@lz4_compress           active                         local
rpool  feature@multi_vdev_crash_dump  enabled                        local
rpool  feature@spacemap_histogram     active                         local
rpool  feature@enabled_txg            active                         local
rpool  feature@hole_birth             active                         local
rpool  feature@extensible_dataset     active                         local
rpool  feature@embedded_data          active                         local
rpool  feature@bookmarks              active                         local
rpool  feature@filesystem_limits      enabled                        local
rpool  feature@large_blocks           enabled                        local
rpool  feature@large_dnode            enabled                        local
rpool  feature@sha512                 enabled                        local
rpool  feature@skein                  enabled                        local
rpool  feature@edonr                  enabled                        local
rpool  feature@userobj_accounting     active                         local
rpool  feature@encryption             enabled                        local
rpool  feature@project_quota          active                         local
rpool  feature@device_removal         enabled                        local
rpool  feature@obsolete_counts        enabled                        local
rpool  feature@zpool_checkpoint       enabled                        local
rpool  feature@spacemap_v2            active                         local
rpool  feature@allocation_classes     enabled                        local
rpool  feature@resilver_defer         enabled                        local
rpool  feature@bookmark_v2            active                         local
rpool  feature@redaction_bookmarks    enabled                        local
rpool  feature@redacted_datasets      enabled                        local
rpool  feature@bookmark_written       active                         local
rpool  feature@log_spacemap           active                         local
rpool  feature@livelist               enabled                        local
rpool  feature@device_rebuild         enabled                        local
rpool  feature@zstd_compress          enabled                        local
rpool  feature@draid                  enabled                        local
rpool  feature@zilsaxattr             disabled                       local
rpool  feature@head_errlog            disabled                       local
rpool  feature@blake3                 disabled                       local
rpool  feature@block_cloning          disabled                       local
rpool  feature@vdev_zaps_v2           disabled                       local

Host ZVOL Properties (zfs get all rpool/data/vm-103-disk-1):

NAME                      PROPERTY              VALUE                     SOURCE
rpool/data/vm-103-disk-1  type                  volume                    -
rpool/data/vm-103-disk-1  creation              Tue Apr  9 22:26 2024     -
rpool/data/vm-103-disk-1  used                  67.6G                     -
rpool/data/vm-103-disk-1  available             276G                      -
rpool/data/vm-103-disk-1  referenced            67.6G                     -
rpool/data/vm-103-disk-1  compressratio         1.84x                     -
rpool/data/vm-103-disk-1  reservation           none                      default
rpool/data/vm-103-disk-1  volsize               512G                      local
rpool/data/vm-103-disk-1  volblocksize          16K                       default
rpool/data/vm-103-disk-1  checksum              on                        default
rpool/data/vm-103-disk-1  compression           lz4                       inherited from rpool
rpool/data/vm-103-disk-1  readonly              off                       default
rpool/data/vm-103-disk-1  createtxg             2480671                   -
rpool/data/vm-103-disk-1  copies                1                         default
rpool/data/vm-103-disk-1  refreservation        none                      default
rpool/data/vm-103-disk-1  guid                  967947631676329448        -
rpool/data/vm-103-disk-1  primarycache          all                       default
rpool/data/vm-103-disk-1  secondarycache        all                       default
rpool/data/vm-103-disk-1  usedbysnapshots       15.9M                     -
rpool/data/vm-103-disk-1  usedbydataset         67.6G                     -
rpool/data/vm-103-disk-1  usedbychildren        0B                        -
rpool/data/vm-103-disk-1  usedbyrefreservation  0B                        -
rpool/data/vm-103-disk-1  logbias               latency                   default
rpool/data/vm-103-disk-1  objsetid              159445                    -
rpool/data/vm-103-disk-1  dedup                 off                       default
rpool/data/vm-103-disk-1  mlslabel              none                      default
rpool/data/vm-103-disk-1  sync                  standard                  default
rpool/data/vm-103-disk-1  refcompressratio      1.84x                     -
rpool/data/vm-103-disk-1  written               2.11M                     -
rpool/data/vm-103-disk-1  logicalused           124G                      -
rpool/data/vm-103-disk-1  logicalreferenced     124G                      -
rpool/data/vm-103-disk-1  volmode               default                   default
rpool/data/vm-103-disk-1  snapshot_limit        none                      default
rpool/data/vm-103-disk-1  snapshot_count        none                      default
rpool/data/vm-103-disk-1  snapdev               hidden                    default
rpool/data/vm-103-disk-1  context               none                      default
rpool/data/vm-103-disk-1  fscontext             none                      default
rpool/data/vm-103-disk-1  defcontext            none                      default
rpool/data/vm-103-disk-1  rootcontext           none                      default
rpool/data/vm-103-disk-1  redundant_metadata    all                       default
rpool/data/vm-103-disk-1  encryption            off                       default
rpool/data/vm-103-disk-1  keylocation           none                      default
rpool/data/vm-103-disk-1  keyformat             none                      default
rpool/data/vm-103-disk-1  pbkdf2iters           0                         default
rpool/data/vm-103-disk-1  snapshots_changed     Tue Apr  9 23:30:02 2024  -

df -ah for the Guest VM / filesystem, which is backed by vm-103-disk-0 (for comparison purposes):

Filesystem                   Size  Used Avail Use% Mounted on
/dev/sda1                     30G  5.7G   23G  21% /

So 5.7G is used on the Guest and 8.27G on the Host: an overhead of roughly 45% ((8.27G / 5.7G - 1) * 100%).
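
For reference, the host-side accounting for that disk can be compared against the guest's own view with something like the following (a sketch; the property list is just what is relevant here):

# On the Host: allocated vs. logical space for the guest's root disk
zfs get used,referenced,logicalused,logicalreferenced,compressratio,volsize rpool/data/vm-103-disk-0

# In the Guest: what ext4 itself reports as used
df -h /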

zfs list on the Guest VM for the Container Data Storage pool (Podman), which is backed by vm-103-disk-1:

NAME                        USED  AVAIL  REFER  MOUNTPOINT
zdata                       793M   491G    96K  /zdata
zdata/PODMAN                726M   491G   136K  /zdata/PODMAN

Here the overhead is roughly 8350% ((67.6G / 0.8G - 1) * 100%, with the Guest's ~793M rounded to 0.8G) !!!

Guest VM Pool Properties (zpool get all zdata):

NAME   PROPERTY                       VALUE                          SOURCE
zdata  size                           508G                           -
zdata  capacity                       0%                             -
zdata  altroot                        -                              default
zdata  health                         ONLINE                         -
zdata  guid                           11398056056706436130           -
zdata  version                        -                              default
zdata  bootfs                         -                              default
zdata  delegation                     on                             default
zdata  autoreplace                    off                            default
zdata  cachefile                      -                              default
zdata  failmode                       wait                           default
zdata  listsnapshots                  off                            default
zdata  autoexpand                     off                            default
zdata  dedupratio                     1.00x                          -
zdata  free                           507G                           -
zdata  allocated                      795M                           -
zdata  readonly                       off                            -
zdata  ashift                         12                             local
zdata  comment                        -                              default
zdata  expandsize                     -                              -
zdata  freeing                        0                              -
zdata  fragmentation                  0%                             -
zdata  leaked                         0                              -
zdata  multihost                      off                            default
zdata  checkpoint                     -                              -
zdata  load_guid                      6683170848207782254            -
zdata  autotrim                       off                            default
zdata  compatibility                  off                            default
zdata  bcloneused                     0                              -
zdata  bclonesaved                    0                              -
zdata  bcloneratio                    1.00x                          -
zdata  feature@async_destroy          enabled                        local
zdata  feature@empty_bpobj            active                         local
zdata  feature@lz4_compress           active                         local
zdata  feature@multi_vdev_crash_dump  enabled                        local
zdata  feature@spacemap_histogram     active                         local
zdata  feature@enabled_txg            active                         local
zdata  feature@hole_birth             active                         local
zdata  feature@extensible_dataset     active                         local
zdata  feature@embedded_data          active                         local
zdata  feature@bookmarks              enabled                        local
zdata  feature@filesystem_limits      enabled                        local
zdata  feature@large_blocks           enabled                        local
zdata  feature@large_dnode            enabled                        local
zdata  feature@sha512                 enabled                        local
zdata  feature@skein                  enabled                        local
zdata  feature@edonr                  enabled                        local
zdata  feature@userobj_accounting     active                         local
zdata  feature@encryption             enabled                        local
zdata  feature@project_quota          active                         local
zdata  feature@device_removal         enabled                        local
zdata  feature@obsolete_counts        enabled                        local
zdata  feature@zpool_checkpoint       enabled                        local
zdata  feature@spacemap_v2            active                         local
zdata  feature@allocation_classes     enabled                        local
zdata  feature@resilver_defer         enabled                        local
zdata  feature@bookmark_v2            enabled                        local
zdata  feature@redaction_bookmarks    enabled                        local
zdata  feature@redacted_datasets      enabled                        local
zdata  feature@bookmark_written       enabled                        local
zdata  feature@log_spacemap           active                         local
zdata  feature@livelist               enabled                        local
zdata  feature@device_rebuild         enabled                        local
zdata  feature@zstd_compress          enabled                        local
zdata  feature@draid                  enabled                        local
zdata  feature@zilsaxattr             disabled                       local
zdata  feature@head_errlog            disabled                       local
zdata  feature@blake3                 disabled                       local
zdata  feature@block_cloning          disabled                       local
zdata  feature@vdev_zaps_v2           disabled                       local

Guest ZFS Properties (zfs get all zdata):

NAME   PROPERTY              VALUE                     SOURCE
zdata  type                  filesystem                -
zdata  creation              Sat Dec 30 21:26 2023     -
zdata  used                  794M                      -
zdata  available             491G                      -
zdata  referenced            96K                       -
zdata  compressratio         1.42x                     -
zdata  mounted               no                        -
zdata  quota                 none                      default
zdata  reservation           none                      default
zdata  recordsize            128K                      default
zdata  mountpoint            /zdata                    local
zdata  sharenfs              off                       default
zdata  checksum              on                        default
zdata  compression           off                       local
zdata  atime                 off                       local
zdata  devices               on                        default
zdata  exec                  on                        default
zdata  setuid                on                        default
zdata  readonly              off                       default
zdata  zoned                 off                       default
zdata  snapdir               hidden                    default
zdata  aclmode               discard                   default
zdata  aclinherit            restricted                default
zdata  createtxg             1                         -
zdata  canmount              off                       local
zdata  xattr                 on                        default
zdata  copies                1                         default
zdata  version               5                         -
zdata  utf8only              off                       -
zdata  normalization         none                      -
zdata  casesensitivity       sensitive                 -
zdata  vscan                 off                       default
zdata  nbmand                off                       default
zdata  sharesmb              off                       default
zdata  refquota              none                      default
zdata  refreservation        none                      default
zdata  guid                  1402683579569969850       -
zdata  primarycache          all                       default
zdata  secondarycache        all                       default
zdata  usedbysnapshots       0B                        -
zdata  usedbydataset         96K                       -
zdata  usedbychildren        794M                      -
zdata  usedbyrefreservation  0B                        -
zdata  logbias               latency                   default
zdata  objsetid              54                        -
zdata  dedup                 off                       default
zdata  mlslabel              none                      default
zdata  sync                  standard                  default
zdata  dnodesize             legacy                    default
zdata  refcompressratio      1.00x                     -
zdata  written               0                         -
zdata  logicalused           1.00G                     -
zdata  logicalreferenced     42K                       -
zdata  volmode               default                   default
zdata  filesystem_limit      none                      default
zdata  snapshot_limit        none                      default
zdata  filesystem_count      none                      default
zdata  snapshot_count        none                      default
zdata  snapdev               hidden                    default
zdata  acltype               off                       default
zdata  context               none                      default
zdata  fscontext             none                      default
zdata  defcontext            none                      default
zdata  rootcontext           none                      default
zdata  relatime              on                        default
zdata  redundant_metadata    all                       default
zdata  overlay               on                        default
zdata  encryption            off                       default
zdata  keylocation           none                      default
zdata  keyformat             none                      default
zdata  pbkdf2iters           0                         default
zdata  special_small_blocks  0                         default
zdata  snapshots_changed     Tue Apr  9 23:30:01 2024  -

Note: it is possible that this issue is caused by block size / volblocksize or similar parameter, since Podman / Docker containers could generate lots of small files.

root@GUEST:/# find /home/podman/ -type f | wc -l
14475

That does not sound like much though ... On the GUEST, recordsize is set to 128K; that's probably a bit high, isn't it ?

Regardless, even if every file occupied a full 128K record, the number of files alone would only account for: 14475 x 128K = 1,852,800K ≈ 1.8G

So that probably causes some overhead inside the guest, but nowhere near the level of the overhead seen between guest and host ...
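
Note that recordsize is only an upper bound: files smaller than it are stored in a single, smaller block, so the 14475 x 128K figure is a worst case. If a block-size mismatch were suspected anyway, the relevant properties could be checked (and, for future writes only, tuned) roughly like this; a sketch, not a recommendation:

# On the Host: block size of the ZVOL backing zdata
zfs get volblocksize rpool/data/vm-103-disk-1

# In the Guest: recordsize in effect for the Podman dataset
zfs get recordsize zdata/PODMAN

# recordsize only applies to newly written files; e.g. matching the 16K volblocksize
zfs set recordsize=16K zdata/PODMAN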

Describe how to reproduce the problem

I don't think a VM is necessary to replicate this.

It is probably sufficient to create a ZFS pool on top of a ZVOL on a single system (the Host), as sketched below.

I disabled compression at the Guest level since it would only cause additional CPU load for no apparent benefit. Therefore compression should NOT be the cause of this huge overhead.
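
A single-host reproduction could look roughly like the following sketch (testvol and testpool are hypothetical names; the point is that space freed inside the inner pool is not released on the outer pool until a trim is issued):

# Create a sparse ZVOL on the existing host pool (names are illustrative)
zfs create -s -V 50G -b 16K rpool/data/testvol

# Build a ZFS pool directly on top of that ZVOL
zpool create testpool /dev/zvol/rpool/data/testvol

# Write some (incompressible) data into the inner pool, then delete it
dd if=/dev/urandom of=/testpool/junk bs=1M count=4096
rm /testpool/junk

# The inner pool reports the space as free again ...
zfs list testpool

# ... but the host still accounts for it under the ZVOL until the inner pool is trimmed
zfs get used,referenced rpool/data/testvol
zpool trim testpool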

Notes

The idea of having a ZFS Pool on top of the ZVOL is to have better control over ZFS Snapshots. In this case, once everything else is configured correctly, the snapshot & backup plan for rpool/data/vm-103-disk-1 could be handled by the Guest, as opposed to the Host, which handles it for many other VMs.

This avoids backing up non-useful data (such as Container Images or Container Storage) and only backs up useful / critical data (Container Configuration, Secrets, Data, Certificates, Volumes, ...), thus saving a lot of disk space on the Backup Server.

Include any warning/errors/backtraces from the system logs

rincebrain commented 4 months ago

"TRIM"

luckylinux commented 4 months ago

It should be disabled on the host because it's ZFS on top of LUKS. That's the default behavior from what I understood at least.

However systemctl status fstrim.timer reports on the Host:

● fstrim.timer - Discard unused blocks once a week
     Loaded: loaded (/lib/systemd/system/fstrim.timer; enabled; preset: enabled)
     Active: active (waiting) since Fri 2024-04-05 10:29:41 CEST; 4 days ago
    Trigger: Mon 2024-04-15 01:37:09 CEST; 5 days left
   Triggers: ● fstrim.service
       Docs: man:fstrim

Apr 05 10:29:41 pve16 systemd[1]: Started fstrim.timer - Discard unused blocks once a week.

On the guest you might be right though.

I always have SSD Emulation + Discard + IO thread enabled on all of my VMs.

But zpool get autotrim returns off both for the Host and the Guest VM.

systemctl status fstrim.timer reports on the Guest VM:

○ fstrim.timer - Discard unused blocks once a week
     Loaded: loaded (/lib/systemd/system/fstrim.timer; disabled; preset: enabled)
     Active: inactive (dead)
    Trigger: n/a
   Triggers: ● fstrim.service
       Docs: man:fstrim

Any other command to check ?

rincebrain commented 4 months ago

You misunderstood.

fstrim doesn't do anything with ZFS, and absent autotrim, it's not going to issue such requests without an explicit zpool trim in the guest, leaving the space that was freed in the guest still marked in use on the host.
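
For illustration, one way to confirm that the virtual disk actually advertises discard support inside the guest (so that an explicit zpool trim can pass the frees down to the host ZVOL) is to check the discard limits of the underlying device; the /dev/sdb name here is an assumption for the data disk:

# In the Guest: non-zero DISC-GRAN / DISC-MAX means the device accepts discards
lsblk -D /dev/sdb

# Then trim the guest pool manually and watch per-vdev progress
zpool trim zdata
zpool status -t zdata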

luckylinux commented 4 months ago

And is that ZFS-specific ? I mean, the ext4 partition for / on top of the ZVOL (like on many other containers I have) does not really have this problem.

Somewhere I think I read that zpool trim is kinda dangerous concerning data loss. Isn't it ?

rincebrain commented 4 months ago

Yes, the command zpool trim is ZFS specific.

There was an uncommon race with data mangling using any kind of TRIM that was fixed in 2.2 and 2.1.14.

Inside a VM, I wouldn't suggest using any FS that you don't want to use TRIM with, if you're worried about space that is freed in the guest not being released on the host.

luckylinux commented 4 months ago

What do you mean exactly by your latest statement ? That I should run zpool trim or that I should not be running ZFS on top of ZVOL ?

rincebrain commented 4 months ago
luckylinux commented 4 months ago

Thanks.

Hopefully there won't be any regression of that bug :D.

However nothing seems to be happening.

I issued zpool trim zdata on the Guest, and zpool status -t on the Guest reports:

  pool: zdata
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
    The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
    the pool may no longer be accessible by software that does not support
    the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 00:00:07 with 0 errors on Sun Mar 10 00:24:09 2024
config:

    NAME        STATE     READ WRITE CKSUM
    zdata       ONLINE       0     0     0
      PODMAN    ONLINE       0     0     0  (100% trimmed, completed at Wed 10 Apr 2024 12:52:57 AM CEST)

errors: No known data errors

zfs list | grep "vm-103-disk-1" on the Host:

NAME                       USED  AVAIL  REFER  MOUNTPOINT
rpool/data/vm-103-disk-1  67.8G   274G   989M  -

Granted it could be running in the background. But for now there is absolutely no change.

rincebrain commented 4 months ago

You may note that there are 3 columns there, and "referenced" changed pretty substantially.

luckylinux commented 4 months ago

True. So I just need to destroy the old snapshots of that dataset on the Host.

luckylinux commented 4 months ago

Yep.

Now zfs list | grep "vm-103-disk-1" yields:

NAME                       USED  AVAIL  REFER  MOUNTPOINT
rpool/data/vm-103-disk-1   989M   312G   989M  -

Maybe the root filesystem of the Guest VM (ext4) has some form of automatic trim (e.g. the discard mount option) enabled by default then ?

That could explain the behavior ...

rincebrain commented 4 months ago

For reference, you could have seen if that was going to happen before doing it with zfs destroy -nv [list of snapshots] or by looking at the 4 different "usedby" properties which sum to USED.

zpool trim is for ZFS. The root FS is ext4, so fstrim will do as you expect.
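
For example (the snapshot names below are placeholders; -n makes the destroy a dry run and -v prints the space that would be reclaimed):

# How USED breaks down for the ZVOL on the Host
zfs get usedbysnapshots,usedbydataset,usedbychildren,usedbyrefreservation rpool/data/vm-103-disk-1

# Dry-run destroy of a snapshot range: shows what would be freed without deleting anything
zfs destroy -nv rpool/data/vm-103-disk-1@oldest-snap%newest-snap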

luckylinux commented 4 months ago

Or, if it's safe enough, should I just enable ZFS autotrim on the Guest VM ?

zpool set autotrim=on zdata

rincebrain commented 4 months ago

The bug wasn't specific to automatic or manual trim, afair, so autotrim should be no less safe than manual trim.
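
If autotrim stays off, a periodic manual trim achieves much the same result; a minimal cron sketch (schedule and path are assumptions, not something shipped by OpenZFS):

# /etc/cron.d/zdata-trim : weekly trim of the guest pool
0 3 * * 0  root  /usr/sbin/zpool trim zdata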

kwinz commented 4 months ago

There was an uncommon race with data mangling using any kind of TRIM that was fixed in 2.2 and 2.1.14.

There seems to be a new issue in 2.2.3: #16056

Just so you're aware

rincebrain commented 4 months ago

Kind of.

Note that #16056, from my quick reading, seems to be the result of hardware that lied and gave an invalid value for how big a TRIM can be, combined with a failure in the error-case handling for that, since that should not really ever happen. #16070 fixes the latter, but if I'm right about the former, it's not entirely clear what the right thing to do about it is.