openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Mismatch between volsize and used #3255

Closed FransUrbo closed 7 years ago

FransUrbo commented 9 years ago

I accidentally noticed this a couple of minutes ago. I created the zvol as a sparse volume with volsize=15G, but the USED entry says it's 67.2G, which should be impossible...
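For reference, a sparse zvol with the properties shown below would have been created with something like this (a sketch - the exact flags are an assumption, inferred from reservation=none and refreservation=none in the output):

# -s requests a sparse (thin-provisioned) volume, so no refreservation is set
zfs create -s -V 15G -o volblocksize=512 -o compression=lz4 share/VirtualMachines/Ubuntu/Trusty/Server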

celia# zfs list -tall -r share/VirtualMachines/Ubuntu/Trusty/Server
NAME                                                     USED  AVAIL  REFER  MOUNTPOINT
share/VirtualMachines/Ubuntu/Trusty/Server              67.2G   255G  61.9G  -
share/VirtualMachines/Ubuntu/Trusty/Server@baseinstall  5.28G      -  16.0G  -
celia# zfs list -oname,used share/VirtualMachines/Ubuntu/Trusty/Server
NAME                                         USED
share/VirtualMachines/Ubuntu/Trusty/Server  67.2G
celia# zfs get all share/VirtualMachines/Ubuntu/Trusty/Server
NAME                                        PROPERTY                      VALUE                                                                               
share/VirtualMachines/Ubuntu/Trusty/Server  type                          volume                                                                              
share/VirtualMachines/Ubuntu/Trusty/Server  creation                      Tue Mar 31 16:29 2015                                                               
share/VirtualMachines/Ubuntu/Trusty/Server  used                          67.2G                                                                               
share/VirtualMachines/Ubuntu/Trusty/Server  available                     255G                                                                                
share/VirtualMachines/Ubuntu/Trusty/Server  referenced                    61.9G                                                                               
share/VirtualMachines/Ubuntu/Trusty/Server  compressratio                 1.30x                                                                               
share/VirtualMachines/Ubuntu/Trusty/Server  reservation                   none                                                                                
share/VirtualMachines/Ubuntu/Trusty/Server  volsize                       15G                                                                                 
share/VirtualMachines/Ubuntu/Trusty/Server  volblocksize                  512                                                                                 
share/VirtualMachines/Ubuntu/Trusty/Server  checksum                      on                                                                                  
share/VirtualMachines/Ubuntu/Trusty/Server  compression                   lz4                                                                                 
share/VirtualMachines/Ubuntu/Trusty/Server  readonly                      off                                                                                 
share/VirtualMachines/Ubuntu/Trusty/Server  copies                        1                                                                                   
share/VirtualMachines/Ubuntu/Trusty/Server  refreservation                none                                                                                
share/VirtualMachines/Ubuntu/Trusty/Server  primarycache                  none                                                                                
share/VirtualMachines/Ubuntu/Trusty/Server  secondarycache                none                                                                                
share/VirtualMachines/Ubuntu/Trusty/Server  usedbysnapshots               5.28G                                                                               
share/VirtualMachines/Ubuntu/Trusty/Server  usedbydataset                 61.9G                                                                               
share/VirtualMachines/Ubuntu/Trusty/Server  usedbychildren                0                                                                                   
share/VirtualMachines/Ubuntu/Trusty/Server  usedbyrefreservation          0                                                                                   
share/VirtualMachines/Ubuntu/Trusty/Server  logbias                       latency                                                                             
share/VirtualMachines/Ubuntu/Trusty/Server  dedup                         off                                                                                 
share/VirtualMachines/Ubuntu/Trusty/Server  mlslabel                      none                                                                                
share/VirtualMachines/Ubuntu/Trusty/Server  sync                          standard                                                                            
share/VirtualMachines/Ubuntu/Trusty/Server  refcompressratio              1.32x                                                                               
share/VirtualMachines/Ubuntu/Trusty/Server  written                       51.2G                                                                               
share/VirtualMachines/Ubuntu/Trusty/Server  logicalused                   8.59G                                                                               
share/VirtualMachines/Ubuntu/Trusty/Server  logicalreferenced             8.05G                                                                               
share/VirtualMachines/Ubuntu/Trusty/Server  snapdev                       hidden                                                                              
share/VirtualMachines/Ubuntu/Trusty/Server  context                       none                                                                                
share/VirtualMachines/Ubuntu/Trusty/Server  fscontext                     none                                                                                
share/VirtualMachines/Ubuntu/Trusty/Server  defcontext                    none                                                                                
share/VirtualMachines/Ubuntu/Trusty/Server  rootcontext                   none                                                                                
share/VirtualMachines/Ubuntu/Trusty/Server  redundant_metadata            all                                                                                 
share/VirtualMachines/Ubuntu/Trusty/Server  shareiscsi                    initiator=iqn.1993-08.org.debian:01:e19b61b8377;iqn.2009-08.com.sun.virtualbox.initiator:01:192.168.69.3,blocksize=512  local
share/VirtualMachines/Ubuntu/Trusty/Server  com.sun:auto-snapshot:weekly  false                                                                                                                   inherited from share/VirtualMachines/Ubuntu
share/VirtualMachines/Ubuntu/Trusty/Server  com.sun:auto-snapshot:daily   false                                                                                                                   inherited from share/VirtualMachines/Ubuntu

on the host:

UbuntuTrustyServer:/# df -h -t ext4
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        15G  2.3G   12G  17% /
/dev/sdb1        50G  840M   46G   2% /usr/src

So the host thinks it only uses 2.3G, but ZFS thinks it's using 67G!

This is a ZVOL, shared via iSCSI (SCST) to another host, where it's connected to VBox which then 'shares' it to the host as a SATA device…

Writing zeros to a file on the host makes USED drop, and with a little luck it will be OK once I've filled the disk (on the host). But there seems to be a bug/issue here somewhere…
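The zero-fill trick, for reference (a sketch - the file name is hypothetical): with compression=lz4 on the zvol, all-zero blocks written inside the VM are stored as holes, so USED shrinks once the old data has been overwritten:

# run inside the VM until the filesystem is full, then remove the file
dd if=/dev/zero of=/zerofill bs=1M
rm /zerofill && sync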

This is kernel 3.18.0-rc3 with spl/zfs GIT master (with a bunch of other bits and pieces added - https://github.com/FransUrbo/zfs/blob/FAVORITES/README.turbo)

FransUrbo commented 9 years ago

That doesn't seem to work. After a while, the server just dies… I don't know if it's the "stress" (my SAS controllers haven't been rock solid - the driver isn't really any good).

I have to try to copy the data onto another ZVOL and destroy this one...

FransUrbo commented 9 years ago

Looking through a bunch of my ZVOLs, I noticed that there are several that show the same strange behavior:

Negotia, which is NOT a VM, has /dev/sdc via iSCSI:

Negotia:~# df -h -text4
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1       111G   15G   91G  14% /Machines
/dev/sdc1       504G  8.7G  470G   2% /Machines/Machines

However, the ZVOL shows:

celia# zfs list share/VirtualMachines/Negotia/Machines
NAME                                     USED  AVAIL  REFER  MOUNTPOINT
share/VirtualMachines/Negotia/Machines  83.5G   266G  83.5G  -

Getting another list:

celia# zfs list -oname,volsize,used -tvolume -r share/VirtualMachines | egrep 'G$' | sort -n -k3 | tail
share/VirtualMachines/Ubuntu/Utopic/Server                     15G  16.3G
share/VirtualMachines/Windows/7                                15G  16.4G
share/VirtualMachines/Ubuntu/Vivid/Server                      15G  17.0G
share/VirtualMachines/Ubuntu/Saucy/Server                      15G  17.9G
share/VirtualMachines/Windows/Vista                            15G  19.3G
share/VirtualMachines/Debian/Lenny/32_Source                   25G  23.1G
share/VirtualMachines/Debian/Sid/Src                           25G  23.3G
share/VirtualMachines/Debian/Lenny/32_Source2                  40G  30.5G
share/VirtualMachines/Ubuntu/Trusty/Server                     15G  51.7G
share/VirtualMachines/Negotia/Machines                        512G  83.5G

All of these (except the last one) are base installs only. Nothing fancy… The majority of them have gone over their volsize.

FransUrbo commented 9 years ago

All my ZVOLs use volblocksize=512, except share/VirtualMachines/Negotia/Machines, which uses 8K.

But going back to the original ZVOL (share/VirtualMachines/Ubuntu/Trusty/Server): I created a new ZVOL (with the extension .new), attached it to the VM, partitioned it (GPT), and put ext3 on it. I then mounted it and copied everything over (using find -mount | cpio -vpmd --preserve-modification-time).
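Spelled out, that copy would be roughly this (a sketch - the device name and mount point are assumptions based on the df output below):

# mount the new zvol's partition, then pass-copy the root filesystem onto it
mount /dev/sdb1 /mnt
cd / && find . -mount -depth | cpio -pvmd /mnt

It is now STILL bigger than what the VM thinks: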

UbuntuTrustyServer:/usr/src# df -h -t ext4 -t ext3
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        15G  2.1G   12G  15% /
/dev/sdc1        50G  838M   46G   2% /usr/src
/dev/sdb1        14G  2.3G   11G  18% /mnt

(sda1 is the 'original', 'weird' ZVOL and sdb is the new).

Notice that those mismatch too - sda1 and sdb1 should be identical now!

and the zvol:

celia# zfs list -oname,used share/VirtualMachines/Ubuntu/Trusty/Server share/VirtualMachines/Ubuntu/Trusty/Server.new
NAME                                             USED
share/VirtualMachines/Ubuntu/Trusty/Server      51.9G
share/VirtualMachines/Ubuntu/Trusty/Server.new  4.11G

Almost twice what the VM thinks...

FransUrbo commented 9 years ago

I decided to destroy share/VirtualMachines/Ubuntu recursively and start over… It seems like the simplest thing to do at this point.

GregorKopka commented 9 years ago

I suspect volblocksize=512 leads to metadata inflation, but a factor of 4 for metadata (in case the zvol has been completely written with non-zero data once) seems a bit excessive to me.

behlendorf commented 9 years ago

My guess is that this is caused by the large number of indirect blocks required to manage a 512b ZVOL. But I'd need to dig into it to say more for certain.
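A back-of-the-envelope illustration of that guess (my arithmetic, not from the comment): every block of a zvol is referenced by a 128-byte block pointer held in indirect blocks, so at volblocksize=512 the pointers alone come to a quarter of the data size:

15 GiB / 512 B per block   =  31,457,280 block pointers
31,457,280 * 128 B         =  3.75 GiB of level-1 indirect metadata

On top of that, redundant_metadata=all (as set here) writes every indirect block in two copies, before counting the higher indirection levels - though metadata compression claws some of this back.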

FransUrbo commented 9 years ago

My guess is that this is caused by the large number of indirect blocks req

"indirect blocks req"?

behlendorf commented 9 years ago

Sorry about that, I accidentally hit commit mid-sentence. I've updated the comment.

Hydrar commented 9 years ago

I just hit this on current HEAD, without additional patches, using the 9999 Gentoo ebuild and a 4k volblocksize. This is how it looks after overwriting it with 16GB of 50% random / 50% zero data chunks using fio (a sketch of such a job follows the output):

NAME                                                      PROPERTY              VALUE                  SOURCE
stuff_array/servers/iscsi_images/hydrar-desktop-swap-new  type                  volume                 -
stuff_array/servers/iscsi_images/hydrar-desktop-swap-new  creation              fre mar 13  1:38 2015  -
stuff_array/servers/iscsi_images/hydrar-desktop-swap-new  used                  26.0G                  -
stuff_array/servers/iscsi_images/hydrar-desktop-swap-new  available             2.41T                  -
stuff_array/servers/iscsi_images/hydrar-desktop-swap-new  referenced            26.0G                  -
stuff_array/servers/iscsi_images/hydrar-desktop-swap-new  compressratio         2.65x                  -
stuff_array/servers/iscsi_images/hydrar-desktop-swap-new  reservation           none                   default
stuff_array/servers/iscsi_images/hydrar-desktop-swap-new  volsize               16G                    local
stuff_array/servers/iscsi_images/hydrar-desktop-swap-new  volblocksize          4K                     -
stuff_array/servers/iscsi_images/hydrar-desktop-swap-new  checksum              on                     default
stuff_array/servers/iscsi_images/hydrar-desktop-swap-new  compression           lz4                    local
stuff_array/servers/iscsi_images/hydrar-desktop-swap-new  readonly              off                    default
stuff_array/servers/iscsi_images/hydrar-desktop-swap-new  copies                1                      default
stuff_array/servers/iscsi_images/hydrar-desktop-swap-new  refreservation        none                   default
stuff_array/servers/iscsi_images/hydrar-desktop-swap-new  primarycache          all                    default
stuff_array/servers/iscsi_images/hydrar-desktop-swap-new  secondarycache        all                    default
stuff_array/servers/iscsi_images/hydrar-desktop-swap-new  usedbysnapshots       0                      -
stuff_array/servers/iscsi_images/hydrar-desktop-swap-new  usedbydataset         26.0G                  -
stuff_array/servers/iscsi_images/hydrar-desktop-swap-new  usedbychildren        0                      -
stuff_array/servers/iscsi_images/hydrar-desktop-swap-new  usedbyrefreservation  0                      -
stuff_array/servers/iscsi_images/hydrar-desktop-swap-new  logbias               latency                default
stuff_array/servers/iscsi_images/hydrar-desktop-swap-new  dedup                 off                    default
stuff_array/servers/iscsi_images/hydrar-desktop-swap-new  mlslabel              none                   default
stuff_array/servers/iscsi_images/hydrar-desktop-swap-new  sync                  disabled               local
stuff_array/servers/iscsi_images/hydrar-desktop-swap-new  refcompressratio      2.65x                  -
stuff_array/servers/iscsi_images/hydrar-desktop-swap-new  written               26.0G                  -
stuff_array/servers/iscsi_images/hydrar-desktop-swap-new  logicalused           16.0G                  -
stuff_array/servers/iscsi_images/hydrar-desktop-swap-new  logicalreferenced     16.0G                  -
stuff_array/servers/iscsi_images/hydrar-desktop-swap-new  snapdev               hidden                 default
stuff_array/servers/iscsi_images/hydrar-desktop-swap-new  context               none                   default
stuff_array/servers/iscsi_images/hydrar-desktop-swap-new  fscontext             none                   default
stuff_array/servers/iscsi_images/hydrar-desktop-swap-new  defcontext            none                   default
stuff_array/servers/iscsi_images/hydrar-desktop-swap-new  rootcontext           none                   default
stuff_array/servers/iscsi_images/hydrar-desktop-swap-new  redundant_metadata    all                    default
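For reference, a fio job of that shape might look like this (a sketch - every parameter is my assumption, not necessarily the job that was run):

# write 16G of buffers that are roughly half zeroes, half incompressible data
fio --name=fill --filename=/dev/zvol/stuff_array/servers/iscsi_images/hydrar-desktop-swap-new \
    --rw=write --bs=1M --size=16G --buffer_compress_percentage=50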
gmelikov commented 7 years ago

Closing as stale.

If it's still an issue, feel free to reopen.

mheubach commented 6 years ago

Hmm, I ran into the same trouble: raidz2 with 8x8TB, physical sector size of the HDDs 4096 bytes, ZFS version 0.7.9 from Proxmox VE.

I created a volume with 10TB, formatted it with ext4, and mounted it. Then I simply dd'd /dev/urandom into a single file.
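Something like this (a sketch - the mount point and file name are hypothetical):

# fill the ext4 filesystem on the zvol with incompressible data
dd if=/dev/urandom of=/mnt/test/random.bin bs=1M

After a while the volume looks like this: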

pool1/test  type                  volume                 -
pool1/test  creation              Mon Aug  6 18:26 2018  -
pool1/test  used                  10.6T                  -
pool1/test  available             39.9T                  -
pool1/test  referenced            71.9G                  -
pool1/test  compressratio         1.00x                  -
pool1/test  reservation           none                   default
pool1/test  volsize               10T                    local
pool1/test  volblocksize          4K                     -
pool1/test  checksum              on                     default
pool1/test  compression           off                    local
pool1/test  readonly              off                    default
pool1/test  createtxg             977                    -
pool1/test  copies                1                      default
pool1/test  refreservation        10.6T                  local
pool1/test  guid                  98032921592636052      -
pool1/test  primarycache          all                    default
pool1/test  secondarycache        all                    default
pool1/test  usedbysnapshots       0B                     -
pool1/test  usedbydataset         71.9G                  -
pool1/test  usedbychildren        0B                     -
pool1/test  usedbyrefreservation  10.6T                  -
pool1/test  logbias               latency                default
pool1/test  dedup                 off                    default
pool1/test  mlslabel              none                   default
pool1/test  sync                  standard               default
pool1/test  refcompressratio      1.00x                  -
pool1/test  written               71.9G                  -
pool1/test  logicalused           33.7G                  -
pool1/test  logicalreferenced     33.7G                  -
pool1/test  volmode               default                default
pool1/test  snapshot_limit        none                   default
pool1/test  snapshot_count        none                   default
pool1/test  snapdev               hidden                 default
pool1/test  context               none                   default
pool1/test  fscontext             none                   default
pool1/test  defcontext            none                   default
pool1/test  rootcontext           none                   default
pool1/test  redundant_metadata    all                    default

Writing into a dataset, however, works as it should:


NAME         PROPERTY              VALUE                  SOURCE
pool1/test2  type                  filesystem             -
pool1/test2  creation              Mon Aug  6 18:31 2018  -
pool1/test2  used                  22.5G                  -
pool1/test2  available             29.3T                  -
pool1/test2  referenced            22.5G                  -
pool1/test2  compressratio         1.00x                  -
pool1/test2  mounted               yes                    -
pool1/test2  quota                 none                   default
pool1/test2  reservation           none                   default
pool1/test2  recordsize            128K                   default
pool1/test2  mountpoint            /pool1/test2           default
pool1/test2  sharenfs              off                    default
pool1/test2  checksum              on                     default
pool1/test2  compression           off                    local
pool1/test2  atime                 on                     default
pool1/test2  devices               on                     default
pool1/test2  exec                  on                     default
pool1/test2  setuid                on                     default
pool1/test2  readonly              off                    default
pool1/test2  zoned                 off                    default
pool1/test2  snapdir               hidden                 default
pool1/test2  aclinherit            restricted             default
pool1/test2  createtxg             1038                   -
pool1/test2  canmount              on                     default
pool1/test2  xattr                 on                     default
pool1/test2  copies                1                      default
pool1/test2  version               5                      -
pool1/test2  utf8only              off                    -
pool1/test2  normalization         none                   -
pool1/test2  casesensitivity       sensitive              -
pool1/test2  vscan                 off                    default
pool1/test2  nbmand                off                    default
pool1/test2  sharesmb              off                    default
pool1/test2  refquota              none                   default
pool1/test2  refreservation        none                   default
pool1/test2  guid                  3232922925337121303    -
pool1/test2  primarycache          all                    default
pool1/test2  secondarycache        all                    default
pool1/test2  usedbysnapshots       0B                     -
pool1/test2  usedbydataset         22.5G                  -
pool1/test2  usedbychildren        0B                     -
pool1/test2  usedbyrefreservation  0B                     -
pool1/test2  logbias               latency                default
pool1/test2  dedup                 off                    default
pool1/test2  mlslabel              none                   default
pool1/test2  sync                  standard               default
pool1/test2  dnodesize             legacy                 default
pool1/test2  refcompressratio      1.00x                  -
pool1/test2  written               22.5G                  -
pool1/test2  logicalused           22.5G                  -
pool1/test2  logicalreferenced     22.5G                  -
pool1/test2  volmode               default                default
pool1/test2  filesystem_limit      none                   default
pool1/test2  snapshot_limit        none                   default
pool1/test2  filesystem_count      none                   default
pool1/test2  snapshot_count        none                   default
pool1/test2  snapdev               hidden                 default
pool1/test2  acltype               off                    default
pool1/test2  context               none                   default
pool1/test2  fscontext             none                   default
pool1/test2  defcontext            none                   default
pool1/test2  rootcontext           none                   default
pool1/test2  relatime              off                    default
pool1/test2  redundant_metadata    all                    default
pool1/test2  overlay               off                    default

Does anybody have an idea about this?

Best regards, Manfred

gmelikov commented 6 years ago

@mheubach please post the exact command you used to create the zvol.

mheubach commented 6 years ago

zfs create -V 10T -o volblocksize=4k -o compression=off pool1/test

pool has been created with this command:

zpool create -o ashift=12 -O compression=on pool1 raidz2 sda sdb sdc sdd sde sdf sdg sdh

richardelling commented 6 years ago

@mheubach the accounting looks as expected, what exactly is your question?

Also, be aware that raidz2 with ashift=12 and volblocksize=4k is not space efficient.
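To make that concrete (my arithmetic, not from the comment): with ashift=12 the smallest allocation is one 4 KiB sector, and raidz2 attaches two parity sectors to every stripe, however narrow. A volblocksize=4k block therefore allocates

4 KiB data + 2 x 4 KiB parity = 12 KiB on disk per 4 KiB written

i.e. roughly 200% raw overhead, matching the "200% overhead is correct and predictable" remark further down.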

mheubach commented 6 years ago

In the meantime I have read about raidz2 space efficiency, and I will play around a bit. The idea behind volblocksize=4k was to use the same blocksize as ext4. I expected some overhead, but it came as a surprise that the zvol consumes twice as much space as is actually written to it. I will verify this against a pool with raidz instead of raidz2.

mheubach commented 6 years ago

Ok, I made some tests. The pool is raidz2 again (8 disks), and I have 3 ZVOLs with different volblocksizes. volblocksize=128k produces nearly no overhead, but will for sure cause a lot of IO when copy-on-write comes into action. Smaller volblocksizes cause more overhead: it is huge with volblocksize < 16k, drops immediately to about 7% with volblocksize=16k, 32k or 64k, and to 0 with volblocksize=128k. Is there any magic mathematics for calculating the best volblocksize depending on ashift, number of disks, RAID level, ...? My use case here is archiving already-compressed data over a "slow" WAN link, so performance and ZFS compression are not my concern and I will opt for the smallest overhead. Anyway, this behaviour seems at least peculiar to me :-)

NAME         VOLBLOCK  LREFER  REFER
pool1/test5        8K   16.7G  35.7G  (213% overhead)
pool1/test2       16K   89.4G  95.6G  (6.93% overhead)
pool1/test3       32K   89.9G  96.0G  (6.78% overhead)
pool1/test4       64K   50.1G  53.5G  (6.78% overhead)
pool1/test       128K   60.7G  60.7G  (0% overhead)

richardelling commented 6 years ago

200% overhead is correct and predictable.

https://www.delphix.com/blog/delphix-engineering/zfs-raidz-stripe-width-or-how-i-learned-stop-worrying-and-love-raidz