openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Unbalanced data distribution across (raidz2) vdevs #14858

Open AllKind opened 1 year ago

AllKind commented 1 year ago

System information

Type Version/Name
Distribution Name Linux Mint
Distribution Version 21.1
Kernel Version 5.15.111-custom
Architecture amd64
OpenZFS Version zfs-2.1.11-1, zfs-kmod-2.1.11-1 (dkms) (compiled from source and license set to GPL)

Describe the problem you're observing

Short summary: I have one pool containing two raidz2 vdevs. One vdev has 8 x 8TB HDDs, the other 8 x 14TB HDDs. The problem is that the data distribution I observe is not balanced between the two vdevs.

This is my pool layout:

  pool: storage
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: resilvered 18.9G in 00:01:52 with 0 errors on Thu Feb  9 11:50:49 2023
config:

        NAME                                             STATE     READ WRITE CKSUM
        storage                                          ONLINE       0     0     0
          raidz2-0                                       ONLINE       0     0     0
            scsi-2001b4d201546fca5                       ONLINE       0     0     0
            scsi-2001b4d202222dd4d                       ONLINE       0     0     0
            scsi-2001b4d206982375c                       ONLINE       0     0     0
            scsi-2001b4d207ad7e2ae                       ONLINE       0     0     0
            scsi-2001b4d20b76459ca                       ONLINE       0     0     0
            scsi-2001b4d20b8b1e239                       ONLINE       0     0     0
            scsi-2001b4d20c7791a7a                       ONLINE       0     0     0
            scsi-2001b4d20f666a67f                       ONLINE       0     0     0
          raidz2-1                                       ONLINE       0     0     0
            ata-WDC_WD80EFAX-68KNBN0_VAJ1UU1L            ONLINE       0     0     0
            ata-WDC_WD80EFAX-68KNBN0_VAJ1YS2L            ONLINE       0     0     0
            ata-WDC_WD80EFAX-68KNBN0_VDJW9S9D            ONLINE       0     0     0
            ata-WDC_WD80EFZX-68UW8N0_VK094RAY            ONLINE       0     0     0
            ata-WDC_WD80EFZX-68UW8N0_VK0GXYLY            ONLINE       0     0     0
            ata-WDC_WD80EFZX-68UW8N0_VK0J9X1Y            ONLINE       0     0     0
            ata-WDC_WD80EFZX-68UW8N0_VK0MU6WY            ONLINE       0     0     0
            ata-WDC_WD80EFZX-68UW8N0_VK0MVSHY            ONLINE       0     0     0
        logs
          nvme-Force_MP600_20028229000128552381-part5    ONLINE       0     0     0
        cache
          ata-Samsung_SSD_850_PRO_256GB_S39KNX0HA64123D  ONLINE       0     0     0
          ata-Samsung_SSD_850_PRO_256GB_S39KNX0HA05694M  ONLINE       0     0     0

errors: No known data errors

I'm just a "regular user" and I store mostly video content on my pool. Originally I had only the 8 x 8TB WDC drives (internal). When space was getting short, I bought an external enclosure (connected over Thunderbolt 3) and the 8 x 14TB Seagate drives. With the help of people at zfsonlinux.topicbox.com/groups/zfs-discuss I migrated the data into this new pool with the following procedure: 1) copy all data to a new pool on the drives in the new enclosure; 2) destroy the old pool and add its drives as a second raidz2 vdev to the new pool.
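
For reference, a minimal sketch of that procedure in commands - not the exact commands I ran; "oldpool", the "@migrate" snapshot and the abbreviated ata-WDC_* device names are placeholders:

# 1) replicate everything from the old internal pool to the new pool on the enclosure
zfs snapshot -r oldpool@migrate
zfs send -R oldpool@migrate | zfs recv -F storage

# 2) destroy the old pool and add its 8 drives as a second raidz2 vdev
zpool destroy oldpool
zpool add storage raidz2 ata-WDC_1 ata-WDC_2 ata-WDC_3 ata-WDC_4 \
    ata-WDC_5 ata-WDC_6 ata-WDC_7 ata-WDC_8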

From what I've read and was told at zfs-discuss, ZFS should balance the distribution of data across vdevs. But for a while now I have been watching the gap grow: it looks to me like more data is written to the bigger vdev (raidz2-0), which actually has less free space left. This is the output of zpool list -v taken 3 times over the last couple of weeks:

NAME                                              SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
storage                                           160T  88.9T  71.2T        -         -     1%    55%  1.00x    ONLINE  -
  raidz2-0                                        102T  69.3T  32.5T        -         -     1%  68.1%      -    ONLINE
    scsi-2001b4d201546fca5                       12.7T      -      -        -         -      -      -      -    ONLINE
    scsi-2001b4d202222dd4d                       12.7T      -      -        -         -      -      -      -    ONLINE
    scsi-2001b4d206982375c                       12.7T      -      -        -         -      -      -      -    ONLINE
    scsi-2001b4d207ad7e2ae                       12.7T      -      -        -         -      -      -      -    ONLINE
    scsi-2001b4d20b76459ca                       12.7T      -      -        -         -      -      -      -    ONLINE
    scsi-2001b4d20b8b1e239                       12.7T      -      -        -         -      -      -      -    ONLINE
    scsi-2001b4d20c7791a7a                       12.7T      -      -        -         -      -      -      -    ONLINE
    scsi-2001b4d20f666a67f                       12.7T      -      -        -         -      -      -      -    ONLINE
  raidz2-1                                       58.2T  19.5T  38.7T        -         -     1%  33.5%      -    ONLINE
    ata-WDC_WD80EFAX-68KNBN0_VAJ1UU1L            7.28T      -      -        -         -      -      -      -    ONLINE
    ata-WDC_WD80EFAX-68KNBN0_VAJ1YS2L            7.28T      -      -        -         -      -      -      -    ONLINE
    ata-WDC_WD80EFAX-68KNBN0_VDJW9S9D            7.28T      -      -        -         -      -      -      -    ONLINE
    ata-WDC_WD80EFZX-68UW8N0_VK094RAY            7.28T      -      -        -         -      -      -      -    ONLINE
    ata-WDC_WD80EFZX-68UW8N0_VK0GXYLY            7.28T      -      -        -         -      -      -      -    ONLINE
    ata-WDC_WD80EFZX-68UW8N0_VK0J9X1Y            7.28T      -      -        -         -      -      -      -    ONLINE
    ata-WDC_WD80EFZX-68UW8N0_VK0MU6WY            7.28T      -      -        -         -      -      -      -    ONLINE
    ata-WDC_WD80EFZX-68UW8N0_VK0MVSHY            7.28T      -      -        -         -      -      -      -    ONLINE

NAME                                              SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
storage                                           160T  89.6T  70.5T        -         -     1%    55%  1.00x    ONLINE  -
  raidz2-0                                        102T  69.8T  32.0T        -         -     1%  68.6%      -    ONLINE
  ...
  raidz2-1                                       58.2T  19.7T  38.5T        -         -     1%  33.9%      -    ONLINE
  ...

NAME                                              SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
storage                                           160T  91.5T  68.6T        -         -     1%    57%  1.00x    ONLINE  -
  raidz2-0                                        102T  71.2T  30.7T        -         -     1%  69.9%      -    ONLINE
  ...
  raidz2-1                                       58.2T  20.3T  37.9T        -         -     1%  34.8%      -    ONLINE
  ...

So if it is true that new data should be distributed in a way that balances the available space on each vdev towards an approximately equal value, then I would tend to call the observed behavior a bug. If not, then I'm sorry for creating a bug report, but I would still like to ask for help on how to get the data/free space balanced across my vdevs.
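
For anyone who wants to track this over time, a trivial loop like the following should work (the log path and interval are arbitrary; -H/-p just make the output easier to diff):

# append a timestamped per-vdev snapshot of SIZE/ALLOC/FREE/CAP once a day
while true; do
    date -Is >> /root/vdev-balance.log
    zpool list -v -H -p storage >> /root/vdev-balance.log
    sleep 86400
done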

For completeness: the pool was created with zfs v0.8.x and I have not yet run a feature upgrade on it. The output of zpool get all:

NAME     PROPERTY                       VALUE                          SOURCE
storage  size                           160T                           -
storage  capacity                       57%                            -
storage  altroot                        -                              default
storage  health                         ONLINE                         -
storage  guid                           11570478517308238918           -
storage  version                        -                              default
storage  bootfs                         -                              default
storage  delegation                     on                             default
storage  autoreplace                    on                             local
storage  cachefile                      -                              default
storage  failmode                       wait                           default
storage  listsnapshots                  on                             local
storage  autoexpand                     off                            default
storage  dedupratio                     1.00x                          -
storage  free                           68.6T                          -
storage  allocated                      91.5T                          -
storage  readonly                       off                            -
storage  ashift                         12                             local
storage  comment                        -                              default
storage  expandsize                     -                              -
storage  freeing                        0                              -
storage  fragmentation                  1%                             -
storage  leaked                         0                              -
storage  multihost                      off                            default
storage  checkpoint                     -                              -
storage  load_guid                      12541147851607432017           -
storage  autotrim                       off                            default
storage  compatibility                  off                            default
storage  feature@async_destroy          enabled                        local
storage  feature@empty_bpobj            active                         local
storage  feature@lz4_compress           active                         local
storage  feature@multi_vdev_crash_dump  enabled                        local
storage  feature@spacemap_histogram     active                         local
storage  feature@enabled_txg            active                         local
storage  feature@hole_birth             active                         local
storage  feature@extensible_dataset     active                         local
storage  feature@embedded_data          active                         local
storage  feature@bookmarks              enabled                        local
storage  feature@filesystem_limits      enabled                        local
storage  feature@large_blocks           active                         local
storage  feature@large_dnode            enabled                        local
storage  feature@sha512                 enabled                        local
storage  feature@skein                  active                         local
storage  feature@edonr                  enabled                        local
storage  feature@userobj_accounting     active                         local
storage  feature@encryption             active                         local
storage  feature@project_quota          active                         local
storage  feature@device_removal         enabled                        local
storage  feature@obsolete_counts        enabled                        local
storage  feature@zpool_checkpoint       enabled                        local
storage  feature@spacemap_v2            active                         local
storage  feature@allocation_classes     enabled                        local
storage  feature@resilver_defer         enabled                        local
storage  feature@bookmark_v2            enabled                        local
storage  feature@redaction_bookmarks    disabled                       local
storage  feature@redacted_datasets      disabled                       local
storage  feature@bookmark_written       disabled                       local
storage  feature@log_spacemap           disabled                       local
storage  feature@livelist               disabled                       local
storage  feature@device_rebuild         disabled                       local
storage  feature@zstd_compress          disabled                       local
storage  feature@draid                  disabled                       local

The output of zfs list -o space,compression,compressratio,encryption,recordsize,mountpoint,mounted -t filesystem:

NAME                        AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD  COMPRESS        RATIO  ENCRYPTION   RECSIZE  MOUNTPOINT          MOUNTED
storage                     48.6T  65.0T        0B    427K             0B      65.0T  off             1.00x  off             128K  none                no
storage/backup_games        48.6T   192G        0B    205K             0B       192G  off             1.00x  off              32K  none                no
storage/backup_games/Games  48.6T   192G        0B    192G             0B         0B  off             1.00x  off              32K  none                no
storage/base                48.6T   875G     4.22G    871G             0B         0B  off             1.00x  off             128K  /storage            yes
storage/game                48.6T  1.02T        0B   1.02T             0B         0B  off             1.00x  off               1M  /storage/pub/game   yes
storage/music               48.6T  1.36T     26.3M   1.36T             0B         0B  off             1.00x  off             512K  /storage/pub/music  yes
storage/safe                48.6T  82.5G     3.25M   82.5G             0B         0B  lz4             1.00x  aes-256-gcm     128K  /storage/safe       yes
storage/video               48.6T  59.6T     1.64T   57.9T             0B         0B  off             1.00x  off               1M  /storage/pub/video  yes

Thank you very much in advance for your help (and of course a big thank you to all developers and contributors)!

Sawtaytoes commented 9 months ago

This looks right to me. I might be completely wrong, but from what I've seen, ZFS will use up as much space as it can until a drive reaches 80% or so, and then it slowly drips in more data.

Depending on the ZFS version, one of the devs was saying it would place data based on how fast it could write that data, so faster drives got priority. He also said that as a drive fills up, it slows down, so drives with more capacity would start being utilized more, and you'd get an uneven but fast distribution of data.

I don't have any sources. I remember hearing it in an OpenZFS talk.

AllKind commented 9 months ago

@Sawtaytoes Thank you for your answer. After I created this post, I read a comment from amotin in another issue here, confirming the priority for faster drives, which in my case are the scsi-WHATEVER named drives. I didn't know about the 80% threshold. It would be nice if a developer could confirm that.

AlistairMcCutcheonIAS commented 7 months ago

> @Sawtaytoes Thank you for your answer. After I created this post, I read a comment from amotin in another issue here, confirming the priority for faster drives, which in my case are the scsi-WHATEVER named drives. I didn't know about the 80% threshold. It would be nice if a developer could confirm that.

I second this - I'd be very interested to see some official documentation explaining how zfs writes to virtual devices (especially unbalanced virtual devices), rather than word of mouth.

Sawtaytoes commented 7 months ago

I wish I had a source. I got this from watching an OpenZFS talk, and I don't remember which one.

All my zpools use the same types of drives, so I can't verify. It's a rarer scenario to have HDD and SSD mixed as data vdevs.

I bet there are some tuning parameters you can tweak though.
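
If someone wants to check what their allocator is currently doing, the allocation-related module parameters can at least be read back (parameter names taken from the zfs module parameters man page; whether they explain this particular imbalance is just my guess):

# inspect allocation throttle and metaslab weighting settings (read-only here)
grep . /sys/module/zfs/parameters/zio_dva_throttle_enabled \
       /sys/module/zfs/parameters/metaslab_bias_enabled \
       /sys/module/zfs/parameters/metaslab_lba_weighting_enabled \
       /sys/module/zfs/parameters/zfs_mg_noalloc_threshold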

AlistairMcCutcheonIAS commented 7 months ago

I'm using two raidz2 vdevs, one of 7 x 14TB drives and one of 7 x 16TB drives - the 14TB vdev is at 90% capacity, the 16TB vdev at 50% capacity. Writing 157 GB of data (after compression), 83% went to the 16TB vdev and 13% went to the 14TB vdev.

The 14TB drives were operating at 98% utilisation, the 16TB drives were operating at 60% utilisation.
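
For anyone wanting to reproduce this, the per-vdev split can be watched live during a large write with zpool iostat (pool name is a placeholder):

# per-vdev and per-disk read/write bandwidth, refreshed every 5 seconds
zpool iostat -v tank 5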

Sawtaytoes commented 7 months ago
  1. Are those both HDD vdevs? I assume so :p.
  2. Are some SMR rather than CMR?
  3. Did you add one vdev later after already adding data to your zpool?
AlistairMcCutcheonIAS commented 7 months ago
  1. It would be hard to find a 14TB SSD haha
  2. All drives are CMR
  3. The 16TB vdev was added after the 14TB vdev was at 85% capacity (or thereabouts)
AllKind commented 7 months ago

@AlistairMcCutcheonIAS How about rotation speed? Same, or different for the 14T and 16T drives?

My smaller vdev's drives rotate at 5400 rpm, the bigger one's at 7200 rpm. It would be nice to have some tunable(s) there. Something like: prefer-fast-drives = on|off, balance-distribution-threshold = XX% (percent), prefer-vdev-with-more-space = XX% (percent). Just ideas...
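
Until such tunables exist, the only rebalancing workaround I'm aware of (my own assumption, not an official recommendation) is to rewrite existing data so it gets reallocated with the current vdev weighting, for example by replicating a dataset within the pool and swapping it in:

# rewrite one dataset so its blocks are reallocated across both vdevs
# (dataset/snapshot names are placeholders; this needs enough free space for a
#  full second copy, and writes made after the snapshot would be lost)
zfs snapshot -r storage/video@rebalance
zfs send -R storage/video@rebalance | zfs recv storage/video_new
zfs destroy -r storage/video
zfs rename storage/video_new storage/video

That only helps per dataset and costs a full copy, so it is more a workaround than a fix.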

Wish to Santa: Bring us Block Pointer Rewrite! ;-)