letsencrypt / openzfs-nvme-databases

Creative Commons Zero v1.0 Universal

ZFS datastore for MariaDB

This documents the settings we use at Let's Encrypt to create ZFS backing storage for MariaDB, and the tips and best practices that led us here.

Storage priorities

Our priorities are, in order:

  1. Integrity
  2. Performance
  3. Durability

Our data must be correct. Integrity is our service's most important property. We mustn't change settings which might defeat ZFS's inherent high integrity.

We've had stubborn performance problems because of our database's large size, schema, and access patterns. We ran out of improvements to those properties that we could make in the short term. This made it a high priority to wring as much performance out of our database servers as we safely could.

Our primary database server rapidly replicates to two others, spanning two locations, and is backed up daily. The most business- and compliance-critical data is also logged separately, outside of our database stack. As long as we can maintain durability for long enough to evacuate the primary (write) role to a healthier database server, that is enough.

Accordingly, we're tuning ZFS and MariaDB for performance over durability, avoiding only the more dangerous tradeoffs.

Preparing the drives

Many modern storage drives can present different sector sizes (LBA formats) to the host system. At most one of them will be the drive's internal, best-performing sector size. This is often the largest sector size the drive can natively support, e.g. "4Kn."^1^3 We've used some Solidigm (formerly Intel) NVMe drives, which have a changeable "variable sector size."^5 The online documentation and specifications didn't list the sector size options for the P4610 model we use, but scanning it showed us two possible LBA format values: 0 (512B) or 1 (4KB). flashbench^6 results strongly suggest that the internal sector size is 8KB, which is larger than either option the drive exposes.
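ZFS describes a pool's sector size as a base-2 exponent (the ashift property used when the pool is created later in this document), so it can help to see the sizes above as powers of two. The following is a minimal sketch, not part of our tooling; the function name sector_exponent is hypothetical.

```shell
#!/usr/bin/env bash
# Print the base-2 exponent of a power-of-two sector size given in bytes.
# Returns non-zero if the size is not a positive power of two.
# (sector_exponent is a hypothetical helper name, used for illustration.)
sector_exponent() {
    local bytes=$1 exp=0
    if (( bytes <= 0 || (bytes & (bytes - 1)) != 0 )); then
        echo "not a power of two: ${bytes}" >&2
        return 1
    fi
    while (( bytes > 1 )); do
        bytes=$(( bytes >> 1 ))
        exp=$(( exp + 1 ))
    done
    echo "${exp}"
}

# The two sizes the drive exposes, plus the size flashbench suggests.
for size in 512 4096 8192; do
    printf '%5d bytes = 2^%s\n' "${size}" "$(sector_exponent "${size}")"
done
```

An 8KB sector corresponds to an exponent of 13, which is the value passed as ashift in the pool creation command.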

Implementation

We use the Solidigm Storage Tool^8 to set the Variable Sector Size to 4,096 bytes (LBAFormat=1), the best-performing of the available options.

WARNING: This erases all data.

for driveIndex in {0..23}; do
    # Reformat the drive to LBA format 1 (4,096-byte sectors).
    sudo sst start                  \
        -ssd "${driveIndex}"        \
        -nvmeformat                 \
            LBAFormat=1             \
            SecureEraseSetting=0    \
            ProtectionInformation=0 \
            MetadataSettings=0

    # Show the sector size the drive now reports.
    sudo sst show                   \
        -display SectorSize         \
        -ssd "${driveIndex}"
done
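As an independent cross-check after formatting, the kernel's view of each namespace can be read from sysfs. This is a sketch under assumptions, not part of the procedure above: it assumes a Linux host where NVMe namespaces appear as /sys/block/nvme*n*, and the function name check_logical_block_size is hypothetical. It does not try to map SST drive indexes to kernel device names.

```shell
#!/usr/bin/env bash
# List NVMe namespaces whose logical block size differs from the expected one.
# $1: expected size in bytes (default 4096)
# $2: sysfs block directory (default /sys/block; overridable for testing)
# Returns non-zero if any namespace reports a different size.
check_logical_block_size() {
    local expected=${1:-4096} root=${2:-/sys/block}
    local dev size status=0
    for dev in "${root}"/nvme*n*; do
        # Skip the unexpanded glob when no NVMe namespaces are present.
        [[ -r "${dev}/queue/logical_block_size" ]] || continue
        size=$(<"${dev}/queue/logical_block_size")
        if [[ "${size}" != "${expected}" ]]; then
            echo "${dev##*/}: ${size} (expected ${expected})"
            status=1
        fi
    done
    return "${status}"
}

check_logical_block_size 4096 || echo "some namespaces are not using 4,096-byte sectors" >&2
```

The function prints nothing and returns success when every namespace it finds reports the expected size.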

^5

ZFS kernel module settings

Building the vdevs, pool & datasets

Basic concepts, from the bottom up^11

Vdevs & pool

Implementation

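# The "# [etc.]" line below stands in for the elided mirror pairs. Replace it
# with real devices before running: as written, the comment ends the command.
sudo zpool create                          \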
    -o ashift=13                           \
    -o autoreplace=on                      \
    db01                                   \
    mirror                                 \
        /dev/disk/by-id/nvme-P4610_Drive01 \
        /dev/disk/by-id/nvme-P4610_Drive02 \
    mirror                                 \
        /dev/disk/by-id/nvme-P4610_Drive03 \
        /dev/disk/by-id/nvme-P4610_Drive04 \
# [etc.]                                   \
    mirror                                 \
        /dev/disk/by-id/nvme-P4610_Drive21 \
        /dev/disk/by-id/nvme-P4610_Drive22 \
    spare                                  \
        /dev/disk/by-id/nvme-P4610_Drive23 \
        /dev/disk/by-id/nvme-P4610_Drive24
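The command above groups consecutive drives into two-way mirrors and keeps the last two as hot spares. The following is a minimal sketch of that grouping, assuming the device list is already in the desired order; the function name mirror_layout and the device names in the example call are hypothetical.

```shell
#!/usr/bin/env bash
# Print zpool vdev arguments: consecutive pairs of devices become mirrors,
# and the last <spare_count> devices become hot spares.
# Usage: mirror_layout <spare_count> <device>...
# (mirror_layout is a hypothetical helper name, used for illustration.)
mirror_layout() {
    local spares=$1; shift
    local data=$(( $# - spares ))
    if (( spares < 0 || data < 2 || data % 2 != 0 )); then
        echo "need an even number of data devices" >&2
        return 1
    fi
    while (( data > 0 )); do
        printf 'mirror %s %s\n' "$1" "$2"
        shift 2
        data=$(( data - 2 ))
    done
    # Whatever remains after the mirror pairs is the spare list.
    if (( $# > 0 )); then
        printf 'spare'
        printf ' %s' "$@"
        printf '\n'
    fi
}

# Hypothetical device names, for illustration only.
mirror_layout 2 diskA diskB diskC diskD diskE diskF
```

The output is only printed, never passed to zpool, so the layout can be reviewed before anything is created.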

Parent dataset

Implementation

sudo zfs create              \
    -o mountpoint=/datastore \
    -o atime=off             \
    -o compression=lz4       \
    -o dnodesize=auto        \
    -o primarycache=metadata \
    -o recordsize=128k       \
    -o xattr=sa              \
    -o acltype=posixacl      \
    db01/mysql

sudo zfs get acltype

InnoDB child dataset

Implementation

sudo zfs create                 \
    -o mountpoint=/datastore/db \
    -o logbias=throughput       \
    -o recordsize=16k           \
    -o redundant_metadata=most  \
    db01/mysql/innodb
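The sizes in these commands can be sanity-checked against each other. The sketch below assumes MariaDB's default InnoDB page size of 16,384 bytes (an assumption, since the MariaDB settings are not shown here) and uses a hypothetical helper, size_to_bytes, that handles only the k and m suffixes.

```shell
#!/usr/bin/env bash
# Convert a ZFS-style size such as "16k" or "128k" to bytes.
# Only the suffixes k and m are handled in this sketch.
# (size_to_bytes is a hypothetical helper name, used for illustration.)
size_to_bytes() {
    local value=$1
    case "${value}" in
        *[kK]) echo $(( ${value%[kK]} * 1024 )) ;;
        *[mM]) echo $(( ${value%[mM]} * 1024 * 1024 )) ;;
        *)     echo $(( value )) ;;
    esac
}

recordsize=$(size_to_bytes 16k)   # recordsize from the zfs create command above
innodb_page_size=16384            # assumption: MariaDB's default InnoDB page size
sector_size=$(( 1 << 13 ))        # ashift=13 from the zpool create command

echo "recordsize=${recordsize} page=${innodb_page_size} sector=${sector_size}"
if (( recordsize == innodb_page_size )); then
    echo "recordsize matches the InnoDB page size"
fi
if (( recordsize % sector_size == 0 )); then
    echo "each record spans $(( recordsize / sector_size )) sectors"
fi
```

If the server runs with a non-default innodb_page_size, the second variable needs to change accordingly.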

MariaDB settings

Operations

References