problame / master-thesis


ZIL - PMEM slower than NVMe #1

Open thomasklosinsky opened 1 year ago

thomasklosinsky commented 1 year ago

Hi Christian,

I have seen your presentation of the zil-pmem method (https://www.youtube.com/watch?v=2ouRsQEbusk) and I am looking for ideas why a new storage server (Supermicro SSG-640SP-E1CR60) is behaving like this:

fio on Intel Optane 200 128GB, direct, bs=256k: [w=1250MiB/s][w=4998 IOPS]
fio on Micron 7400 1.92T NVMe, direct, bs=256k: [w=1849MiB/s][w=7395 IOPS]

which is already strange on its own.

Latency is:

PMEM: 8 us
NVMe: 80 us
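
For reference, a gap like that can be reproduced with a short QD1 fio run against the raw devices; roughly like this (device paths are examples, not necessarily our exact command, and the run overwrites the devices):

    fio --name=lat-pmem --filename=/dev/pmem0 --ioengine=posixaio --rw=randwrite \
        --bs=4k --iodepth=1 --direct=1 --runtime=30 --time_based
    fio --name=lat-nvme --filename=/dev/nvme0n1 --ioengine=posixaio --rw=randwrite \
        --bs=4k --iodepth=1 --direct=1 --runtime=30 --time_based
    # compare the average "clat" reported for each run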

The HDD pool contains 4 x 12 Seagate Exos 16T drives.

With a 256k recordsize (and 256k fio blocksize), I get the following randwrite results:

sync=standard - no ZIL device - 2x NVMe mirror special vdev: [w=9225KiB/s][w=36 IOPS]
sync=standard - mirrored PMEM ZIL - 2x NVMe mirror special vdev: [w=547MiB/s][w=2189 IOPS]
sync=standard - mirrored NVMe ZIL - 2x NVMe mirror special vdev: [w=1237MiB/s][w=4949 IOPS]
sync=disabled - ZIL not used - 2x NVMe mirror special vdev: [w=3721MiB/s][w=14.9k IOPS]

Of course, we would love to use sync writes on a storage server ;)... but we could not believe that the PMEM result is that bad...
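
For reference, switching between the variants above comes down to commands of this shape (pool/dataset and device names here are placeholders, not our exact commands):

    # add / remove a mirrored log (SLOG) device
    zpool add poolname log mirror /dev/pmem0 /dev/pmem1
    zpool remove poolname mirror-7      # remove it again by its vdev name from zpool status

    # toggle sync behaviour on the dataset under test
    zfs set sync=standard poolname/test
    zfs set sync=disabled poolname/test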

Is your solution already implemented somewhere? Would it change something in this case? I assume so...?!

Best regards, Thomas

problame commented 1 year ago

Is your solution already implemented somewhere?

No, it's not implemented anywhere yet.

Would it change something in this case? I assume so...?!

Your description is unclear about what the normal allocation class looks like, i.e., what you are doing with the HDDs. A zpool status dump says more than a thousand words :)

Also, provide the full fio config / command line for each result. For example, it's unclear whether you are doing the randwrites with --sync or not.
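
The distinction matters because only synchronous writes go through the ZIL at all. Roughly (paths here are placeholders):

    # sync writes: each write is O_SYNC, so it exercises the ZIL / SLOG
    fio --name=syncwrite --directory=/poolname/test --ioengine=posixaio --rw=randwrite \
        --bs=256k --iodepth=1 --numjobs=1 --size=32G --runtime=60 --time_based \
        --group_reporting --sync=1
    # without --sync=1 (and without --fsync / --fdatasync) the same job issues
    # asynchronous writes, which bypass the ZIL entirely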

Finally: I don't have experience with special vdevs. The docs say:

By default this includes all metadata, the indirect blocks of user data, and any deduplication tables. The class can also be provisioned to accept small file blocks.

So, with your 256k writes, the special devices are only used for ZFS metadata. Which means the 256k data blocks themselves still end up on the normal (HDD) vdevs.
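
From the docs, the knob for routing small data blocks to the special class appears to be the special_small_blocks dataset property; an untested sketch (dataset name is a placeholder):

    # show what currently goes to the special vdevs
    zfs get recordsize,special_small_blocks poolname/test
    # route data blocks <= 128K to the special vdevs; with 256K records the
    # full-size 256K data blocks would still land on the normal vdevs
    zfs set special_small_blocks=128K poolname/test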

thomasklosinsky commented 1 year ago

Hi Christian,

thank you for your answer! Too bad your solution is not implemented.

We have the following configuration now:

    poolname
      raidz2-0  ONLINE  0  0  0
        scsi-35000c500d78491bf  ONLINE  0  0  0
        scsi-35000c500d7876883  ONLINE  0  0  0
        scsi-35000c500d7847657  ONLINE  0  0  0
        scsi-35000c500d73eeb17  ONLINE  0  0  0
        scsi-35000c500d7891317  ONLINE  0  0  0
        scsi-35000c500d7857c3f  ONLINE  0  0  0
        scsi-35000c500d7718e03  ONLINE  0  0  0
        scsi-35000c500d785666f  ONLINE  0  0  0
        scsi-35000c500d781ea6b  ONLINE  0  0  0
        scsi-35000c500d773bc43  ONLINE  0  0  0
        scsi-35000c500d785d45f  ONLINE  0  0  0
        scsi-35000c500d783b4df  ONLINE  0  0  0
      raidz2-2  ONLINE  0  0  0
        scsi-35000c500d784de17  ONLINE  0  0  0
        scsi-35000c500d784cdcb  ONLINE  0  0  0
        scsi-35000c500d79c27eb  ONLINE  0  0  0
        scsi-35000c500d774418b  ONLINE  0  0  0
        scsi-35000c500d785d077  ONLINE  0  0  0
        scsi-35000c500d784072b  ONLINE  0  0  0
        scsi-35000c500d785be6b  ONLINE  0  0  0
        scsi-35000c500d7858e67  ONLINE  0  0  0
        scsi-35000c500d7856e0b  ONLINE  0  0  0
        scsi-35000c500d785e1ff  ONLINE  0  0  0
        scsi-35000c500d772d58f  ONLINE  0  0  0
        scsi-35000c500d78aec9f  ONLINE  0  0  0
      raidz2-3  ONLINE  0  0  0
        scsi-35000c500d784f69b  ONLINE  0  0  0
        scsi-35000c500d78597ff  ONLINE  0  0  0
        scsi-35000c500d772aa73  ONLINE  0  0  0
        scsi-35000c500d8257afb  ONLINE  0  0  0
        scsi-35000c500d81124f3  ONLINE  0  0  0
        scsi-35000c500d8255ad3  ONLINE  0  0  0
        scsi-35000c500d825f54b  ONLINE  0  0  0
        scsi-35000c500d785bc53  ONLINE  0  0  0
        scsi-35000c500d8256a13  ONLINE  0  0  0
        scsi-35000c500d8251e4f  ONLINE  0  0  0
        scsi-35000c500d77a3eaf  ONLINE  0  0  0
        scsi-35000c500d785d71f  ONLINE  0  0  0
      raidz2-6  ONLINE  0  0  0
        scsi-35000c500d8254faf  ONLINE  0  0  0
        scsi-35000c500d825f577  ONLINE  0  0  0
        scsi-35000c500d76340cf  ONLINE  0  0  0
        scsi-35000c500d82516f3  ONLINE  0  0  0
        scsi-35000c500d8250463  ONLINE  0  0  0
        scsi-35000c500d75d1cbb  ONLINE  0  0  0
        scsi-35000c500d81104c7  ONLINE  0  0  0
        scsi-35000c500d8254ae3  ONLINE  0  0  0
        scsi-35000c500d82572db  ONLINE  0  0  0
        scsi-35000c500d825bb33  ONLINE  0  0  0
        scsi-35000c500d784b437  ONLINE  0  0  0
        scsi-35000c500d78b80a7  ONLINE  0  0  0
    special
      mirror-4  ONLINE  0  0  0
        nvme-Micron_7400_MTFDKCB1T9TDZ_2147330F443C-part1  ONLINE  0  0  0
        nvme-Micron_7400_MTFDKCB1T9TDZ_2147330F44F6-part1  ONLINE  0  0  0
      mirror-5  ONLINE  0  0  0
        nvme-Micron_7400_MTFDKCB1T9TDZ_2147330F44FF-part1  ONLINE  0  0  0
        nvme-Micron_7400_MTFDKCB1T9TDZ_22173739586D-part1  ONLINE  0  0  0
    cache
      nvme-Micron_7400_MTFDKCB1T9TDZ_2147330F443C-part2  ONLINE  0  0  0
      nvme-Micron_7400_MTFDKCB1T9TDZ_2147330F44F6-part2  ONLINE  0  0  0
      nvme-Micron_7400_MTFDKCB1T9TDZ_2147330F44FF-part2  ONLINE  0  0  0
      nvme-Micron_7400_MTFDKCB1T9TDZ_22173739586D-part2  ONLINE  0  0  0

errors: No known data errors
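
For reference, a layout of this shape is expressed roughly like this at creation time (abbreviated to a single raidz2 vdev, placeholder device names, not our actual command):

    zpool create poolname \
      raidz2 scsi-DISK01 scsi-DISK02 scsi-DISK03 scsi-DISK04 scsi-DISK05 scsi-DISK06 \
             scsi-DISK07 scsi-DISK08 scsi-DISK09 scsi-DISK10 scsi-DISK11 scsi-DISK12 \
      special mirror nvme-SSD1-part1 nvme-SSD2-part1 \
      cache nvme-SSD1-part2 nvme-SSD2-part2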

We had the PMEM devices and the Microns (partition 2) as ZIL; those are the results you've seen above.

fio command was:

    sync; fio --ioengine=posixaio --rw=randwrite --name=test --direct=1 --sync=1 --bs=256k --numjobs=1 --time_based --group_reporting --runtime=60 --iodepth=1 --size=32G

ZFS options: logbias=latency, sync=standard or disabled (i.e. using the ZIL or not).
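
(They can be checked per dataset with something like the following; dataset name is an example:)

    zfs get logbias,sync,recordsize poolname/test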

The special devices only give a huge performance boost for listing directories (especially for Samba file servers). As far as I know, they don't make any difference in the write path.

We also talked to our dealer CTT, who already had a customer testing with PMEMs a few years ago. They said that the PMEMs perform very badly when there is a bigger write depth (whatever exactly that means), which is how ZFS addresses the storage device. So from what I understand, your modifications in ZFS would change exactly that, right?
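
If "write depth" means queue depth, that is easy to test directly against the raw device, e.g. (device path is an example; the run overwrites the device):

    # compare QD1 vs. QD32 random writes on the raw PMEM block device
    for qd in 1 32; do
      fio --name=qd$qd --filename=/dev/pmem0 --ioengine=libaio --rw=randwrite \
          --bs=256k --direct=1 --iodepth=$qd --runtime=30 --time_based
    done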

thomasklosinsky commented 1 year ago

Update on the PMEM modes:

    sync; fio --ioengine=posixaio --rw=randwrite --name=test --direct=1 --sync=1 --bs=256k --numjobs=1 --time_based --group_reporting --runtime=60 --iodepth=1 --size=32G --filename=/dev/pmem0(s)

    sector: [w=1250MiB/s][w=4998 IOPS]
    fsdax:  [w=1939MiB/s][w=7757 IOPS]
    raw:    [w=1942MiB/s][w=7766 IOPS]

For comparison, the Micron 7400 PCIe 4.0 NVMe: [w=1849MiB/s][w=7395 IOPS].
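
The three results above correspond to the ndctl namespace modes; switching between them looks roughly like this (namespace name is an example, and reconfiguring destroys the namespace contents):

    # reconfigure an existing namespace, e.g. to fsdax mode
    ndctl create-namespace -f -e namespace0.0 --mode=fsdax
    # available modes: raw, sector, fsdax, devdax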