tanelpoder / blog-comments

Comment repo for tanelpoder.com blog

Achieving 11M IOPS & 66 GB/s IO on a Single ThreadRipper Workstation | Tanel Poder Consulting #17

Open utterances-bot opened 3 years ago

utterances-bot commented 3 years ago

Achieving 11M IOPS & 66 GB/s IO on a Single ThreadRipper Workstation | Tanel Poder Consulting

TL;DR Modern disks are so fast that system performance bottleneck shifts to RAM access and CPU. With up to 64 cores, PCIe 4.0 and 8 memory channels, even a single-socket AMD ThreadRipper Pro workstation makes a hell of a powerful machine - if you do it right! Introduction In this post I’ll explain how I configured my AMD ThreadRipper Pro workstation with 10 PCIe 4.0 SSDs to achieve 11M IOPS with 4kB random reads and 66 GiB/s throughput with larger IOs - and what bottlenecks & issues I fixed to get there. - Linux, Oracle, SQL performance tuning and troubleshooting - consulting & training.

https://tanelpoder.com/posts/11m-iops-with-10-ssds-on-amd-threadripper-pro-workstation/

YOUDIEMOFO commented 3 years ago

Bad ass!

What are the chances of a Windows system being optimized similarly and attaining similar results? It goes without saying that I'm totally ignorant when it comes to Linux, unfortunately...

I do have similar hardware and would love to know if, and how, a Windows 10 system could be optimized in a similar way.

thomas-biesmans commented 3 years ago

Awesome deep-dive! Very interesting read, looking forward to those future articles!

00pauln00 commented 3 years ago

Hi Tanel, Thanks for sharing this analysis. I was wondering if you'd tried the test without the --fixedbufs option? On my system, I'm seeing a marginal latency decrease when using fixed buffers, however, I don't have enough devices or CPU threads to approach my machine's memory bandwidth capacity. Do you know to what degree, if any, your performance is affected if --fixedbufs is removed?
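For reference, the kind of A/B comparison I mean would be roughly the following (device path, queue depth and runtime are placeholders, not your exact job):

```bash
# With registered (fixed) buffers -- io_uring pre-registers the IO buffers with the kernel:
sudo fio --name=fixedbufs --filename=/dev/nvme0n1 --ioengine=io_uring \
         --direct=1 --rw=randread --bs=4k --iodepth=32 --numjobs=1 \
         --time_based --runtime=30 --fixedbufs

# Identical job without --fixedbufs, to compare IOPS and latency:
sudo fio --name=nofixedbufs --filename=/dev/nvme0n1 --ioengine=io_uring \
         --direct=1 --rw=randread --bs=4k --iodepth=32 --numjobs=1 \
         --time_based --runtime=30
```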

peacefellow commented 3 years ago

Well done! I learned a good deal from your experience and excellent article, mainly that the placement of the Hyper cards is an important concern. Lenovo's forum shows some concerns with 2 TB SSDs: available, then pulled due to an issue. I've got the ASUS Hyper card for PCIe 4.0 and will soon have some 980 Pros (though I don't like the change in hand from the 970 Pro).

acollaborator commented 3 years ago

Thank you for sharing so much great information in this millennium.

Does the 256 GB in your P620 consist of 4 x 64 GB or 8 x 32 GB RDIMMs?

I suspect the 12-core 3945WX and the 16-core 3955WX are constructed like the EPYC 7272 and 7282, which have 4 memory channels from the SoC I/O controller to the cores rather than 8, because they have 2 CCDs per package instead of 4. The L3 cache size of 64 MB is the same among all of these. If this is the case, I think that 100 GB/s of throughput to main memory rather than 200 GB/s is a fine value trade-off for these SKUs. I just wish I could find reference material from AMD that describes whether this is the case.

Here is some discussion related to this: https://www.servethehome.com/amd-epyc-7002-rome-cpus-with-half-memory-bandwidth/

tanelpoder commented 3 years ago

I used 8 x 32 GB DIMMs. Yep, I have seen that article too and it makes me wonder. Once the 32-core CPUs are available via direct channels (cheaper), I'm sure I'll be tempted to get one to get full memory bandwidth. I have yet to run a test where I actually touch the memory lines after loading them into cache; that may tell us more about the internal memory bandwidth/performance. Thanks for the comment!

tanelpoder commented 3 years ago

I guess I could test this by just taking 4 DIMMs out and testing if the performance/throughput drops noticeably.

tanelpoder commented 3 years ago

I ran mlc in a video about this experiment. I haven't checked how thorough & optimized this tool is, but it reported ~81.8 GB/s of internal throughput: https://youtu.be/5A531KE8O9Q?t=3833
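For reference, the kind of mlc runs I mean are roughly the following (assuming the binary is unpacked into the current directory; these are just the standard mlc modes, not a tuned setup):

```bash
# Peak memory bandwidth for several read/write mixes (needs root for the default measurement method):
sudo ./mlc --max_bandwidth

# Per-NUMA-node bandwidth matrix, useful for spotting asymmetric memory access:
sudo ./mlc --bandwidth_matrix
```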

acollaborator commented 3 years ago

Writing this comment is part of my process of learning about infinity fabric connections and throughput.

~81.8 GB/s of internal throughput is similar to a 4-memory-channel Threadripper measurement Ian Cutress gathered using the Windows utility AIDA64 at https://www.anandtech.com/bench/product/2766?vs=2631

Here is a summary of Threadripper results by number of populated DDR4 channels from Anandtech.

Model   Channels  Read     Write    Copy    (MB/s)
3995WX  8         155264   152976   165361
3990X   4          85076    82518    81110
3995WX  2          45469    41743    44813

Zen 2 consists of groups of 1 to 4 cores that share distinct 16 MB Level 3 caches. Each distinct Level 3 cache can transfer at a rate equivalent to one DDR4 channel (25,600 MB/s at DDR4-3200). Each chiplet (CCD) has two 16 MB L3 caches, and each of those caches is shared by 1 to 4 cores (depending on SKU). Therefore, the maximum throughput of a chiplet (CCD) is ~50 GB/s. Perhaps one CCD (a chiplet with 1 to 8 cores depending on SKU) is the equivalent of the smallest definable NUMA node.
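Spelling out that arithmetic (just a restatement of the numbers above; the per-CCD figure is an estimate):

$$
3200\ \text{MT/s} \times 8\ \text{B/transfer} = 25.6\ \text{GB/s per DDR4-3200 channel},
\qquad
2 \times 25.6\ \text{GB/s} \approx 51.2\ \text{GB/s per CCD}
$$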

If I am understanding what I have read, PCIe data transfer from any single Zen2 core with 1600 MHz infinity fabric has a ceiling of ~22 GB/s. You mentioned sdk.io reaching 11 GB/s using one core. Neat. Pedal to the almost bare metal.

Resources I am reading to understand this:
https://en.wikichip.org/wiki/File:amd_if_ifop_link_example.svg
https://en.wikichip.org/wiki/amd/infinity_fabric
https://developer.amd.com/wp-content/resources/56502_1.00-PUB.pdf
https://developer.amd.com/wp-content/resources/56745_0.80.pdf
https://developer.amd.com/wp-content/resources/56949_1.0.pdf

tanelpoder commented 3 years ago

Good info thanks. I guess I'll need a 32-core CPU someday (once I can buy one off eBay cheaply).

acollaborator commented 3 years ago

I have been staring at the P620 with the 12 core in my cart. This is for a home lab so I am also being thrifty.

tanelpoder commented 3 years ago

Yeah, even the "only" 12-core CPU is a beast! And you'll get an all-core 4.0 GHz nominal speed, as I understand it. Btw, some people praise how quiet it is; I would say that it's not that quiet when it's non-idle. But maybe I've gotten used to my laptop settings (I have turned Turbo Boost off on my MBP when I'm working on regular stuff that doesn't need speed, so the fans usually don't go wild much).

acollaborator commented 3 years ago

Well, I did it. I purchased a P620. I chose a pre-configured 12-core CPU system because it was close to what I selected in the configurator and was cheaper.

Some folks have complained about a growling/rattling characteristic of the power supply fan. Lenovo will replace the PSU if that is bothersome. https://forums.lenovo.com/t5/ThinkStation-Workstations/Rattling-sound-in-P620/m-p/5067891?page=2#5279604

I might replace the front fan with a 92mm Noctua, but one has to be mindful that the Noctua's maximum airflow is about half of the factory fan. https://noctua.at/en/nf-a9-pwm/specification

I ran prime95 and hwinfo to observe power utilization of the 12-core CPU under load. I observed 150 W which is a relief since the AMD CPU marketing specification page lists 280 W. https://www.amd.com/en/products/cpu/amd-ryzen-threadripper-pro-3945wx

I was disappointed to learn that Lenovo configures the CPU to only work on the P620. I was hoping to sell the CPU in the future to offset buying a higher core model. https://forums.lenovo.com/t5/ThinkStation-Workstations/P620-CPU-Locked-to-Motherboard-or-Vendor-Lenono/m-p/5070226?page=1#5279504

I have two 16 GB DIMMs and will probably just add two more for now given our previous discussion about 4 infinity fabric channels to core complexes in 12/16 core TR Pro.

tanelpoder commented 3 years ago

Cool, thanks for the research & details. I definitely heard some fan rattling from my machine too. Currently it's in a different room, so it's not a problem. I didn't know about the CPU limitation; I was thinking of buying a 32-core model someday too, but it would suck if I couldn't use the old CPU with another vendor's motherboard. I guess the power use is only 150 W as many of the cores/cache are disabled (though I'm not sure why they'd have to spec it as 280 W then). Or maybe when running AVX2 instructions on it, you'd see more power usage? (I don't know if prime95 already uses AVX2 or not...)

tanelpoder commented 3 years ago

When I have time for more experiments, I could remove 4 of the 8 RDIMMs and run similar performance tests again. But I would also have to adjust fio (or any I/O benchmark) to actually touch the memory lines of the blocks just read; currently it's just PCIe -> I/O complex -> DRAM traffic, I think, with no DRAM-to-CPU-cache traffic.

Shekelme commented 3 years ago

But what if these NVMe drives are bundled into a software RAID-0 array? What's the best way to get the best performance out of such an array?

tanelpoder commented 3 years ago

In short, modern Linux kernels handle millions of IOPS via LVM/software striping/RAID well.

I'm actually running some database workloads on striped Linux software LVM already (RAID-0 style). Software mirroring could be achieved with the Linux MD module (mdadm). One thing to keep in mind, especially on somewhat older kernels, is that you'd need to use multi-queue (mq) I/O handling to avoid having all CPUs contending for a single spinlock per device. NVMe devices are always multi-queue on Linux, but other block devices (SCSI, SATA) and virtual block devices (DM) need to have it enabled.
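As a rough sketch of both approaches (device names, stripe count and stripe size are placeholders, not my actual layout):

```bash
# LVM striping across 4 NVMe devices (RAID-0 style); adjust devices and sizes to your setup:
sudo pvcreate /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
sudo vgcreate fastvg /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
sudo lvcreate --stripes 4 --stripesize 64k --extents 100%FREE --name fastlv fastvg

# Alternatively, an md RAID-0 array with mdadm (mirroring would use --level=1 or --level=10):
sudo mdadm --create /dev/md0 --level=0 --raid-devices=4 \
           /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
```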

On the newest kernels, multi-queue is enabled by default for SCSI/DM too, but there's a range of older versions (probably up to 4.x something) that require setting these kernel boot parameters to Y:

scsi_mod.use_blk_mq=y dm_mod.use_blk_mq=y
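On those older kernels, the usual way to set these (a sketch; exact paths and commands vary by distro) is via the GRUB kernel command line:

```bash
# 1) Append the parameters to GRUB_CMDLINE_LINUX in /etc/default/grub, e.g.:
#    GRUB_CMDLINE_LINUX="... scsi_mod.use_blk_mq=y dm_mod.use_blk_mq=y"
# 2) Regenerate the GRUB config and reboot:
sudo grub2-mkconfig -o /boot/grub2/grub.cfg   # RHEL/OL-style
# sudo update-grub                            # Debian/Ubuntu-style
sudo reboot
```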

When I get to writing a part 2 for this post, I'll cover that too.

cityvigil commented 3 years ago

I've now learned to turn off Turbo Boost with this trick.

Thanks. Summer is coming :)

aozgaa commented 2 years ago

I purchased a 12-core CPU setup with the linked ASUS carrier card and 4x Samsung 980 Pro (1 TB) drives.

Unfortunately I am having some trouble replicating your results on the single disk benchmark.

After installing the carrier card in PCIe Slot 1 and setting x4x4x4x4 and PCIe 4.0 in the BIOS as suggested, I am still seeing 8 GT/s (downgraded), i.e. PCIe 3.0, in the LnkSta field of lspci -vv output, as well as a corresponding slowdown in the single-block-device benchmark.
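For completeness, this is roughly how I'm checking it (the PCI address below is a placeholder for one of the NVMe functions):

```bash
# Find the NVMe functions and their PCI addresses:
lspci | grep -i nvme

# Compare the advertised (LnkCap) vs. negotiated (LnkSta) link speed and width:
sudo lspci -vv -s 01:00.0 | grep -E 'LnkCap:|LnkSta:'
```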

I'm a bit at a loss as to how to diagnose the PCIe downgrade.

Any thoughts on why the downgrading persists despite these BIOS settings? Did you maybe set some other BIOS settings in addition to those mentioned in the article?

aozgaa commented 2 years ago

(I should also say, thank you for the well-written article!)

tanelpoder commented 2 years ago

Thanks @aozgaa, I had to change two settings in BIOS - see this screenshot.

One setting was the PCIe bifurcation to x4x4x4x4 and the other one (where you see the dropdown menu open) was the Link Speed, which I had to set from Auto to 16 GT/s.

tanelpoder commented 2 years ago

Oh, I just re-read your post and saw that you did already choose PCIe 4.0 / 16 GT/s instead of PCIe 3.0 or Auto from the BIOS menu (right?)

tanelpoder commented 2 years ago

I guess the first question is whether you're sure you have the right card (PCIe 4.0), as ASUS also makes a similar PCIe 3.0 card...

aozgaa commented 2 years ago

The ASUS card I got was this one.

It seems to be PCIe 4.0 compatible. In a RAID arrangement they claim 256 Gbps across 16 lanes, which is nearly the limit of the PCIe 4.0 spec according to the sidebar on Wikipedia.
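As a back-of-the-envelope check of that figure (using the published PCIe 4.0 per-lane rate and 128b/130b encoding):

$$
16\ \text{GT/s} \times 16\ \text{lanes} = 256\ \text{Gbit/s raw},
\qquad
256 \times \tfrac{128}{130} \approx 252\ \text{Gbit/s} \approx 31.5\ \text{GB/s usable}
$$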

aozgaa commented 2 years ago

By modifying settings in BIOS I can deliberately downgrade to PCIe3.0 or even 2.0.

I can also confirm that if I don't set the x4x4x4x4 bifurcation, the drives are not detected at all (thanks for figuring this out to begin with :) ).

aozgaa commented 2 years ago

And to be explicit, yes, I picked PCIe4.0 and x4x4x4x4 in the BIOS menu for slot 1.

aozgaa commented 2 years ago

Okay, I have a solution, though it was obtained by bumbling about randomly and I'm not sure of the root cause.

I updated the BIOS firmware to the latest version from Lenovo's Thinkstation BIOS page, specifically s07sf23usa.zip, version S07KT23A, released 29 Sep 2021.

I now get consistent IOPS and bandwidth results for your onessd.sh benchmark script!

One note: with the BIOS update there is a new option in the settings for Data Link Layer support, which I left at its default (enabled).

tanelpoder commented 2 years ago

Good to know, thanks! Yeah, maybe I got lucky, as updating the BIOS to the latest version was one of the first things I did when I got my server. Although I guess your initial BIOS was newer than what my server shipped with when I received it last year. I was going to recommend that you try a different PCIe slot, suspecting some link negotiation signaling issue...

edgecase14 commented 2 years ago

I wonder about Linux software RAID - from what I've read, neither mdraid nor dmraid is blk-mq or multi-queue "aware". I feel like RAID would be a prerequisite to using these drives in a server.

tanelpoder commented 2 years ago

Newer Linux kernels support multi-queue for Device Mapper (dm) devices if the relevant kernel module is loaded (configured using dm_mod.use_blk_mq=y).

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/7.2_release_notes/storage
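A quick way to sanity-check what a given box is doing (a sketch; the module parameters only exist on kernels that still have the legacy, non-blk-mq code path):

```bash
# Module-level toggles on older kernels (absent on new kernels where blk-mq is the only path):
cat /sys/module/scsi_mod/parameters/use_blk_mq 2>/dev/null
cat /sys/module/dm_mod/parameters/use_blk_mq 2>/dev/null

# NVMe is always multi-queue; the scheduler list confirms blk-mq schedulers are in use:
cat /sys/block/nvme0n1/queue/scheduler
```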

So, you can do software mirroring (or RAID-10 style mirroring + striping) on Linux with multi-queue support. But when you want all the enterprise features & bells-n-whistles and shared, remotely accessible storage, then you'd need a properly engineered storage solution when dealing with more than just one server... I recently did some hands-on tech analysis of Silk's platform (a commercial product); it's pretty clever how they pull all those ephemeral local NVMe SSDs of cloud instances into one big, reliable enterprise datastore.

mqudsi commented 2 years ago

@tanelpoder I came across this post for a second time and wanted to share with you this Netflix engineering article from 2017 about what it took to saturate a 100Gbps link, in case you hadn't come across it before.

It's FreeBSD not Linux, but many of the same issues you ran into are described in great detail (sudden drop in performance after everything is "fine" for a good amount of time, lock contention, global and per-thread locks, kernel management of free pages, etc.) and I thought you'd find it a fun read.

tanelpoder commented 2 years ago

Thanks @mqudsi, yeah I'm aware of that article - I actually bought 2 used 100 GbE NICs off eBay for some network testing too :-)

HighBubble commented 2 years ago

Hello @tanelpoder, many thanks for this post. Just a simple note: the io_uring library is not yet available in RHEL 8.4 (Bug 1881561 - Add io_uring support), but with your recommendations the results are much better even with plain libaio. Thanks a lot.
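For reference, the libaio fallback I used was roughly of this shape (device path, depth and job count are placeholders, not my exact job file):

```bash
# io_uring is unavailable on this kernel, so fall back to the libaio engine:
sudo fio --name=libaio-randread --filename=/dev/nvme0n1 --ioengine=libaio \
         --direct=1 --rw=randread --bs=4k --iodepth=32 --numjobs=8 \
         --time_based --runtime=30 --group_reporting
```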

therealkevinc commented 2 years ago

When is the SLOB testing going to happen? :)

tanelpoder commented 2 years ago

Thanks @HighBubble for the comment & feedback! I was using the RHEL clone Oracle Enterprise Linux (with Oracle's newer kernel), so I didn't hit this bug.

tanelpoder commented 2 years ago

Hi @therealkevinc, I have Postgres I/O testing (including with your "SLOB for Postgres") in my plans, but I'm heavily behind schedule right now!