Open utterances-bot opened 3 years ago
Bad ass!
What are the chances of a Windows system being optimized similarly and attaining similar results? It goes without saying that I'm totally ignorant when it comes to Linux, unfortunately.
I do have similar hardware and would love to know if (and how) a Windows 10 system can be optimized in similar ways.
Awesome deep-dive! Very interesting read, looking forward to those future articles!
Hi Tanel,
Thanks for sharing this analysis. I was wondering if you'd tried the test without the --fixedbufs option? On my system, I'm seeing a marginal latency decrease when using fixed buffers; however, I don't have enough devices or CPU threads to approach my machine's memory bandwidth capacity. Do you know to what degree, if any, your performance is affected if --fixedbufs is removed?
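For reference, here's a minimal sketch of such an A/B comparison. The device path and job parameters are assumptions for illustration, not the article's exact settings, so adjust them for your own system:

```shell
#!/bin/bash
# Sketch: compare io_uring random-read performance with and without fixed
# (pre-registered) buffers. DEV and the job parameters are assumptions.
DEV=/dev/nvme0n1

run_fio() {
  # $1 = extra flag ("--fixedbufs" or empty), $2 = job label
  echo "=== $2 ==="
  fio --name="$2" --filename="$DEV" --direct=1 --rw=randread --bs=4k \
      --ioengine=io_uring --iodepth=32 --numjobs=1 --runtime=10 \
      --time_based --group_reporting $1
}

if command -v fio >/dev/null 2>&1; then
  run_fio ""            "plain-iouring"
  run_fio "--fixedbufs" "fixedbufs-iouring"
else
  echo "fio not installed; the two runs differ only by the --fixedbufs flag"
fi
```

Comparing the reported latency percentiles and IOPS between the two runs should show whether buffer registration matters at your queue depths.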
Well done! Learned a good deal from your experience and excellent article: mainly the placement of the hyper cards as an important concern. Lenovo’s forum shows some concerns with 2TB ssds— available then pulled due to an issue. I’ve got the asus hyper card for PCIe v4 and will soon have some 980 Pro (tho I don’t like the change in hand from the 970 pro).
Thank you for sharing so much great information in this millennium.
Does the 256 GB in your P620 consist of 4 x 64 GB or 8 x 32 GB RDIMMs?
I suspect the 12-core 3945WX and the 16-core 3955WX are constructed like the Epyc 7272 and 7282, which have 4 memory channels from the SoC I/O controller to the cores rather than 8, because they have 2 CCDs per package instead of 4. The L3 cache size of 64 MB is the same among all of these. If this is the case, I think that 100 GB/s throughput to main memory rather than 200 GB/s for these SKUs is a fine value trade-off. I just wish I could find reference material from AMD that describes whether this is the case.
Here is some discussion related to this: https://www.servethehome.com/amd-epyc-7002-rome-cpus-with-half-memory-bandwidth/
I used 8 x 32 GB DIMMs. Yep I have seen this article too and it makes me wonder. Once the 32 core CPUs are available via direct channels (cheaper), I'm sure I'll be tempted to get one to get to full memory bandwidth. I have yet to run a test where I actually touch the memory lines after loading to cache, that may tell us more about the internal memory bandwidth/performance. Thanks for the comment!
I guess I could test this by just taking 4 DIMMs out and testing if the performance/throughput drops noticeably.
I ran mlc in a video about this experiment. I haven't checked how thorough & optimized this tool is, but it reported ~81.8 GB/s of internal throughput: https://youtu.be/5A531KE8O9Q?t=3833
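For anyone wanting to reproduce a similar measurement, here is a hedged sketch using Intel's Memory Latency Checker. It assumes the mlc binary has already been downloaded from Intel and is on the PATH (some of its tests also want root for huge-page allocation):

```shell
#!/bin/bash
# Sketch: measure main-memory bandwidth with Intel Memory Latency Checker.
# Assumes the "mlc" binary is downloaded from Intel and on the PATH.
if command -v mlc >/dev/null 2>&1; then
  mlc --max_bandwidth      # peak bandwidth under various read/write mixes
  mlc --bandwidth_matrix   # bandwidth between NUMA nodes
else
  echo "mlc not found; download the free binary from Intel's website"
fi
```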
Writing this comment is part of my process of learning about infinity fabric connections and throughput.
~81.8 GB/s of internal throughput is similar to a 4-memory-channel Threadripper measurement Ian Cutress gathered using the Windows utility AIDA64 at https://www.anandtech.com/bench/product/2766?vs=2631
Here is a summary of Threadripper results by number of populated DDR4 channels from Anandtech.
Model    Channels   Read     Write    Copy    (MB/s)
3995WX   8          155264   152976   165361
3990X    4          85076    82518    81110
3995WX   2          45469    41743    44813
Zen 2 consists of groups of 1 to 4 cores that share distinct 16 MB L3 caches. Each distinct L3 cache can transfer at a rate equivalent to one DDR4 channel (25.6 GB/s at DDR4-3200). Each chiplet (CCD) has two 16 MB L3 caches, and each of those caches is shared by 1 to 4 cores (depending on SKU). Therefore, the maximum throughput of a chiplet (CCD) is ~50 GB/s. Perhaps one CCD (a chiplet with 1 to 8 cores depending on SKU) is the equivalent of the smallest definable NUMA node.
If I am understanding what I have read, PCIe data transfer from any single Zen2 core with 1600 MHz infinity fabric has a ceiling of ~22 GB/s. You mentioned sdk.io reaching 11 GB/s using one core. Neat. Pedal to the almost bare metal.
Resources I am reading to understand this:
https://en.wikichip.org/wiki/File:amd_if_ifop_link_example.svg
https://en.wikichip.org/wiki/amd/infinity_fabric
https://developer.amd.com/wp-content/resources/56502_1.00-PUB.pdf
https://developer.amd.com/wp-content/resources/56745_0.80.pdf
https://developer.amd.com/wp-content/resources/56949_1.0.pdf
Good info thanks. I guess I'll need a 32-core CPU someday (once I can buy one off eBay cheaply).
I have been staring at the P620 with the 12 core in my cart. This is for a home lab so I am also being thrifty.
Yeah, even the "only" 12-core CPU is a beast! And you'll get all-core 4.0 GHz nominal speed, as I understand. Btw, some people praise how quiet it is; I would say that it's not that quiet when it's non-idle. But maybe I've gotten used to my laptop settings (I have turned Turbo Boost off on my MBP when I'm working on regular stuff that doesn't need speed, so the fans usually don't go wild much).
Well, I did it. I purchased a P620. I chose a pre-configured 12-core CPU system because it was close to what I selected in the configurator and was cheaper.
Some folks have complained about a growling/rattling characteristic of the power supply fan. Lenovo will replace the PSU if that is bothersome. https://forums.lenovo.com/t5/ThinkStation-Workstations/Rattling-sound-in-P620/m-p/5067891?page=2#5279604
I might replace the front fan with a 92mm Noctua, but one has to be mindful that the Noctua's maximum airflow is about half of the factory fan. https://noctua.at/en/nf-a9-pwm/specification
I ran prime95 and hwinfo to observe power utilization of the 12-core CPU under load. I observed 150 W which is a relief since the AMD CPU marketing specification page lists 280 W. https://www.amd.com/en/products/cpu/amd-ryzen-threadripper-pro-3945wx
I was disappointed to learn that Lenovo configures the CPU to only work on the P620. I was hoping to sell the CPU in the future to offset buying a higher core model. https://forums.lenovo.com/t5/ThinkStation-Workstations/P620-CPU-Locked-to-Motherboard-or-Vendor-Lenono/m-p/5070226?page=1#5279504
I have two 16 GB DIMMs and will probably just add two more for now given our previous discussion about 4 infinity fabric channels to core complexes in 12/16 core TR Pro.
Cool, thanks for the research & details. I definitely heard some fan rattling from my machine too. Currently it's in a different room, so it's not a problem. I didn't know about the CPU limitation; I was thinking of buying a 32-core model someday too, but it would suck if I can't use the old CPU with another vendor's motherboard. I guess the power use is only 150 W because many of the cores/cache are disabled (though I'm not sure why they'd spec it as 280 W then). Or maybe you'd see more power usage when running AVX2 instructions? (I don't know whether prime95 already uses AVX2...)
When I have time for more experiments, I could remove 4 of the 8 RDIMMs and run similar performance tests again. But I would have to adjust fio (or any I/O benchmark) to actually touch the memory lines of the blocks just read too; currently it's just PCIe -> I/O complex -> DRAM traffic, I think, and there's no DRAM-to-CPU-cache traffic.
But what if these NVMe drives are bundled into software RAID-0? What's the best way to get the best performance out of such an array?
In short, modern Linux kernels handle millions of IOPS well via LVM striping or software RAID.
I'm actually running some database workloads on striped Linux software LVM already (RAID-0 style). Software mirroring could be achieved with the Linux MD driver (mdadm). One thing to keep in mind, especially on somewhat older kernels, is that you'd need to use multi-queue (blk-mq) I/O handling to avoid having all CPUs contend for a single spinlock per device. NVMe devices are always multi-queue on Linux, but other block devices (SCSI, SATA) and virtual block devices (DM) need to have it enabled.
On the newest kernels these settings are enabled by default for SCSI/DM too, but there's a range of older versions (probably up to 4.x something) that require setting these kernel boot parameters:
scsi_mod.use_blk_mq=y dm_mod.use_blk_mq=y
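A quick way to check whether these parameters are in effect on a given kernel is to read the module parameter files under /sys; on newer, always-multi-queue kernels the files simply don't exist anymore:

```shell
#!/bin/bash
# Sketch: check whether the legacy use_blk_mq module parameters exist and
# what they are set to. Absence usually means the kernel is blk-mq only.
for p in /sys/module/scsi_mod/parameters/use_blk_mq \
         /sys/module/dm_mod/parameters/use_blk_mq; do
  if [ -r "$p" ]; then
    echo "$p = $(cat "$p")"
  else
    echo "$p not present (kernel is likely blk-mq only)"
  fi
done
```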
When I get to writing a part 2 for this post, I'll cover that too.
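For the RAID-0 question above, a minimal mdadm striping sketch might look like the following. The device names and chunk size are assumptions, and mdadm --create is destructive, so the script only prints the commands unless you explicitly set DRYRUN=0 and run it as root:

```shell
#!/bin/bash
# Sketch: build a 4-drive striped (RAID-0) md array out of NVMe namespaces.
# Device names and chunk size are assumptions -- verify before running.
DRYRUN=${DRYRUN:-1}
DEVS="/dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1"

run() { if [ "$DRYRUN" = "1" ]; then echo "would run: $*"; else "$@"; fi; }

# 512 KiB chunk is a common starting point for large sequential I/O
run mdadm --create /dev/md0 --level=0 --raid-devices=4 --chunk=512K $DEVS
run mkfs.xfs /dev/md0
run mdadm --detail /dev/md0
```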
I learned to turn off Turbo Boost with this trick now.
Thanks. Summer is coming :)
I purchased a 12-core setup with the linked ASUS carrier card and 4x Samsung 980 Pro (1 TB) drives.
Unfortunately I am having some trouble replicating your results on the single disk benchmark.
After installing the carrier card in PCIe slot 1 and setting x4x4x4x4 and PCIe 4.0 in the BIOS as suggested, I am still seeing 8 GT/s (downgraded) (i.e. PCIe 3.0) in the LnkSta field of lspci -vv output, as well as a corresponding slowdown in the single block device benchmark.
I'm a bit at a loss as to how to diagnose the PCIe downgrade.
Any thoughts on why the downgrading persists despite these BIOS settings? Did you maybe set some other BIOS settings in addition to those mentioned in the article?
(I should also say, thank you for the well-written article!)
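One way to diagnose a downgrade like this is to compare each NVMe device's advertised (LnkCap) vs. negotiated (LnkSta) link speed and width; 16GT/s in LnkSta indicates a clean PCIe 4.0 link. A sketch (root is typically needed for lspci to show the capability fields):

```shell
#!/bin/bash
# Sketch: show max vs. negotiated PCIe link speed for all NVMe-class devices.
# PCI class 0108 = NVMe controller.
if command -v lspci >/dev/null 2>&1; then
  devs=$(lspci -Dd ::0108 | awk '{print $1}')
  if [ -z "$devs" ]; then
    echo "no NVMe-class (0108) PCI devices found"
  else
    for dev in $devs; do
      echo "== $dev =="
      lspci -vv -s "$dev" | grep -E 'LnkCap:|LnkSta:'
    done
  fi
else
  echo "lspci (pciutils) not installed"
fi
```

If LnkSta shows a lower speed than LnkCap, the link negotiated down, which can point at the slot, riser/carrier card, BIOS settings, or signal integrity.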
Thanks @aozgaa, I had to change two settings in BIOS - see this screenshot.
One setting was the PCIe bifurcation to x4x4x4x4, and the other one (where you see the dropdown menu open) was the Link Speed, which I had to set from Auto to 16 GT/s.
Oh, I just re-read your post and saw that you did already choose PCIe 4.0 / 16 GT/s instead of PCIe 3.0 or Auto from the BIOS menu (right?)
I guess the first question is: are you sure you have the right card (PCIe 4.0)? ASUS also makes a similar PCIe 3.0 card...
By modifying settings in BIOS I can deliberately downgrade to PCIe3.0 or even 2.0.
I can also confirm that if I don't set the x4x4x4x4 bifurcation, the drives are not detected at all (thanks for figuring this out to begin with :) ).
And to be explicit, yes, I picked PCIe4.0 and x4x4x4x4 in the BIOS menu for slot 1.
Okay, I have a solution, though it was obtained by bumbling about randomly and I'm not sure of the root cause.
I updated the BIOS firmware to the latest version from Lenovo's ThinkStation BIOS page, specifically s07sf23usa.zip, version S07KT23A, released 29 Sep 2021.
I now get consistent IOPS and bandwidth results from your onessd.sh benchmark script!
One note: with the BIOS update there is a new option in the settings for Data Link Layer support, which I left at its default setting (enabled).
Good to know, thanks! Yeah, maybe I got lucky, as updating the BIOS to the latest version was one of the first things I did when I got my server. Although I guess your initial BIOS was newer than what I had in my server when I received it last year. I was going to recommend trying a different PCIe slot, suspecting some link negotiation signaling issue...
I wonder about Linux software RAID - from what I read, neither mdraid nor dmraid is blk-mq / multi-queue "aware". I feel like RAID would be a prerequisite to using these in a server.
Newer Linux kernels support multi-queue for Device Mapper (dm) devices if the relevant kernel module is loaded (configured using dm_mod.use_blk_mq=y).
So, you can do software mirroring (or RAID-10 style mirror+striping) on Linux with multi-queues. But when you want all the enterprise features & bells-n-whistles and shared, remotely accessible storage, then you'd need a properly engineered storage solution when dealing with more than just one server... I recently did some hands-on tech analysis of Silk's platform (a commercial product); it's pretty clever in how it pools all those ephemeral local NVMe SSDs of cloud instances into one big, reliable enterprise datastore.
@tanelpoder I came across this post for a second time and wanted to share with you this Netflix engineering article from 2017 about what it took to saturate a 100Gbps link, in case you hadn't come across it before.
It's FreeBSD not Linux, but many of the same issues you ran into are described in great detail (sudden drop in performance after everything is "fine" for a good amount of time, lock contention, global and per-thread locks, kernel management of free pages, etc.) and I thought you'd find it a fun read.
Thanks @mqudsi, yeah I'm aware of that article - I actually bought 2 used 100 GBe NICs off eBay for some network testing too :-)
Hello @tanelpoder, many thanks for this post. Just a simple note: the io_uring library is not yet available in RHEL 8.4 (Bug 1881561 - Add io_uring support), but with your recommendations the results are much better even with plain libaio. Thanks a lot.
When is the SLOB testing going to happen? :)
Thanks @HighBubble for the comment & feedback! I was using the RHEL clone Oracle Enterprise Linux (with Oracle's newer kernel), I didn't hit this bug.
Hi @therealkevinc, I have Postgres I/O testing (including with your "SLOB for Postgres") in my plans, but heavily behind the schedule right now!
Achieving 11M IOPS & 66 GB/s IO on a Single ThreadRipper Workstation | Tanel Poder Consulting
TL;DR Modern disks are so fast that system performance bottleneck shifts to RAM access and CPU. With up to 64 cores, PCIe 4.0 and 8 memory channels, even a single-socket AMD ThreadRipper Pro workstation makes a hell of a powerful machine - if you do it right! Introduction In this post I’ll explain how I configured my AMD ThreadRipper Pro workstation with 10 PCIe 4.0 SSDs to achieve 11M IOPS with 4kB random reads and 66 GiB/s throughput with larger IOs - and what bottlenecks & issues I fixed to get there. - Linux, Oracle, SQL performance tuning and troubleshooting - consulting & training.
https://tanelpoder.com/posts/11m-iops-with-10-ssds-on-amd-threadripper-pro-workstation/