MadMax not scaling to 32 core 2990wx from 16 core 1950x at all. Performance regresses.

Furious-George commented 3 years ago

(edited for clarity)

I had a 1950x and recently upgraded to a 2990wx. The former is 16/32, the latter is 32/64 and one generation newer. I was previously completing a plot every 1550 seconds using: -r 32 -t /nvme/ -2 /tmpfs/.

I was not expecting to double output, but I was expecting improvement, perhaps in the 50% range. Instead, performance regressed: a plot now completes in 2100 seconds using -r 64, about 35% longer than before.

Here are some results from different systems for comparison. In all cases the OS is Debian and the nvme are Samsung EVO:

Reference System #1

Specs: TR 1950x: 16c/32t @ 3.85 ghz / 128 gb RAM @ 3000 mhz
Parallel Jobs: 1
Thread Multiplier: 32
Temp Dir Media: nvme
Temp2 Dir Media: tmpfs
Average Plot Time: 1550s

Test System - Test 1

Specs: TR 2990wx: 32c/64t @ 3.85 ghz / 128 gb RAM @ 3000 mhz
Parallel Jobs: 1
Thread Multiplier: 64
Temp Dir Media: nvme
Temp2 Dir Media: tmpfs
Average Plot Time: 2100s

Interestingly, reducing the thread multiplier improves performance:

Test System - Test 2

Specs: TR 2990wx: 32c/64t @ 3.85 ghz / 128 gb RAM @ 3000 mhz
Parallel Jobs: 1
Thread Multiplier: 32
Temp Dir Media: nvme
Temp2 Dir Media: tmpfs
Average Plot Time: 1550s

So reducing the multiplier to 32 increases output to the 1950x levels. I've also tried doubling it to 128, which also seemed to help, but I did not run a full plot. Importantly, the 1950x is utilizing 90%+ of its CPU (per htop) at peak during plotting, while the 2990wx hovers around 45-50%.
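To put the three single-job results on a common footing, the plot times can be converted to plots per day. This is a minimal sketch in Python that uses only the times quoted above; the function names and labels are illustrative and not part of MadMax.

```python
SECONDS_PER_DAY = 86_400


def plots_per_day(plot_seconds: float) -> float:
    """Convert an average plot time in seconds to plots per day (single job)."""
    return SECONDS_PER_DAY / plot_seconds


def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# Average plot times reported in the post, in seconds.
results = {
    "1950x, -r 32": 1550,
    "2990wx, -r 64": 2100,
    "2990wx, -r 32": 1550,
}

baseline = results["1950x, -r 32"]
for name, seconds in results.items():
    time_delta = percent_change(baseline, seconds)
    rate_delta = percent_change(plots_per_day(baseline), plots_per_day(seconds))
    print(f"{name}: {plots_per_day(seconds):.1f} plots/day, "
          f"{time_delta:+.1f}% plot time, {rate_delta:+.1f}% throughput")
```

By this measure the -r 64 run takes about 35% longer per plot, which works out to about 26% fewer plots per day than the 1950x.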

As a result, I've been trying parallel plotting. I don't have enough RAM to use tmpfs twice for temp2, but I have 4 x nvme that I can use for both temp and temp2 dirs. We can compare the results to a much older (in computer years) reference system which is also running parallel jobs, albeit with 2 x tmpfs. In both cases I assign numa groups accordingly:

Reference System 2

Specs: 2 x Xeon E5-2987 v2: 2x12c/24t @ 3.00 ghz / 256 gb RAM @ 1866 mhz
Parallel Jobs: 2
Thread Multiplier 1: 20
Thread Multiplier 2: 20
Temp Dir Media 1: nvme
Temp Dir Media 2: nvme
Temp2 Dir Media 1: tmpfs
Temp2 Dir Media 2: tmpfs
Average Plot Time: 1200s

Test System - Test 3

Specs: TR 2990wx: 32c/64t @ 3.85 ghz / 128 gb RAM @ 3000 mhz
Parallel Jobs: 2
Thread Multiplier 1: 30
Thread Multiplier 2: 30
Temp Dir Media 1: nvme
Temp Dir Media 2: nvme
Temp2 Dir Media 1: nvme
Temp2 Dir Media 2: nvme
Average Plot Time: 1200s

That's better; however, the CPU again seems underutilized relative to the dual Xeons. It peaks at around 85%, per htop, whereas the dual Xeons are in the 90s, as was the 1950x. The 2990wx is probably nominally faster than the Xeons, but it should be well ahead: it has 33% more threads, a higher clock rate, faster RAM, and is several years newer. That said, if I doubled the RAM and ran both jobs with -2 on tmpfs, I suspect the plot times would improve significantly, and perhaps the CPU would be better utilized.

I've tried three parallel jobs, e.g. with tmpfs and an ssd_array for the third, but so far there is a performance hit every way I've tried it.

While using 4 nvme drives for temp and temp2 mostly gets around the underperformance issues, it is logistically less convenient than using nvme for temp and tmpfs for temp2, both for me and, I suspect, for most other people in a similar position. Plotting 2 x in parallel (or maybe 4 x in parallel, see below) with tmpfs for every temp2 is probably the optimal configuration for this chipset, but that requires 256 gb of DDR4 (or 512 gb for 4 jobs).

With respect to why this apparent underperformance in non-parallel plotting is happening, I have not tested on other operating systems. I suspect the problem, however, has something to do with the chip architecture and how that translates to numa groups. The following is just speculation and above my paygrade:

The 2990wx is basically 4 processors (dies) in one package. However, only two memory busses exist for those 4 cpus, and I wonder if this might be part of the problem. This design has apparently been known to cause problems with scheduling in Windows, but not so much in Linux.

With that said, there seems to be something funny with the numa groups in Linux anyway:

available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39
node 0 size: 64386 MB
node 0 free: 63512 MB
node 1 cpus: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55
node 1 size: 0 MB
node 1 free: 0 MB
node 2 cpus: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47
node 2 size: 64454 MB
node 2 free: 62807 MB
node 3 cpus: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63
node 3 size: 0 MB
node 3 free: 0 MB
node distances:
node   0   1   2   3
  0:  10  16  16  16
  1:  16  10  16  16
  2:  16  16  10  16
  3:  16  16  16  10

Above you see that there are 4 numa groups, but only two of them (nodes 0 and 2) have memory attached, as described previously. In order to use the cores on cpu nodes that don't have corresponding memory nodes, I must invoke the cores explicitly, rather than by cpu node. Let's say I want to use only the cores associated with cpu node 1:

# numactl --physcpubind=16-23,48-55 stress -c 16
stress: info: [3945] dispatching hogs: 16 cpu, 0 io, 0 vm, 0 hdd
^C

(works as expected, except I expected --physcpubind to map to the 4 physical chiplets, not the 64 logical cores)

# numactl --cpunodebind=1  stress -c 16
libnuma: Warning: node argument 1 is out of range

(not allowed when there is not a corresponding memory node)

If I set --cpunodebind=0, that will not use the cores on cpu node 1 at all:

# numactl --cpunodebind=0  stress -c 16
stress: info: [4101] dispatching hogs: 16 cpu, 0 io, 0 vm, 0 hdd
^C

(works as expected, but for cores on node 0 only)

--cpunodebind=0,1 also fails with '1 is out of range'
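Since the memoryless nodes can only be addressed by explicit core lists, those lists can be derived from the numactl --hardware output instead of being typed by hand. The sketch below parses the text shown above (node distances trimmed) and prints a --physcpubind argument per node. The helper names parse_nodes and to_ranges are mine and are not part of numactl or MadMax, and whether pinning this way helps plot times is untested.

```python
import re

# Copied from the `numactl --hardware` output above (distance table omitted).
NUMACTL_HARDWARE = """\
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39
node 0 size: 64386 MB
node 0 free: 63512 MB
node 1 cpus: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55
node 1 size: 0 MB
node 1 free: 0 MB
node 2 cpus: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47
node 2 size: 64454 MB
node 2 free: 62807 MB
node 3 cpus: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63
node 3 size: 0 MB
node 3 free: 0 MB
"""


def parse_nodes(text):
    """Return {node_id: {"cpus": [...], "size_mb": int}} from numactl text."""
    nodes = {}
    for line in text.splitlines():
        m = re.match(r"node (\d+) cpus:\s*(.*)", line)
        if m:
            cpus = [int(c) for c in m.group(2).split()]
            nodes.setdefault(int(m.group(1)), {})["cpus"] = cpus
            continue
        m = re.match(r"node (\d+) size: (\d+) MB", line)
        if m:
            nodes.setdefault(int(m.group(1)), {})["size_mb"] = int(m.group(2))
    return nodes


def to_ranges(cpus):
    """Collapse a CPU list into a range string such as '16-23,48-55'."""
    if not cpus:
        return ""
    cpus = sorted(cpus)
    ranges = []
    start = prev = cpus[0]
    for c in cpus[1:]:
        if c != prev + 1:          # gap found: close the current run
            ranges.append((start, prev))
            start = c
        prev = c
    ranges.append((start, prev))   # close the final run
    return ",".join(f"{a}-{b}" if a != b else str(a) for a, b in ranges)


for node_id, info in sorted(parse_nodes(NUMACTL_HARDWARE).items()):
    kind = "memoryless" if info["size_mb"] == 0 else "has local memory"
    print(f"node {node_id} ({kind}): --physcpubind={to_ranges(info['cpus'])}")
```

For node 1 this reproduces the 16-23,48-55 list used in the stress test above, and it flags nodes 1 and 3 as the ones without local memory.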

Again, just speculating, but perhaps this funny kernel behavior is not being accounted for by MadMax, and that is why there is no benefit, and actually a fair amount of cost, to doubling the core count when not plotting in parallel.

To be clear, all cores do get work when using -r 64, but they hover around 45-50%, as if MadMax is sending 50% of the necessary workload, which is getting scheduled evenly across all the cores.

I updated MadMax to latest before posting, but experienced no change.

Furious-George commented 3 years ago

Updated OP for clarity

itsme112358 commented 3 years ago

I suspect that your system starts swapping at that point. tmp2 requires ~110 GB (of your RAM) leaving only 18GB for your system. Every thread can use ~500MB afaik, so when you use 64 threads that makes 32 GB. Maybe try using smaller buckets?
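The arithmetic behind this estimate can be sketched as follows. The 110 GB tmpfs footprint and the 500 MB per thread are the rough figures from the comment above, not measured values, and the function name is illustrative. OS and page-cache overhead are ignored.

```python
def memory_budget_gb(total_ram_gb, tmpfs_gb, threads, per_thread_gb):
    """Return (estimated demand, shortfall) in GB.

    A positive shortfall means the estimated demand exceeds installed RAM.
    """
    demand = tmpfs_gb + threads * per_thread_gb
    return demand, demand - total_ram_gb


# Assumed figures: 128 GB installed, ~110 GB for temp2 on tmpfs,
# ~0.5 GB per plotter thread.
for threads in (32, 64):
    demand, shortfall = memory_budget_gb(128, 110, threads, 0.5)
    status = "exceeds RAM" if shortfall > 0 else "fits"
    print(f"-r {threads}: ~{demand:.0f} GB needed vs 128 GB installed "
          f"({status} by {abs(shortfall):.0f} GB)")
```

Under these assumptions, -r 32 lands just under 128 GB while -r 64 overshoots by about 14 GB, which is why swapping is a plausible first suspect.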

therealflinchy commented 3 years ago

.... damn i never even bothered to check since i swapped

i swapped from my 2990 before madmax existed, and obviously it was a little better for parallel plotting/staggering (still not great, as i'm sure you're aware)

i never even thought about how the 1950x would go with madmax, having no NUMA nodes.

i did get 100% thread usage throughout phase 1 in htop/system monitor though.

from memory i was getting about 1400s a plot at best on my 2990 @ 4ghz, i had no idea i could simply get that from my 1950x too..

honestly, except for very very specific niche usage, the 2990 is just a garbage CPU and i regret buying it, for basically exactly the same price as a way faster 5950x setup.

i've been doing ~4 minute K29 plots for chives, which is also pretty garbage, but i'm done plotting now. i'll be selling and buying a 5950 before re-replotting for chia again lol.

(quoting itsme112358) "I suspect that your system starts swapping at that point. tmp2 requires ~110 GB (of your RAM) leaving only 18GB for your system. Every thread can use ~500MB afaik, so when you use 64 threads that makes 32 GB. Maybe try using smaller buckets?"

i had 192gb of RAM and it was basically the same performance as OP at best, maybe a hair faster but nothing worth being excited for... from memory was like 26-28 minutes a plot at best.

Furious-George commented 2 years ago

Hi.

It's definitely not an issue with swapping.

I doubled the RAM and tried plotting entirely in RAM, and plot times using 62 threads were still similar to or perhaps worse than those of my 1950x. There was no swap enabled.

I also reinstalled my OS for lack of any better ideas.

Other workloads do not experience this drop in performance, so I'm fairly certain at this point that the problem is with MadMax.

@therealflinchy The 1950X does have two numa nodes. The 2950X will be nominally faster, but not by much. The difference between the 1950X/2950X and the 2990WX is that the latter has two chiplets that do not have direct access to the RAM. The cores on these are sometimes known as "compute cores". All Zen 1 and 2 chipsets only support 2 numa memory nodes, so in order to get more than 2 chiplets on the platform, this compromise was necessary. Zen 3 allows for 4 nodes.

That said, for most workloads you don't see a lot of additional overhead on the compute cores, and in theory you should see great plot times. In practice, it seems to be the same as or worse than the CPUs with half as many chiplets.

Furious-George commented 2 years ago

I updated MadMax to latest, and started plotting again, completely in RAM on all servers. As such, I'm able to make an apples-to-apples comparison:

2990wx:

Number of Threads: 60
Number of Buckets P1:    2^8 (256)
Number of Buckets P3+P4: 2^8 (256)
Working Directory:   /tmpfs/
Working Directory 2: /tmpfs/
Plot Name: plot-k32-2021-12-28-02-08-66d86c8f3bc9b4eaa89c48b2ef9bf068f8a567a41482e94d89cff25fbe29b237
[P1] Table 1 took 11.916 sec
[P1] Table 2 took 82.9813 sec, found 4294852923 matches
[P1] Table 3 took 100.88 sec, found 4294779441 matches
[P1] Table 4 took 115.192 sec, found 4294680045 matches
[P1] Table 5 took 116.095 sec, found 4294392988 matches
[P1] Table 6 took 115.467 sec, found 4293848326 matches
[P1] Table 7 took 88.7658 sec, found 4292789541 matches
Phase 1 took 631.31 sec
[P2] max_table_size = 4294967296
[P2] Table 7 scan took 7.11696 sec
[P2] Table 7 rewrite took 35.0174 sec, dropped 0 entries (0 %)
[P2] Table 6 scan took 39.0611 sec
[P2] Table 6 rewrite took 47.2855 sec, dropped 581372700 entries (13.5397 %)
[P2] Table 5 scan took 37.0227 sec
[P2] Table 5 rewrite took 51.96 sec, dropped 762134041 entries (17.7472 %)
[P2] Table 4 scan took 29.1986 sec
[P2] Table 4 rewrite took 39.4493 sec, dropped 828962215 entries (19.3021 %)
[P2] Table 3 scan took 27.8952 sec
[P2] Table 3 rewrite took 47.5046 sec, dropped 855125027 entries (19.9108 %)
[P2] Table 2 scan took 27.236 sec
[P2] Table 2 rewrite took 39.1883 sec, dropped 865601907 entries (20.1544 %)
Phase 2 took 442.355 sec
Wrote plot header with 268 bytes
[P3-1] Table 2 took 56.7152 sec, wrote 3429251016 right entries
[P3-2] Table 2 took 33.7497 sec, wrote 3429251016 left entries, 3429251016 final
[P3-1] Table 3 took 50.5708 sec, wrote 3439654414 right entries
[P3-2] Table 3 took 34.7599 sec, wrote 3439654414 left entries, 3439654414 final
[P3-1] Table 4 took 50.789 sec, wrote 3465717830 right entries
[P3-2] Table 4 took 35.4172 sec, wrote 3465717830 left entries, 3465717830 final
[P3-1] Table 5 took 51.2625 sec, wrote 3532258947 right entries
[P3-2] Table 5 took 34.8323 sec, wrote 3532258947 left entries, 3532258947 final
[P3-1] Table 6 took 51.8092 sec, wrote 3712475626 right entries
[P3-2] Table 6 took 35.2032 sec, wrote 3712475626 left entries, 3712475626 final
[P3-1] Table 7 took 59.2067 sec, wrote 4292789541 right entries
[P3-2] Table 7 took 42.0758 sec, wrote 4292789541 left entries, 4292789541 final
Phase 3 took 540.775 sec, wrote 21872147374 entries to final plot
[P4] Starting to write C1 and C3 tables
[P4] Finished writing C1 and C3 tables
[P4] Writing C2 table
[P4] Finished writing C2 table
Phase 4 took 253.26 sec, final plot size is 108805506829 bytes
Total plot creation time was 1867.76 sec (31.1293 min)

Dual E5-2997 V2:

Number of Threads: 46
Number of Buckets P1:    2^8 (256)
Number of Buckets P3+P4: 2^8 (256)
Working Directory:   /tmpfs/
Working Directory 2: /tmpfs/
Plot Name: plot-k32-2021-12-28-01-54-60239de38ec96512d6e1de1eedd123119d6bee3d40092cfe190f6f0cd8397251
[P1] Table 1 took 8.98374 sec
[P1] Table 2 took 86.2145 sec, found 4294922988 matches
[P1] Table 3 took 99.5499 sec, found 4294824733 matches
[P1] Table 4 took 113.48 sec, found 4294632323 matches
[P1] Table 5 took 111.36 sec, found 4294150263 matches
[P1] Table 6 took 108.755 sec, found 4293372916 matches
[P1] Table 7 took 87.9674 sec, found 4291763123 matches
Phase 1 took 616.336 sec
[P2] max_table_size = 4294967296
[P2] Table 7 scan took 5.73197 sec
[P2] Table 7 rewrite took 43.346 sec, dropped 0 entries (0 %)
[P2] Table 6 scan took 24.304 sec
[P2] Table 6 rewrite took 29.1287 sec, dropped 581477302 entries (13.5436 %)
[P2] Table 5 scan took 20.5568 sec
[P2] Table 5 rewrite took 27.2551 sec, dropped 762200459 entries (17.7497 %)
[P2] Table 4 scan took 20.3755 sec
[P2] Table 4 rewrite took 26.6998 sec, dropped 829074120 entries (19.3049 %)
[P2] Table 3 scan took 19.8067 sec
[P2] Table 3 rewrite took 26.2138 sec, dropped 855174621 entries (19.9117 %)
[P2] Table 2 scan took 19.0411 sec
[P2] Table 2 rewrite took 57.3419 sec, dropped 865652869 entries (20.1553 %)
Phase 2 took 339.322 sec
Wrote plot header with 268 bytes
[P3-1] Table 2 took 45.7961 sec, wrote 3429270119 right entries
[P3-2] Table 2 took 23.3227 sec, wrote 3429270119 left entries, 3429270119 final
[P3-1] Table 3 took 43.8826 sec, wrote 3439650112 right entries
[P3-2] Table 3 took 23.9396 sec, wrote 3439650112 left entries, 3439650112 final
[P3-1] Table 4 took 44.7279 sec, wrote 3465558203 right entries
[P3-2] Table 4 took 23.4904 sec, wrote 3465558203 left entries, 3465558203 final
[P3-1] Table 5 took 45.3682 sec, wrote 3531949804 right entries
[P3-2] Table 5 took 23.8252 sec, wrote 3531949804 left entries, 3531949804 final
[P3-1] Table 6 took 45.7044 sec, wrote 3711895614 right entries
[P3-2] Table 6 took 24.8254 sec, wrote 3711895614 left entries, 3711895614 final
[P3-1] Table 7 took 46.8918 sec, wrote 4291763123 right entries
[P3-2] Table 7 took 29.154 sec, wrote 4291763123 left entries, 4291763123 final
Phase 3 took 426.057 sec, wrote 21870086975 entries to final plot
[P4] Starting to write C1 and C3 tables
[P4] Finished writing C1 and C3 tables
[P4] Writing C2 table
[P4] Finished writing C2 table
Phase 4 took 56.4867 sec, final plot size is 108792606718 bytes
Total plot creation time was 1438.29 sec (23.9714 min)

Bottom line is that the 2990wx takes about 30% longer per plot than the dual E5-2997 V2 (1867.76 sec versus 1438.29 sec), despite the latter being over 8 years old.
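To see where the gap comes from, the per-phase times can be pulled out of the two logs and compared side by side. This is a small sketch that assumes the "Phase N took ... sec" and "Total plot creation time was ... sec" line formats shown in the logs above; only those summary lines are copied in, and the function name is illustrative.

```python
import re

# Summary lines copied from the two logs above.
LOG_2990WX = """\
Phase 1 took 631.31 sec
Phase 2 took 442.355 sec
Phase 3 took 540.775 sec, wrote 21872147374 entries to final plot
Phase 4 took 253.26 sec, final plot size is 108805506829 bytes
Total plot creation time was 1867.76 sec (31.1293 min)
"""

LOG_XEON = """\
Phase 1 took 616.336 sec
Phase 2 took 339.322 sec
Phase 3 took 426.057 sec, wrote 21870086975 entries to final plot
Phase 4 took 56.4867 sec, final plot size is 108792606718 bytes
Total plot creation time was 1438.29 sec (23.9714 min)
"""

PHASE_RE = re.compile(r"^Phase (\d) took ([\d.]+) sec", re.MULTILINE)
TOTAL_RE = re.compile(r"^Total plot creation time was ([\d.]+) sec", re.MULTILINE)


def phase_times(log):
    """Return {'Phase 1': seconds, ..., 'Total': seconds} parsed from a log."""
    times = {f"Phase {n}": float(t) for n, t in PHASE_RE.findall(log)}
    times["Total"] = float(TOTAL_RE.search(log).group(1))
    return times


tr = phase_times(LOG_2990WX)
xeon = phase_times(LOG_XEON)
for key in tr:
    ratio = tr[key] / xeon[key]
    print(f"{key}: {tr[key]:8.2f} s vs {xeon[key]:8.2f} s  ({ratio:.2f}x)")
```

On these two runs Phase 1 is nearly identical on both machines, while Phase 4 is about 4.5 times slower on the 2990wx and accounts for almost half of the total gap (roughly 197 of 429 seconds).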

I've reached times as low as 1300 seconds with a 1950x, when pushing the overclock, and using tmpfs for -2 and nvme for -t. In that case, having half the cores allowed me to push the OC a bit more, and not needing as much RAM allowed for faster RAM speeds.

I updated MadMax before running the test.

madMAx43v3r / chia-plotter