Raspberry Pi 5 4GB is faster (sometimes significantly so) than the 8GB version

Brunnis commented 11 months ago

Describe the bug The Raspberry Pi 5 4GB performs slightly better (0-10%) than the 8GB version at default 2.4 GHz and the gap widens to >100% for certain workloads when overclocked. These workloads will see a dramatic reduction in performance when overclocking the 8GB board. This reduction in performance is not present at all on the 4GB board.

It's unclear to me whether the small performance difference at default clock frequency has the same root cause as the more dramatic one that emerges as the ARM core frequency is increased. For the time being I'm treating them as related and reporting on both in this issue.

To reproduce I have found two workloads in particular that expose the issue: Geekbench 5 "Text Rendering" multi-core sub-test and stress-ng "numa" stressor. To reproduce this issue, I suggest benchmarking the 4GB and 8GB boards at both 2.4 GHz and 2.8 GHz.

Geekbench 5 is available here: https://www.geekbench.com/preview/

To run stress-ng with profiling info:

Install stress-ng and the "perf" tool: sudo apt install stress-ng linux-perf linux-perf-dbgsym
Run the following to enable perf to work with stress-ng: sudo sh -c 'echo 1 >/proc/sys/kernel/perf_event_paranoid'
Perform the test and note the bogo ops/s value: stress-ng --numa 4 --numa-ops 1000 --metrics --perf

Expected behaviour Ideally the 4GB and 8GB boards would perform close to the same. The smaller difference at stock frequencies could be deemed normal/expected (for example due to different RAM ICs being used), but the dramatically increasing gap in performance as ARM core frequency increases suggests something may be misbehaving.

Actual behaviour The 4GB board is anywhere from a few to more than 100% faster than the 8GB board, depending on clock frequency and workload. Below is a summary of tests I've run. As can be seen, the 4GB uses Samsung RAM and the 8GB uses Micron RAM.

Pi 5 4GB vs 8GB v2

These benchmark results are completely reproducible. I've also looked at other people's submission of Geekbench 5 results and can see the same reduction in "Text Rendering" scores on overclocked 8GB boards (but not on overclocked 4GB boards), so this is not limited to my specimen.

Below are the Geekbench 5 results at 2.4 and 2.8 GHz for the runs listed in the table above.

4GB (2400 MHz): https://browser.geekbench.com/v5/cpu/22028307 8GB (2400 MHz): https://browser.geekbench.com/v5/cpu/22028116 4GB (2800 MHz): https://browser.geekbench.com/v5/cpu/22028479 8GB (2800 MHz): https://browser.geekbench.com/v5/cpu/22028225

Below are "perf" tool output for the stress-ng runs at 2.4 and 2.8 GHz for both boards:

4GB (2.4 GHz):

stress-ng --numa 4 --numa-ops 1000 --metrics --perf
stress-ng: info:  [3278] defaulting to a 86400 second (1 day, 0.00 secs) run per stressor
stress-ng: info:  [3278] dispatching hogs: 4 numa
stress-ng: info:  [3279] numa: system has 1 of a maximum 4 memory NUMA nodes
stress-ng: metrc: [3278] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [3278]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [3278] numa               1000      8.44     33.33      0.05       118.48          29.96        98.85          7040
stress-ng: info:  [3278] numa:
stress-ng: info:  [3278]             81,220,668,668 CPU Cycles                     9.480 B/sec
stress-ng: info:  [3278]              4,757,399,712 Instructions                   0.555 B/sec (0.059 instr. per cycle)
stress-ng: info:  [3278]                    258,736 Branch Misses                 30.199 K/sec ( 0.000%)
stress-ng: info:  [3278]                122,043,476 Stalled Cycles Frontend       14.244 M/sec
stress-ng: info:  [3278]             76,572,843,124 Stalled Cycles Backend         8.937 B/sec
stress-ng: info:  [3278]             81,230,134,176 Bus Cycles                     9.481 B/sec
stress-ng: info:  [3278]              4,349,256,396 Cache References               0.508 B/sec
stress-ng: info:  [3278]                  1,719,788 Cache Misses                   0.201 M/sec ( 0.040%)
stress-ng: info:  [3278]              4,436,001,520 Cache L1D Read                 0.518 B/sec
stress-ng: info:  [3278]                  1,745,244 Cache L1D Read Miss            0.204 M/sec ( 0.039%)
stress-ng: info:  [3278]              1,235,273,488 Cache L1I Read                 0.144 B/sec
stress-ng: info:  [3278]                    665,872 Cache L1I Read Miss           77.718 K/sec
stress-ng: info:  [3278]                  1,549,288 Cache LL Read                  0.181 M/sec
stress-ng: info:  [3278]                  1,187,048 Cache LL Read Miss             0.139 M/sec (76.619%)
stress-ng: info:  [3278]              4,343,855,792 Cache DTLB Read                0.507 B/sec
stress-ng: info:  [3278]                  5,006,956 Cache DTLB Read Miss           0.584 M/sec ( 0.115%)
stress-ng: info:  [3278]              1,160,470,704 Cache BPU Read                 0.135 B/sec
stress-ng: info:  [3278]                    220,720 Cache BPU Read Miss           25.762 K/sec ( 0.019%)
stress-ng: info:  [3278]             33,889,417,812 CPU Clock                      3.955 B/sec
stress-ng: info:  [3278]             33,889,294,388 Task Clock                     3.955 B/sec
stress-ng: info:  [3278]                      1,060 Page Faults Total            123.720 /sec 
stress-ng: info:  [3278]                      1,060 Page Faults Minor            123.720 /sec 
stress-ng: info:  [3278]                          0 Page Faults Major              0.000 /sec 
stress-ng: info:  [3278]                        228 Context Switches              26.611 /sec 
stress-ng: info:  [3278]                        160 Cgroup Switches               18.675 /sec 
stress-ng: info:  [3278]                          0 CPU Migrations                 0.000 /sec 
stress-ng: info:  [3278]                          0 Alignment Faults               0.000 /sec 
stress-ng: info:  [3278]                          0 Emulation Faults               0.000 /sec 
stress-ng: info:  [3278] successful run completed in 8.57s

8GB (2.4 GHz):

stress-ng --numa 4 --numa-ops 1000 --metrics --perf
stress-ng: info:  [2825] defaulting to a 86400 second (1 day, 0.00 secs) run per stressor
stress-ng: info:  [2825] dispatching hogs: 4 numa
stress-ng: info:  [2826] numa: system has 1 of a maximum 4 memory NUMA nodes
stress-ng: metrc: [2825] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [2825]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [2825] numa               1000      9.33     37.08      0.06       107.20          26.93        99.51          7072
stress-ng: info:  [2825] numa:
stress-ng: info:  [2825]             87,844,236,208 CPU Cycles                     9.284 B/sec
stress-ng: info:  [2825]              4,775,378,096 Instructions                   0.505 B/sec (0.054 instr. per cycle)
stress-ng: info:  [2825]                    199,680 Branch Misses                 21.103 K/sec ( 0.000%)
stress-ng: info:  [2825]                148,631,000 Stalled Cycles Frontend       15.708 M/sec
stress-ng: info:  [2825]             83,139,642,756 Stalled Cycles Backend         8.786 B/sec
stress-ng: info:  [2825]             87,849,436,396 Bus Cycles                     9.284 B/sec
stress-ng: info:  [2825]              4,360,769,624 Cache References               0.461 B/sec
stress-ng: info:  [2825]                  3,826,552 Cache Misses                   0.404 M/sec ( 0.088%)
stress-ng: info:  [2825]              4,326,723,104 Cache L1D Read                 0.457 B/sec
stress-ng: info:  [2825]                  3,818,744 Cache L1D Read Miss            0.404 M/sec ( 0.088%)
stress-ng: info:  [2825]              1,205,851,084 Cache L1I Read                 0.127 B/sec
stress-ng: info:  [2825]                    688,460 Cache L1I Read Miss           72.759 K/sec
stress-ng: info:  [2825]                  3,388,484 Cache LL Read                  0.358 M/sec
stress-ng: info:  [2825]                  2,846,120 Cache LL Read Miss             0.301 M/sec (83.994%)
stress-ng: info:  [2825]              4,366,253,912 Cache DTLB Read                0.461 B/sec
stress-ng: info:  [2825]                  5,043,928 Cache DTLB Read Miss           0.533 M/sec ( 0.116%)
stress-ng: info:  [2825]              1,179,947,616 Cache BPU Read                 0.125 B/sec
stress-ng: info:  [2825]                    194,724 Cache BPU Read Miss           20.579 K/sec ( 0.017%)
stress-ng: info:  [2825]             36,669,122,012 CPU Clock                      3.875 B/sec
stress-ng: info:  [2825]             36,668,453,648 Task Clock                     3.875 B/sec
stress-ng: info:  [2825]                      1,064 Page Faults Total            112.447 /sec 
stress-ng: info:  [2825]                      1,064 Page Faults Minor            112.447 /sec 
stress-ng: info:  [2825]                          0 Page Faults Major              0.000 /sec 
stress-ng: info:  [2825]                        360 Context Switches              38.046 /sec 
stress-ng: info:  [2825]                        360 Cgroup Switches               38.046 /sec 
stress-ng: info:  [2825]                          0 CPU Migrations                 0.000 /sec 
stress-ng: info:  [2825]                          0 Alignment Faults               0.000 /sec 
stress-ng: info:  [2825]                          0 Emulation Faults               0.000 /sec 
stress-ng: info:  [2825] successful run completed in 9.46s

4GB (2.8 GHz):

stress-ng --numa 4 --numa-ops 1000 --metrics --perf
stress-ng: info:  [2897] defaulting to a 86400 second (1 day, 0.00 secs) run per stressor
stress-ng: info:  [2897] dispatching hogs: 4 numa
stress-ng: info:  [2898] numa: system has 1 of a maximum 4 memory NUMA nodes
stress-ng: metrc: [2897] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [2897]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [2897] numa               1000      8.43     33.43      0.06       118.66          29.86        99.35          7024
stress-ng: info:  [2897] numa:
stress-ng: info:  [2897]             94,804,086,272 CPU Cycles                    11.088 B/sec
stress-ng: info:  [2897]              4,743,732,748 Instructions                   0.555 B/sec (0.050 instr. per cycle)
stress-ng: info:  [2897]                    171,072 Branch Misses                 20.008 K/sec ( 0.000%)
stress-ng: info:  [2897]                141,005,824 Stalled Cycles Frontend       16.491 M/sec
stress-ng: info:  [2897]             90,175,677,856 Stalled Cycles Backend        10.546 B/sec
stress-ng: info:  [2897]             94,811,739,556 Bus Cycles                    11.089 B/sec
stress-ng: info:  [2897]              4,351,381,748 Cache References               0.509 B/sec
stress-ng: info:  [2897]                  2,356,576 Cache Misses                   0.276 M/sec ( 0.054%)
stress-ng: info:  [2897]              4,363,258,428 Cache L1D Read                 0.510 B/sec
stress-ng: info:  [2897]                  2,354,384 Cache L1D Read Miss            0.275 M/sec ( 0.054%)
stress-ng: info:  [2897]              1,205,560,624 Cache L1I Read                 0.141 B/sec
stress-ng: info:  [2897]                    579,892 Cache L1I Read Miss           67.821 K/sec
stress-ng: info:  [2897]                  2,040,828 Cache LL Read                  0.239 M/sec
stress-ng: info:  [2897]                  1,692,944 Cache LL Read Miss             0.198 M/sec (82.954%)
stress-ng: info:  [2897]              4,372,534,272 Cache DTLB Read                0.511 B/sec
stress-ng: info:  [2897]                  5,064,992 Cache DTLB Read Miss           0.592 M/sec ( 0.116%)
stress-ng: info:  [2897]              1,175,805,832 Cache BPU Read                 0.138 B/sec
stress-ng: info:  [2897]                    140,216 Cache BPU Read Miss           16.399 K/sec ( 0.012%)
stress-ng: info:  [2897]             33,905,985,500 CPU Clock                      3.965 B/sec
stress-ng: info:  [2897]             33,905,692,244 Task Clock                     3.965 B/sec
stress-ng: info:  [2897]                      1,064 Page Faults Total            124.440 /sec 
stress-ng: info:  [2897]                      1,064 Page Faults Minor            124.440 /sec 
stress-ng: info:  [2897]                          0 Page Faults Major              0.000 /sec 
stress-ng: info:  [2897]                        224 Context Switches              26.198 /sec 
stress-ng: info:  [2897]                        204 Cgroup Switches               23.859 /sec 
stress-ng: info:  [2897]                          0 CPU Migrations                 0.000 /sec 
stress-ng: info:  [2897]                          0 Alignment Faults               0.000 /sec 
stress-ng: info:  [2897]                          0 Emulation Faults               0.000 /sec 
stress-ng: info:  [2897] successful run completed in 8.55s

8GB (2.8 GHz):

stress-ng --numa 4 --numa-ops 1000 --metrics --perf
stress-ng: info:  [2065] defaulting to a 86400 second (1 day, 0.00 secs) run per stressor
stress-ng: info:  [2065] dispatching hogs: 4 numa
stress-ng: info:  [2066] numa: system has 1 of a maximum 4 memory NUMA nodes
stress-ng: metrc: [2065] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [2065]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [2065] numa               1000     19.53     77.52      0.19        51.19          12.87        99.45          7024
stress-ng: info:  [2065] numa:
stress-ng: info:  [2065]            213,041,661,036 CPU Cycles                    10.802 B/sec
stress-ng: info:  [2065]              4,850,988,464 Instructions                   0.246 B/sec (0.023 instr. per cycle)
stress-ng: info:  [2065]                    252,376 Branch Misses                 12.796 K/sec ( 0.000%)
stress-ng: info:  [2065]                635,563,820 Stalled Cycles Frontend       32.225 M/sec
stress-ng: info:  [2065]            207,857,257,028 Stalled Cycles Backend        10.539 B/sec
stress-ng: info:  [2065]            213,076,738,404 Bus Cycles                    10.804 B/sec
stress-ng: info:  [2065]              4,413,222,720 Cache References               0.224 B/sec
stress-ng: info:  [2065]                 67,889,660 Cache Misses                   3.442 M/sec ( 1.538%)
stress-ng: info:  [2065]              4,424,491,460 Cache L1D Read                 0.224 B/sec
stress-ng: info:  [2065]                 67,486,816 Cache L1D Read Miss            3.422 M/sec ( 1.525%)
stress-ng: info:  [2065]              1,239,622,236 Cache L1I Read                62.852 M/sec
stress-ng: info:  [2065]                  1,194,376 Cache L1I Read Miss           60.558 K/sec
stress-ng: info:  [2065]                 73,587,652 Cache LL Read                  3.731 M/sec
stress-ng: info:  [2065]                 67,162,612 Cache LL Read Miss             3.405 M/sec (91.269%)
stress-ng: info:  [2065]              4,374,670,156 Cache DTLB Read                0.222 B/sec
stress-ng: info:  [2065]                  5,522,360 Cache DTLB Read Miss           0.280 M/sec ( 0.126%)
stress-ng: info:  [2065]              1,194,694,580 Cache BPU Read                60.574 M/sec
stress-ng: info:  [2065]                    210,600 Cache BPU Read Miss           10.678 K/sec ( 0.018%)
stress-ng: info:  [2065]             76,637,242,788 CPU Clock                      3.886 B/sec
stress-ng: info:  [2065]             76,635,993,260 Task Clock                     3.886 B/sec
stress-ng: info:  [2065]                      1,064 Page Faults Total             53.948 /sec 
stress-ng: info:  [2065]                      1,064 Page Faults Minor             53.948 /sec 
stress-ng: info:  [2065]                          0 Page Faults Major              0.000 /sec 
stress-ng: info:  [2065]                        556 Context Switches              28.191 /sec 
stress-ng: info:  [2065]                        536 Cgroup Switches               27.177 /sec 
stress-ng: info:  [2065]                          0 CPU Migrations                 0.000 /sec 
stress-ng: info:  [2065]                          0 Alignment Faults               0.000 /sec 
stress-ng: info:  [2065]                          0 Emulation Faults               0.000 /sec 
stress-ng: info:  [2065] successful run completed in 19.72s

The 8GB 2.8 GHz result sticks out when compared to the same board running at 2.4 GHz, due to:

Instructions per cycle is reduced to less than half: 0.050 vs 0.023 instr. per cycle
Stalled Cycles Frontend (per second) doubles: 16.491 M/sec vs 32.225 M/sec
Cache L1D Read Miss increases from 275 000 per second to 3 422 000 per second
Cache LL Read & Cache LL Read Miss go from 239 M/sec & 198 M/sec to 3.7 M/sec & 3.4 M/sec

Finally, I should mention that RAM bandwidth and latency tests do not show any issues.

System raspinfo output: https://gist.github.com/Brunnis/4d8242cf757f28e1d5331b3f73b3a446

popcornmix commented 3 months ago

We have been in contact with Micron, and there is some good news. They have said they actually test their 8GB sdram with the 4GB refresh rate timing (rather than the slower jedec timings), and so it will be safe to run the 8GB parts with 4GB timing.

This should remove the lower performance observed with 8GB Pi5 compared to 4GB.

Latest pi5 bootloader firmware contains this update (and can get got with rpi-update).

Brunnis commented 3 months ago

@popcornmix That's great. I ran some tests on my 4GB and 8GB boards, using the same image. On average, there were no particular performance discrepancies in Geekbench 5, Passmark or Google Octane v2. There is some variability in the scores between runs and typically the 4GB peaks a bit higher, though.

The only test were there's still a bit more of a difference appears to be Geekbench 6. It's not something I personally would worry about, but reporting it for completeness:

https://browser.geekbench.com/v6/cpu/compare/7160591?baseline=7159462

popcornmix commented 3 months ago

It's interesting that the single core GB6 results also shows the difference. In general there is 4x the sdram bandwidth for multi-core compared to single core, so if sdram is the bottleneck it tends to show more significantly in the multicore results.

The scores for tests with little sdram bandwidth tend to get 4x with multicore. Ray Tracer is a good example of this with 3.98x. The scores for tests that are limited by sdram bandwidth tend to get closer to 1x (or even less due to thrashing). File Compression is a good example of this with .89x.

HTML5 Browser is a surprising result. It is somewhat sdram limited with a multicore speedup of 1.64x. It shows a single core score 17% higher on 4GB vs 8GB. It shows a multi core score 1.7% higher on 4GB vs 8GB.

If it is sdram causing this difference (which seems a reasonable conclusion based on the sdram size being the main differentiator) it seem surprising the difference is much greater in the less loaded, single core sdram case.

It would be interesting if you are able to run this a couple more times and confirm if the 17% vs 1.7% numbers are consistent across runs.

popcornmix commented 3 months ago

Thinking a bit more, the sdram is dual rank, with the top address bit choosing the rank. There is some benefit to accessing both ranks (in terms of keeping sdram pages open).

So a test that spans ranks (i.e. top and bottom halves of memory) may be beneficial.

We know that GB6 fails to run on a 2GB Pi with out-of-memory, and does run on a 4GB Pi. That suggests it (at least in some of its component tests) may utilise both ranks on a 4GB Pi but not an 8GB Pi.

We have actually been working on a scheme to try to avoid this type of inefficient behaviour due to precise locations of buffers. See https://github.com/raspberrypi/linux/pull/6273.

If you are feeling bold, then

sudo rpi-update pulls/6273

should get you the numa feature which should mitigate some of these effects (and produce better benchmark scores, and performance of the system in general). Perhaps give it a go. Backing up first, or testing on a fresh install may be advisable.

by commented 3 months ago

Just on a side note: with this new firmware, my Pi5 now runs completely stable at 3 GHz; before, 2.9 GHz was the maximum for this very board of mine.

Brunnis commented 3 months ago

I've tested the emulated NUMA feature now and it's highly effective. The performance of both the 4GB and 8GB models is improved, but the 8GB significantly more so. With the NUMA feature enabled, both boards perform in a very consistent manner and there is no measureable performance difference between them.

The GB6 results are shown below. I did 5 runs on each configuration. The non-NUMA results are with the latest firmware that uses the same memory timings for the 4GB and 8GB. force_turbo=1 was used. The results match with what you've seen, i.e. +6% in single and +18% in multi composite scores.

bild

Below is a comparison between non-NUMA and NUMA on the 8GB board. As can be seen, several sub-tests increase by ~30%. There are no regressions.

https://browser.geekbench.com/v6/cpu/compare/7189117?baseline=7173406

Out of interest, are you able to give any more background to this? It sounds like this would be something that could affect other platforms as well. Would this be beneficial for non Pi hardware or is it dealing with some peculiarity in your SoCs? And is this not needed or already mitigated on the x86 side?

EDIT: Are there any known negatives of applying this emulated NUMA feature?

EDIT 2: Running some other stuff as well, performance seems to be improved overall. Geekbench 5 is also much faster. Google Octane v2 web browser test is now at over 31k, where previously it was 29k. I have some older results for the WebGL Aquarium test, so not sure if something else has changed, but it now runs at 60 FPS with 1000 fish at 1080p fullscreen. Previously it was ~38 FPS.

popcornmix commented 3 months ago

Good to hear you are seeing the benefit. Every test I've run performs better with NUMA (the extent of the gain depends on how sdram bandwidth constrained the use case was to begin with).

Pi4 and Pi5 show similar benefits.

There is currently a minor issue with NUMA that means the CMA size is limited to a single NUMA region. e.g. CMA>=512M will fail on a 4GB Pi5 with 8 numa regions. That doesn't effect default RPiOS settings, but may if you've manually increased it.

Pi5 doesn't really need CMA, as it has MMUs in front of all hardware blocks (e.g. hevc or hvs), so can use system memory. Pi4 doesn't have the MMUs, so does still need CMA (and 512M may be needed for 4K hevc playback).

A solution for CMA + NUMA is being investigated (and once found will hopefully lead to PR being merged).

I'd expect it to affect some other platforms. I think it will help any "obvious" implementations of sdram controller. Some sdram controllers have some cleverness in, where they scramble the address lines coming in to break up the structure (note, it needs more that just reordering address lines), and they are effectively doing a similar thing to the NUMA interleave in hardware. I'd expect most x86 sdram controllers to be pretty clever.

Brunnis commented 3 months ago

Thanks for the additional info.

Also, I retested the WebGL Aquarium sample. Turns out it runs equally well without NUMA. Maybe some other SW optimization since I last tested? However, I can confirm that compilation runs faster. Compiling dosbox-staging is 10.5 % faster on the 8GB board with NUMA enabled.

raspberrypi / firmware

Raspberry Pi 5 4GB is faster (sometimes significantly so) than the 8GB version #1854