philipturner / metal-benchmarks

Apple GPU microarchitecture
MIT License
454 stars 16 forks

Breaking down core type is also important #5

Open alecazam opened 1 week ago

alecazam commented 1 week ago

I like to use a big/med/little notation due to Android, but iOS only has big/little (power/efficiency). It would be useful if your stats broke down the CPU core counts. For example, A7 through A10 can't actually run big and little cores together; it's either-or. The A11 was the first that could. Thanks for the great stats.

Otherwise, saying A7-A10 have 4 cores is erroneous. They really are dual-core systems, with the A11 being the first quad-core due to simultaneous use of power + efficiency cores.

These are the CPU stats (performance/efficiency):

- A7-A9: 2/0
- A10: 2/2 (still dual-core; can't use P + E together)
- A11-A17: 2/4

- M1 family: 4/4; 6/2 and 8/2; 8/2; 16/4
- M2 family: 4/4; 6/4 and 8/4; 8/4; 16/8
- M3 family: 4/4; 5/4 and 5/4; 10/4 and 12/4; no Ultra
- M4 (iPad): 3/6 and 4/6 (similar MT perf to M3 Pro)

And apologies, but maybe the "Cores" listed are for the GPU?

philipturner commented 1 week ago

These are GPU cores. I use the word "core" to refer to both (performance-)CPU and GPU cores, because I discovered some striking similarities at the hardware level. Both Firestorm (performance) CPU cores and GPU cores can read 32 bytes per core-cycle from main memory. They also consume a similar amount of area on the silicon die. At a fundamental level, they are remarkably similar. It's unfortunate that other vendors call them "SM" or "CU" when this is their true nature.

> Otherwise saying A7-A10 have 4 cores is erroneous. They really are dual core systems, with A11 being the first quad core due to simultaneous use of power + efficiency cores.

Efficiency cores are basically half of a CPU core. Instead of a 512-bit vector execution width, they are 256-bit. This is just like how Intel decreases the number of ALUs on their lower-end CPUs, from 2x256-bit (AVX2, 512 bits of throughput) to 2x128-bit (SSE, 256 bits)*. Some of the earlier PowerVR core designs would be equivalent to an "efficiency" GPU core, with half the ALU count of modern cores. For example, the A10X with 12 "cores" is effectively an M1 chip with 6 GPU cores and the FP32 throughput nerfed by 2x. Run the numbers, and you can predict the A10X's FP32 GFLOPS by dividing 2617 GFLOPS by 2x, then by 8/6x.
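Running those numbers (a minimal sketch; the 2617 GFLOPS figure for the 8-core M1 GPU is the one quoted above):

```python
# Back-of-envelope A10X estimate, per the reasoning above.
# Assumptions: M1 GPU peak of 2617 FP32 GFLOPS across 8 cores; the A10X
# behaves like 6 M1-style GPU cores at half the FP32 throughput.
m1_gpu_gflops = 2617.0

a10x_estimate = m1_gpu_gflops / 2 * (6 / 8)  # halve throughput, scale 8 cores down to 6
print(round(a10x_estimate))  # ~981 GFLOPS
```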

*Their AVX-512 server Xeon cores are 2x512-bit (actually 1024 bits of throughput per core-cycle, in SIMD vector speak).

Modern GPU cores with 128 32-bit ALUs are 4096-bit (128 x 32-bit = 4096-bit). But they run at roughly a third of the clock speed (GHz) of CPUs with 512-bit width. Instead of 8x the GFLOPS of a CPU core, they are more like 2.67x. Let's validate this model.
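One way to run the validation (a hedged sketch; the 102 GFLOPS per performance-CPU-core figure and the 32-core M1 Max GPU count are from this thread):

```python
# Crude model: a GPU core is 8x the vector width of a 512-bit CPU core,
# but runs at ~1/3 the clock, so ~2.67x the per-core GFLOPS.
cpu_core_gflops = 102.0
gpu_core_gflops = cpu_core_gflops * (8 / 3)  # ~272 GFLOPS per GPU core

m1_max_estimate = gpu_core_gflops * 32       # 32 GPU cores on M1 Max
print(round(m1_max_estimate))  # ~8700 GFLOPS, vs. the measured 10616
```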

Not bad: off by less than a factor of 2. The correct answer is 10616 GFLOPS. We can attribute the difference to the model being extremely crude, but it gets the point across.

philipturner commented 1 week ago

If you look up the Tachyum Prodigy, it provides another variation on "core". Their chips have 128 cores, each with 2048-bit vector throughput (2x1024-bit) and a 5.7 GHz clock speed.

```
// Gigainstructions per second
GINSTRS = width * clock speed / number of bits per scalar
51 GINSTRS = 512 bits * 3.2 gigacycles/second / 32 bits per scalar

// Giga floating point operations per second
// If an instruction is the FFMA32 instruction, people arbitrarily say
// it is "two" operations. I don't like this way of modeling performance.
51 GINSTRS * 2 = 102 GFLOPS
```
| Processor | Width | Clock Speed | GINSTRS (FP32) | GFLOPS (FP32) |
| --- | --- | --- | --- | --- |
| Apple Efficiency | 256-bit | 2.0 GHz | 16 | 32 |
| Apple Power | 512-bit | 3.2 GHz | 51 | 102 |
| Apple GPU | 4096-bit | 1.3 GHz | 166 | 332 |
| Intel i3/i5 | 256-bit | 4.0 GHz | 32 | 64 |
| Intel i7/i9 | 512-bit | 5.0 GHz | 80 | 160 |
| Intel Xeon | 1024-bit | 2.5 GHz | 80 | 160 |
| Intel GPU (Gen9) | 2048-bit | 1.1 GHz | 70 | 140 |
| Intel GPU (Arc) | 4096-bit? | | | |
| Tachyum | 2048-bit | 5.7 GHz | 365 | 730 |
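The GINSTRS/GFLOPS columns can be reproduced from the width and clock columns (a sketch of the formula above; the rounding convention is an assumption chosen to match the table):

```python
def ginstrs(width_bits, clock_ghz, bits_per_scalar=32):
    # Gigainstructions per second = width * clock speed / bits per scalar.
    return width_bits * clock_ghz / bits_per_scalar

def gflops(width_bits, clock_ghz):
    # FFMA counted as "two" operations, per the (disliked) convention.
    return 2 * round(ginstrs(width_bits, clock_ghz))

print(gflops(512, 3.2))   # Apple performance core: 102
print(gflops(4096, 1.3))  # Apple GPU core: 332
print(gflops(2048, 5.7))  # Tachyum core: 730
```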

[M1 Max CPU] If you multiply 102 GFLOPS by 8, you get about 800 GFLOPS.

[M1 Max GPU] If you multiply 332 GFLOPS by 32, you get about 10600 GFLOPS.

[Tachyum Prodigy] If you multiply 730 GFLOPS by 128, you get 93440 GFLOPS. This is how a CPU can reach ~93 TeraFLOPS of compute power and compete with GPUs.
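Chip-level totals are just the per-core numbers times the core count (same crude model, using the per-core figures from this thread):

```python
# Per-core FP32 GFLOPS and core counts, as discussed above.
per_core_gflops = {"M1 Max CPU": 102, "M1 Max GPU": 332, "Tachyum Prodigy": 730}
core_counts = {"M1 Max CPU": 8, "M1 Max GPU": 32, "Tachyum Prodigy": 128}

for chip in per_core_gflops:
    total = per_core_gflops[chip] * core_counts[chip]
    print(chip, total)  # 816, 10624, and 93440 GFLOPS respectively
```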

This model only applies to vector throughput. To understand the Outer Product Engine (Apple AMX), the explanation gets more complex. Same for the tensor cores on Nvidia GPUs, where I have no idea WTF they are doing (nor do I care).

philipturner commented 1 week ago

Apple Icestorm: 2x128-bit = 256 bit

Apple Firestorm: 4x128-bit = 512 bit

Apple GPU: 4x1024-bit = 4096 bit

The Apple GPU can issue an instruction from four SIMDs per core-cycle. Each SIMD has 32 threads, each carrying a 32-bit scalar: 4 x (32 x 32) = 4096. The CPU is similar, but uses out-of-order execution to issue from four NEON units per core-cycle (except on the nerfed efficiency cores, which have two vector units). Each NEON instruction is four 32-bit scalars: 4 x (4 x 32) = 512. If you think about the CPU like a GPU, it has smaller granularity for divergence (4 threads per SIMD). Intel CPUs have coarser granularity (8-wide vectors of IEEE 32-bit floats on consumer chips). The weakness of GPUs is that you need a lot of elements with very similar control flow, such as stupidly parallel AI models. Otherwise, with random branching, there's only a (0.50)^32 ≈ 2 x 10^-10 chance that all 32 threads take the same control flow branch.
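To see why SIMD width matters for divergence, here's a toy model (assuming each thread independently takes a branch with probability 0.5, which is deliberately crude):

```python
def prob_all_take_branch(simd_width, p=0.5):
    # Chance that every lane of the SIMD group takes the same (taken)
    # side of a branch, with independent 50/50 lanes.
    return p ** simd_width

print(prob_all_take_branch(4))   # 4-wide CPU "SIMD": 0.0625
print(prob_all_take_branch(32))  # 32-wide GPU SIMD: ~2.3e-10
```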

alecazam commented 1 week ago

Very interesting. I just find big/little CPU core identification useful for scheduling algorithms in my game titles. I was really searching for ray tracing perf counts, but noted that you don't have the M3 yet. There seem to be no sources for ray tracing perf on Apple systems. Apple says the A18 is up to 2x faster at RT than the A17, but if one doesn't know the A17's perf, then it's all relative to nothing. It would be good to know where Apple GPUs lie with respect to their desktop/power-hungry counterparts.

philipturner commented 1 week ago

I don't have any use for hardware-accelerated triangle intersections. I do ray tracing with ray-sphere intersections, which is very different from triangle meshes. Not even Nvidia has hardware support for ray-sphere intersections.

> but noted that you don't have M3 yet.

I have an M4 iPad Pro and an A17 iPhone. I was hired to do this (the work on FlashAttention); it wasn't out of personal interest. Maybe you'll find more information about the M3+ family in these sources.