halide / Halide

a language for fast, portable data-parallel computation
https://halide-lang.org

performance_matrix_multiplication, the actual peak flops is much higher than the theoretical peak flops #8480

Closed: Bbean closed this 2 days ago

Bbean commented 3 days ago

When testing performance_matrix_multiplication, the actual peak FLOPS is much higher than the theoretical peak FLOPS. I work on an Intel Core i5-8500B (Coffee Lake architecture), which I understood to have one 256-bit FMA (fused multiply-add) unit that can execute 8 single-precision floating-point operations at a time. The theoretical peak should therefore be 6 cores × 3.0 GHz × 8 FLOPs/cycle × 2 (mul + add) = 288 GFLOPS. However, the actual peak in the output I got is "Halide: 3.631948ms, 537.558090 GFLOP/s", which is higher even than the theoretical peak at the turbo frequency (4.1 GHz). I'd like to know where the possible problem lies. Thanks for any possible answers.
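For reference, here is that peak-FLOPS arithmetic as a tiny standalone program, parameterized by the number of FMA units. The figures are the ones quoted in this thread for the i5-8500B; nothing here is measured, it is just the calculation above written out.

```cpp
// Theoretical peak single-precision FLOPS for the i5-8500B figures quoted
// in this thread. Purely illustrative arithmetic, not a benchmark.
#include <cstdio>

int main() {
    const double cores = 6;          // i5-8500B core count
    const double base_clock_ghz = 3.0;
    const double simd_lanes = 8;     // 256-bit AVX2 = 8 single-precision lanes
    const double flops_per_fma = 2;  // one FMA = one multiply + one add

    const int unit_counts[] = {1, 2};
    for (int fma_units : unit_counts) {
        double peak_gflops =
            cores * base_clock_ghz * fma_units * simd_lanes * flops_per_fma;
        std::printf("%d FMA unit(s): %.0f GFLOPS theoretical peak\n",
                    fma_units, peak_gflops);
    }
    // Prints 288 GFLOPS assuming one FMA unit and 576 GFLOPS assuming two,
    // which is the crux of the discussion below.
    return 0;
}
```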

mcourteaux commented 2 days ago

On the Coffee Lake block diagram, I see 2 FP FMA units:

[image: Coffee Lake block diagram]

Rerunning the math with 2 FP FMA units doubles the number of FLOPS, yielding 576 GFLOPS, which is above what Halide reported. By that calculation, Halide achieves 93.3% of theoretical peak at 3 GHz.
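Spelling that percentage out with the base-clock figures above (just the arithmetic already in this thread, not a new measurement):

$$
\frac{537.56\ \text{GFLOP/s}}{6\ \text{cores} \times 3.0\ \text{GHz} \times 2\ \text{FMA units} \times 8\ \text{lanes} \times 2\ \text{flops/FMA}}
= \frac{537.56}{576} \approx 93.3\%
$$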

Bbean commented 2 days ago

That's a reasonable explanation, thank you for your reply. 93.3% is a very impressive result.

mcourteaux commented 2 days ago

Thinking more about this, I looked up the throughput on uops.info, and it's advertised as 0.5 cycles per instruction for the FMA operations on Coffee Lake. That would mean the theoretical peak FLOPS doubles again, to 1152 GFLOPS, and that Halide only achieves 46% of peak, which in turn seems low compared to the comments in the source code of that test:

https://github.com/halide/Halide/blob/ff12d4a23914361be31b51f9b924a1725be4781e/test/performance/matrix_multiplication.cpp#L43-L48

Let's ask @zvookin or @abadams to weigh in on this. Andrew recently even worked on the schedule of this test.

abadams commented 2 days ago

I believe 0.5 cycles per instruction is another way of saying "there are two FMA units", so both of those doublings are actually the same doubling.

mcourteaux commented 2 days ago

Ah dang, that's actually super sensible. Thanks for pointing that out. Does 93% of peak sound reasonable to you?

abadams commented 2 days ago

Yes. The correct inner loop for a matrix multiply is a standard known thing. You just have to tile such that things are limited by the number of fmas rather than the number of loads. Coffee Lake can do two aligned loads per cycle and two fmas per cycle, so there are lots of options that work. It was harder when you could only do one load per cycle. Given that, how close to peak you reach is largely determined by how you use the cache hierarchy. That CPU has 9 MB of L3. The two matrices involved add up to ~8MB, so they stay in L3. With any kind of sensible blocking over the i,j,k axes to get things into L2 and L1 at appropriate times, this makes it possible to keep the fma units fed. Blocking over the k axis is a bit annoying because it requires factoring the reduction.

If you make the matrices twice as large, so that they fall out of L3, performance is probably going to be worse. Matrix multiplication out of DRAM is a harder problem.
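To make the blocking idea concrete, here is a minimal Halide sketch in the spirit of what abadams describes: tile the output so the working set stays in cache, vectorize over the 8 single-precision lanes, and compute the accumulating reduction per tile. This is not the schedule actually used by the test (which is more involved and factors the reduction to block over k, as discussed above); the tile sizes, split factors, and variable names are illustrative assumptions.

```cpp
// A minimal blocked matrix multiply in Halide, sketching the tiling idea
// described above. Not the test's actual schedule; tile sizes and split
// factors are illustrative assumptions.
#include "Halide.h"
#include <cstdio>
using namespace Halide;

int main() {
    const int size = 1024;  // assumed matrix size

    ImageParam A(Float(32), 2, "A"), B(Float(32), 2, "B");

    Var x("x"), y("y"), xi("xi"), yi("yi");
    RDom k(0, size);

    // Same algorithm shape as the test: accumulate over k, then wrap the
    // reduction in an output Func.
    Func matrix_mul("matrix_mul"), out("out");
    matrix_mul(x, y) += A(k, y) * B(x, k);
    out(x, y) = matrix_mul(x, y);

    // Tile the output so each tile's accumulators and the slices of A and B
    // it touches stay in registers/L1, vectorize over the 8 single-precision
    // SIMD lanes, and parallelize across rows of tiles.
    out.tile(x, y, xi, yi, 16, 4)
       .vectorize(xi, 8)
       .parallel(y);

    // Compute the reduction per output tile. Keeping the 16x4 accumulator
    // block in vectorized, unrolled inner loops with k outermost is what
    // keeps the FMA units busy rather than the load ports.
    matrix_mul.compute_at(out, x)
              .vectorize(x, 8);
    matrix_mul.update()
              .reorder(x, y, k)
              .vectorize(x, 8)
              .unroll(y);

    out.compile_jit();  // check that the schedule is consistent
    std::printf("Compiled.\n");
    return 0;
}
```

The algorithm definition (the += over the RDom) matches the shape of the test; only the schedule here is a simplified stand-in for the real one.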

Bbean commented 2 hours ago

@abadams I found it really hard to understand this new schedule. I used compile_to_c, so I can see the actual computation: there are two intermediate buffers, and the reduction is blocked over the k axis. But the connection between that computation and the schedule is where I got stuck; I couldn't figure out the sequence of thinking that connects them. The schedule is roughly:

out.tile...
matrix_mul.update...
intm.compute_at...
intm.update...
matrix_mul.compute_at...
matrix_mul.update...
out.bound...

Could you give me some hints on the thinking process?