If we take:
```c
float f(int x[]) {
    float p = 1;
    for (int i = 0; i < 960; i++) p += 1;
    return p;
}
```
and compile simply with `-O` and no other flags, we get:
```asm
.LCPI0_0:
        .long   1148207104              # float 961
f:                                      # @f
        movss   xmm0, dword ptr [rip + .LCPI0_0] # xmm0 = mem[0],zero,zero,zero
        ret
```
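(For reference, `1148207104` is `0x44704000`, the IEEE-754 single-precision encoding of `961.0`, so the loop has already been folded to a constant load here. The Intel-syntax listing suggests output along the lines of `clang -O -S -masm=intel`, though the exact invocation is my assumption, not stated in the report.)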
Extended Description
Consider:
```c
float f(float x[]) {
    float p = 1.0;
    for (int i = 0; i < 960; i++) p += 1;
    return p;
}
```
When compiled with `-march=core-avx2 -O3 -ffast-math`, the generated assembly still contains the loop, adding 1 each iteration until it reaches 961.
However:
```c
int f(int x[]) {
    int p = 1;
    for (int i = 0; i < 960; i++) p += 1;
    return p;
}
```
gives:
```asm
f:                              # @f
        mov     eax, 961
        ret
```
I don't know how hard it would be to add the same optimization for floats and doubles.
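For what it's worth, the fold is exact here even without `-ffast-math`: every intermediate sum is an integer no larger than 961, well below 2^24, so each `p += 1` rounds to the exact result. A minimal sketch of the closed form the optimizer could emit (the name `f_folded` is mine, not from the report):

```c
/* Hypothetical closed form for the float loop above. Since every
   intermediate value is an integer <= 961 (< 2^24), 960 repetitions
   of p += 1 starting from 1.0f are all exact, and the result is
   bit-identical to running the loop. */
float f_folded(float x[]) {
    (void)x;           /* the array parameter is unused, as in the original */
    return 961.0f;     /* 1.0f + 960 * 1.0f */
}
```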
As a side note, there are in fact a number of interesting details in the first (float) loop. First, if we reduce the `i < 960` limit to `i < 959`, the loop is optimized out. Second, if we change the type to `double`, this upper limit drops to `i < 479`. My guess is that this corresponds to a loop-unrolling cost model built into the compiler.
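To pin down where that threshold sits, one could parameterize the trip count and diff the resulting assembly; a small probe along these lines (the `N` macro is my addition for illustration):

```c
/* Hypothetical threshold probe: rebuild with different -DN values and
   check whether the emitted assembly contains a loop or a constant. */
#ifndef N
#define N 959   /* per the report: folded at i < 959 for float, not at i < 960 */
#endif

float f(float x[]) {
    float p = 1.0f;
    for (int i = 0; i < N; i++)
        p += 1;
    return p;
}
```

For example, compare `clang -march=core-avx2 -O3 -ffast-math -S -DN=959 probe.c` against `-DN=960`, and `-DN=479`/`-DN=480` after switching `float` to `double`.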