Two main things I did:
- Since all the dimensions you're vectorizing are exact multiples of the lane size, you can move from `chunks()` to `chunks_exact()`. This removes a branch per iteration, letting the compiler drop the `cmp`/`jmp` pair and skip generating the "just in case" epilogue for a partial final chunk.
- Emit a prefetch instruction (x86_64 only, `_mm_prefetch`) in the innermost loop doing the dequantization from zero point + scaling. The compiler (or the hardware prefetcher) doesn't seem to catch the linear access over `qzeros` and generates cache misses. With the prefetch instruction we ask the CPU to load the next chunk into the L1/L2[/L3] caches ahead of time.
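To illustrate the first point, here's a minimal sketch of the `chunks_exact()` swap. `sum_lanes` and the lane width `LANES` are illustrative names, not taken from the actual kernel; the idea is just that when the length is a known multiple of `LANES`, `chunks_exact()` gives the compiler fixed-size chunks with no tail-handling branch to emit:

```rust
const LANES: usize = 8; // hypothetical SIMD lane count

// Sums the input lane-wise. With chunks_exact(), every chunk is
// guaranteed to be exactly LANES long, so the inner loop body
// carries no remainder check and vectorizes cleanly.
fn sum_lanes(data: &[f32]) -> [f32; LANES] {
    debug_assert_eq!(data.len() % LANES, 0);
    let mut acc = [0.0f32; LANES];
    for chunk in data.chunks_exact(LANES) {
        for (a, &x) in acc.iter_mut().zip(chunk) {
            *a += x;
        }
    }
    acc
}

fn main() {
    let data: Vec<f32> = (0..32).map(|i| i as f32).collect();
    println!("{:?}", sum_lanes(&data));
}
```

Note that `chunks_exact()` silently drops any trailing remainder (available via `.remainder()`), which is why this only applies when the dimensions divide evenly.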