Two main things I did:
- Since all the dimensions you're vectorizing are exact multiples of the lane size, you can move from `chunks()` to `chunks_exact()`. This removes a branch per iteration, letting the compiler drop the `cmp`/`jmp` pair and skip generating the "just in case" epilogue for a partial final chunk.
- Emit a prefetch instruction (x86_64 only, `_mm_prefetch`) in the innermost loop doing the dequantization from zero point + scaling. The compiler (or the hardware prefetcher) doesn't seem to catch the linear access over `qzeros` and generates cache misses. With the prefetch instruction we ask the CPU to load the next chunk into the L1/L2[/L3] caches ahead of time.
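To illustrate the first point, here's a minimal sketch of the `chunks_exact()` swap. `sum_lanes` and the lane width `LANES` are illustrative names, not taken from the actual kernel; the idea is just that when the length is a known multiple of `LANES`, `chunks_exact()` gives the compiler fixed-size chunks with no tail-handling branch to emit:

```rust
const LANES: usize = 8; // hypothetical SIMD lane count

// Sums the input lane-wise. With chunks_exact(), every chunk is
// guaranteed to be exactly LANES long, so the inner loop body
// carries no remainder check and vectorizes cleanly.
fn sum_lanes(data: &[f32]) -> [f32; LANES] {
    debug_assert_eq!(data.len() % LANES, 0);
    let mut acc = [0.0f32; LANES];
    for chunk in data.chunks_exact(LANES) {
        for (a, &x) in acc.iter_mut().zip(chunk) {
            *a += x;
        }
    }
    acc
}

fn main() {
    let data: Vec<f32> = (0..32).map(|i| i as f32).collect();
    println!("{:?}", sum_lanes(&data));
}
```

Note that `chunks_exact()` silently drops any trailing remainder (available via `.remainder()`), which is why this only applies when the dimensions divide evenly.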