Closed Shark64 closed 6 months ago
Here are the changes for SM3. I've switched the boolean macros to vpternlog, and I also use it for the 3-way xor in the body. Two more minor changes ;) : there's no need to "jump the jmp" at the end of the main loop, just loop back if count != 0. Also, a couple of vprold reg1, reg1, IMM instructions immediately followed by vmovups reg2, reg1 can be encoded simply as vprold reg2, reg1, IMM. On my PC this made sm3_mb_vs_ossl_perf go from ~4.4GB/s to ~5.1GB/s :)
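For readers unfamiliar with vpternlog, here is a minimal scalar sketch in C of how a single ternary-logic instruction can stand in for the SM3 boolean functions and the 3-way xor. It is not the code from this PR: `ternlog32`, the `IMM_*` constants and the `ref_*` helpers are hypothetical names used only for illustration, and the immediates are derived from the usual truth-table convention (bit index = A<<2 | B<<1 | C), not copied from the patch.

```c
#include <stdint.h>

/* Scalar emulation of one 32-bit lane of vpternlogd (illustrative only).
 * For every bit position, the three input bits form an index
 * (a << 2) | (b << 1) | c, and the result bit is that bit of imm8. */
static uint32_t ternlog32(uint32_t a, uint32_t b, uint32_t c, uint8_t imm8)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        unsigned idx = (((a >> i) & 1u) << 2)
                     | (((b >> i) & 1u) << 1)
                     |  ((c >> i) & 1u);
        r |= (uint32_t)((imm8 >> idx) & 1u) << i;
    }
    return r;
}

/* Immediates obtained by evaluating each function on the truth-table
 * columns A = 0xF0, B = 0xCC, C = 0xAA. */
enum {
    IMM_XOR3 = 0x96, /* a ^ b ^ c: FF/GG for rounds 0..15, and any 3-way xor */
    IMM_MAJ  = 0xE8, /* (a & b) | (a & c) | (b & c): FF for rounds 16..63 */
    IMM_CH   = 0xCA  /* (a & b) | (~a & c): GG for rounds 16..63 */
};

/* Reference versions built from separate and/or/xor operations, i.e. what
 * the multi-instruction macros compute before being folded into one op. */
static uint32_t ref_xor3(uint32_t a, uint32_t b, uint32_t c) { return a ^ b ^ c; }
static uint32_t ref_maj(uint32_t a, uint32_t b, uint32_t c)  { return (a & b) | (a & c) | (b & c); }
static uint32_t ref_ch(uint32_t a, uint32_t b, uint32_t c)   { return (a & b) | (~a & c); }
```

Each reference function needs two to five separate logic operations, while the ternary-logic form needs one, which is presumably where part of the speedup comes from.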
Which CPU are you using? That throughput is pretty high! :)
Rocket Lake, an i7-11700K. Perhaps it's the low-latency DDR4 that helps more than the CPU core itself.
No, these tests use warm data. You must be using turbo boost, so your CPU frequency goes to 5GHz.
Yeah, you're right. I hadn't checked the frequency with turbostat, but now I've noticed the single core goes up to 5.1GHz for a brief time. So it's the CPU after all :)
Code is now merged, thanks for the work @Shark64!