openwall / john

John the Ripper jumbo - advanced offline password cracker, which supports hundreds of hash and cipher types, and runs on many operating systems, CPUs, GPUs, and even some FPGAs
https://www.openwall.com/john/

Use interleaved SHA-NI for SHA-256 on some CPUs lacking AVX-512 #5437

Open solardiz opened 8 months ago

solardiz commented 8 months ago

In https://github.com/openwall/john/issues/5435#issuecomment-1943397943 @ukasz wrote:

I wanted to play with SHA-256 specifically, because I noticed some improvement in the latency of SHA-NI instructions on newer architectures (compared to the initial release) on Agner Fog's website. I was wondering how that could stack up against john's implementation. Some initial testing showed that, due to the low register usage of SHA-256, we should be able to calculate two hashes at once using the SHA extensions, and this gives about 1.5x the performance of calculating just one hash. So in general the question was what is faster: 1.5x with SHA-NI, or john's AVX implementation? Unfortunately I don't know yet.

As it happens, @alainesp was also experimenting with that just recently:

https://github.com/alainesp/fast-small-crypto

The preliminary results we have suggest that on some AMD CPUs, 2x interleaved SHA-NI can be almost twice as fast as 1x, and ~75% faster than AVX2: https://github.com/alainesp/fast-small-crypto/actions/runs/7876924916/job/21491982542 (I only guess that this ran on an AMD CPU, but the results are apparently similar to Alain's testing on his known AMD).

However, in my testing of Alain's code on Intel Tiger Lake (11th gen), building with gcc 11, 1x and 2x SHA-NI run at similar speeds, both very slightly slower than AVX2 and almost 3 times slower than AVX-512.

There's no improvement from SHA-NI for SHA-1 anywhere we tested.

@ukasz What CPUs did you see improved latencies for, and what CPU are you testing on? Maybe things improved on newer Intel CPUs. If any of those lack AVX-512, it could be reasonable to use SHA-NI there as well.

I wonder if it would make sense to mix SHA-NI and AVX2 or AVX-512 instructions on any CPUs. I guess this depends on what execution ports these groups of instructions utilize.

Separately, I hear similar instructions for SHA-512 are coming in near-future CPUs. I guess those will outperform AVX2, but not necessarily AVX-512.

alainesp commented 8 months ago

I wonder if it would make sense to mix SHA-NI and AVX2 or AVX-512 instructions on any CPUs. I guess this depends on what execution ports these groups of instructions utilize.

I experimented with this and the performance dropped dramatically. Mixing SHA1-NI with AVX2 gave about 11M/s, whereas AVX2 alone gave about 49M/s.

I also tested 3x interleaving for SHA1-NI, and it improves performance by about 5% with Clang on my AMD CPU, but the gain depended on the compiler used, and I am not sure how it will behave on other CPUs with different caches. I didn't try 3x interleaving for SHA256-NI.

claudioandre-br commented 8 months ago

I only guess that this ran on an AMD CPU

I don't have much historical data, but I bet you are correct. All of my last 5 (GitHub Actions) runs have been on:

processor   : 0
vendor_id   : AuthenticAMD
cpu family  : 25
model       : 1
model name  : AMD EPYC 7763 64-Core Processor
stepping    : 1
microcode   : 0xffffffff
cpu MHz     : 3241.770
cache size  : 512 KB
physical id : 0
siblings    : 4
core id     : 0
cpu cores   : 2
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat npt nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload umip vaes vpclmulqdq rdpid fsrm
bugs        : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass srso
bogomips    : 4890.85
TLB size    : 2560 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management:

ukasz commented 8 months ago

I was testing it on a 13900. According to agner.org, the latencies in cycles for sha256rnds2, sha256msg1, sha256msg2 respectively are:

- AMD Zen 1-3: 4, 2, 3
- AMD Zen 4: 4, 2, 5
- Intel Cannon Lake: 2, 6, 13
- Intel Ice Lake: 1, 6, 12
- Intel Goldmont: 8, 3, 3
- Intel Tremont: 6, 3, 3

None of these match the 13900, so I wanted to measure it myself, but I lost access to that machine for a while, so it will have to wait.

ukasz commented 8 months ago

If I measured it correctly, on a 14700K it is 3, 2, 2. 13th gen should be similar.