kmaragon / Konscious.Security.Cryptography

MIT License

Intrinsics experimental POC #52

Open TimothyMakkison opened 1 year ago

TimothyMakkison commented 1 year ago

I've been playing around with intrinsics and thought that this project would benefit from parallelization. By adding ModifiedBlake2Intrinsics to parallelize the shuffle I experienced performance increases of 40-55%.

Argon

Without intrinsics

image

With Intrinsics

image

I've only changed Argon but from looking at Blake2Fast you could probably get some performance gains for normal blake usage.

I'm new to intrinsics and haven't added any tests but this should demonstrate the potential benefits.

TimothyMakkison commented 1 year ago

Blake

Before: non-intrinsics version (with reduced memory usage)

image

Intrinsics refactor d90f3a6

image

Using code from @saucecontrol Blake2Fast

image

Adapted Blake2bSimd to use intrinsics in d90f3a6. It's arguably more readable than the Blake2Fast version but is slightly slower. Both are at least 4x faster than the current version.

Changed Blake2bNormal to use stackalloc for v and m, reducing memory usage to a flat 416 bytes.

Blake2bNormal before stackalloc

image

kmaragon commented 1 year ago

I'm just getting some time to look at this repo for the first time in like years. I appreciate this PR so much. This is exactly what I wanted to do with this library from the start. But I was a C++ developer working on .NET Core 1.0 on Linux. I'm going to re-open this with your commits squashed while trying to add ARM64/SVE support and tidying up the CI / versioning stuff before publishing. Thank you so much for this!

kmaragon commented 1 year ago

Also adding a note here that .NET 8 is beginning to introduce support for AVX-512, which will require an update to use but should improve things even more, assuming newer-gen chips don't suffer from the power draw issues that AVX-512 extensions have historically had.

TimothyMakkison commented 1 year ago

Glad you liked it! 😄

ModifiedBlake2Intrinsics was largely done at 2 AM. It passes the tests but I have no idea if it is safe. I'm sure the performance can be improved. I had meant to compare this to a proper SIMD Argon2id implementation in Rust, C or C++. It looks like they have two versions of the diagonalize and G functions, whereas I shuffled the vectors, repeatedly using the same G functions.

Blake2bSimd should be solid as it's from SauceControl/Blake2Fast; the only modification is the use of spans.
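For readers following the diagonalize/G discussion, here is a minimal scalar sketch in C of the Argon2 permutation round, written from my understanding of the PHC reference code. The names `rotr64`, `f_blamka`, `blamka_g` and `blamka_round` are illustrative and are not taken from this repository or from Blake2Fast. In scalar form the "diagonal" step is just a different choice of indices fed to the same G; a SIMD version has to get the same effect either by shuffling lanes or by writing a second G variant.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 64-bit word right by c bits (0 < c < 64). */
static inline uint64_t rotr64(uint64_t w, unsigned c) {
    assert(c > 0 && c < 64);
    return (w >> c) | (w << (64 - c));
}

/* Argon2's modified addition: x + y + 2 * lo32(x) * lo32(y), mod 2^64. */
static inline uint64_t f_blamka(uint64_t x, uint64_t y) {
    const uint64_t mask = UINT64_C(0xFFFFFFFF);
    return x + y + 2 * ((x & mask) * (y & mask));
}

/* The G mixing function on four 64-bit words (no message input in Argon2). */
static void blamka_g(uint64_t *a, uint64_t *b, uint64_t *c, uint64_t *d) {
    *a = f_blamka(*a, *b); *d = rotr64(*d ^ *a, 32);
    *c = f_blamka(*c, *d); *b = rotr64(*b ^ *c, 24);
    *a = f_blamka(*a, *b); *d = rotr64(*d ^ *a, 16);
    *c = f_blamka(*c, *d); *b = rotr64(*b ^ *c, 63);
}

/* One round over a 4x4 matrix of words: four column mixes, then four
 * diagonal mixes. Both steps call the same G with different indices. */
static void blamka_round(uint64_t v[16]) {
    /* Column step */
    blamka_g(&v[0], &v[4], &v[8],  &v[12]);
    blamka_g(&v[1], &v[5], &v[9],  &v[13]);
    blamka_g(&v[2], &v[6], &v[10], &v[14]);
    blamka_g(&v[3], &v[7], &v[11], &v[15]);
    /* Diagonal step */
    blamka_g(&v[0], &v[5], &v[10], &v[15]);
    blamka_g(&v[1], &v[6], &v[11], &v[12]);
    blamka_g(&v[2], &v[7], &v[8],  &v[13]);
    blamka_g(&v[3], &v[4], &v[9],  &v[14]);
}
```

This is only a reference point for checking a vectorized version word by word, not a statement of how either implementation in this PR is structured.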

TimothyMakkison commented 1 year ago

Added the PHC compress function. It passes the tests but the old hacky version appears to be faster?

Hacky version

| Method | Job | EnvironmentVariables | Iterations | RamKilobytes | Mean | Error | StdDev | Median | Ratio | RatioSD | Gen0 | Gen1 | Gen2 | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GetHashAsync | Job-VMNFSH | COMPlus_EnableSSE2=0 | 1 | 65536 | 80.01 ms | 1.597 ms | 3.226 ms | 79.12 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 64.58 MB | 1.00 |
| GetHashAsync | Job-VWRWYY | Empty | 1 | 65536 | 42.65 ms | 1.023 ms | 2.967 ms | 42.10 ms | 0.55 | 0.05 | 1000.0000 | 1000.0000 | 1000.0000 | 64.59 MB | 1.00 |
| GetHashAsync | Job-VMNFSH | COMPlus_EnableSSE2=0 | 1 | 73728 | 90.40 ms | 1.767 ms | 2.534 ms | 89.53 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 72.61 MB | 1.00 |
| GetHashAsync | Job-VWRWYY | Empty | 1 | 73728 | 46.02 ms | 0.917 ms | 2.432 ms | 45.32 ms | 0.52 | 0.03 | 1000.0000 | 1000.0000 | 1000.0000 | 72.6 MB | 1.00 |
| GetHashAsync | Job-VMNFSH | COMPlus_EnableSSE2=0 | 1 | 81920 | 101.67 ms | 2.030 ms | 3.336 ms | 100.79 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 80.64 MB | 1.00 |
| GetHashAsync | Job-VWRWYY | Empty | 1 | 81920 | 52.65 ms | 1.482 ms | 4.229 ms | 51.23 ms | 0.54 | 0.06 | 1000.0000 | 1000.0000 | 1000.0000 | 80.64 MB | 1.00 |
| GetHashAsync | Job-VMNFSH | COMPlus_EnableSSE2=0 | 1 | 90112 | 108.64 ms | 1.813 ms | 1.514 ms | 108.43 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 88.67 MB | 1.00 |
| GetHashAsync | Job-VWRWYY | Empty | 1 | 90112 | 54.80 ms | 1.080 ms | 2.566 ms | 54.04 ms | 0.51 | 0.02 | 1000.0000 | 1000.0000 | 1000.0000 | 88.68 MB | 1.00 |
| GetHashAsync | Job-VMNFSH | COMPlus_EnableSSE2=0 | 6 | 65536 | 381.93 ms | 2.475 ms | 1.933 ms | 382.09 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 64.61 MB | 1.00 |
| GetHashAsync | Job-VWRWYY | Empty | 6 | 65536 | 162.35 ms | 2.922 ms | 4.374 ms | 162.10 ms | 0.42 | 0.01 | 1000.0000 | 1000.0000 | 1000.0000 | 64.61 MB | 1.00 |
| GetHashAsync | Job-VMNFSH | COMPlus_EnableSSE2=0 | 6 | 73728 | 438.29 ms | 6.539 ms | 6.116 ms | 437.29 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 72.65 MB | 1.00 |
| GetHashAsync | Job-VWRWYY | Empty | 6 | 73728 | 184.10 ms | 3.537 ms | 4.211 ms | 182.78 ms | 0.42 | 0.01 | 1000.0000 | 1000.0000 | 1000.0000 | 72.64 MB | 1.00 |
| GetHashAsync | Job-VMNFSH | COMPlus_EnableSSE2=0 | 6 | 81920 | 483.11 ms | 4.978 ms | 4.657 ms | 483.11 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 80.67 MB | 1.00 |
| GetHashAsync | Job-VWRWYY | Empty | 6 | 81920 | 199.99 ms | 3.996 ms | 4.602 ms | 199.00 ms | 0.42 | 0.01 | 1000.0000 | 1000.0000 | 1000.0000 | 80.67 MB | 1.00 |
| GetHashAsync | Job-VMNFSH | COMPlus_EnableSSE2=0 | 6 | 90112 | 532.28 ms | 4.697 ms | 4.394 ms | 532.31 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 88.71 MB | 1.00 |
| GetHashAsync | Job-VWRWYY | Empty | 6 | 90112 | 218.81 ms | 3.116 ms | 2.914 ms | 218.86 ms | 0.41 | 0.01 | 1000.0000 | 1000.0000 | 1000.0000 | 88.71 MB | 1.00 |

PHC

| Method | Job | EnvironmentVariables | Iterations | RamKilobytes | Mean | Error | StdDev | Median | Ratio | RatioSD | Gen0 | Gen1 | Gen2 | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GetHashAsync | Job-UTPDAE | COMPlus_EnableSSE2=0 | 1 | 65536 | 78.49 ms | 1.546 ms | 2.941 ms | 77.60 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 64.57 MB | 1.00 |
| GetHashAsync | Job-MLBAUK | Empty | 1 | 65536 | 46.31 ms | 0.920 ms | 2.654 ms | 45.58 ms | 0.60 | 0.05 | 1000.0000 | 1000.0000 | 1000.0000 | 64.57 MB | 1.00 |
| GetHashAsync | Job-UTPDAE | COMPlus_EnableSSE2=0 | 1 | 73728 | 89.21 ms | 1.768 ms | 2.478 ms | 88.82 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 72.61 MB | 1.00 |
| GetHashAsync | Job-MLBAUK | Empty | 1 | 73728 | 55.02 ms | 1.091 ms | 2.614 ms | 54.52 ms | 0.62 | 0.03 | 1000.0000 | 1000.0000 | 1000.0000 | 72.62 MB | 1.00 |
| GetHashAsync | Job-UTPDAE | COMPlus_EnableSSE2=0 | 1 | 81920 | 100.68 ms | 2.007 ms | 3.515 ms | 99.52 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 80.64 MB | 1.00 |
| GetHashAsync | Job-MLBAUK | Empty | 1 | 81920 | 61.48 ms | 1.227 ms | 2.393 ms | 60.59 ms | 0.61 | 0.03 | 1000.0000 | 1000.0000 | 1000.0000 | 80.62 MB | 1.00 |
| GetHashAsync | Job-UTPDAE | COMPlus_EnableSSE2=0 | 1 | 90112 | 109.50 ms | 2.186 ms | 3.530 ms | 107.99 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 88.67 MB | 1.00 |
| GetHashAsync | Job-MLBAUK | Empty | 1 | 90112 | 65.04 ms | 1.291 ms | 2.361 ms | 64.42 ms | 0.60 | 0.03 | 1000.0000 | 1000.0000 | 1000.0000 | 88.65 MB | 1.00 |
| GetHashAsync | Job-UTPDAE | COMPlus_EnableSSE2=0 | 6 | 65536 | 391.36 ms | 7.696 ms | 11.037 ms | 386.65 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 64.6 MB | 1.00 |
| GetHashAsync | Job-MLBAUK | Empty | 6 | 65536 | 197.04 ms | 3.842 ms | 4.111 ms | 196.55 ms | 0.50 | 0.02 | 1000.0000 | 1000.0000 | 1000.0000 | 64.61 MB | 1.00 |
| GetHashAsync | Job-UTPDAE | COMPlus_EnableSSE2=0 | 6 | 73728 | 428.48 ms | 5.384 ms | 4.773 ms | 428.75 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 72.64 MB | 1.00 |
| GetHashAsync | Job-MLBAUK | Empty | 6 | 73728 | 231.74 ms | 3.331 ms | 3.420 ms | 231.15 ms | 0.54 | 0.01 | 1000.0000 | 1000.0000 | 1000.0000 | 72.64 MB | 1.00 |
| GetHashAsync | Job-UTPDAE | COMPlus_EnableSSE2=0 | 6 | 81920 | 476.48 ms | 5.884 ms | 5.504 ms | 475.71 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 80.67 MB | 1.00 |
| GetHashAsync | Job-MLBAUK | Empty | 6 | 81920 | 262.77 ms | 3.401 ms | 3.182 ms | 263.35 ms | 0.55 | 0.01 | 1000.0000 | 1000.0000 | 1000.0000 | 80.65 MB | 1.00 |
| GetHashAsync | Job-UTPDAE | COMPlus_EnableSSE2=0 | 6 | 90112 | 522.82 ms | 3.090 ms | 2.580 ms | 523.01 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 88.7 MB | 1.00 |
| GetHashAsync | Job-MLBAUK | Empty | 6 | 90112 | 290.81 ms | 3.877 ms | 3.437 ms | 290.16 ms | 0.56 | 0.01 | 1000.0000 | 1000.0000 | 1000.0000 | 88.73 MB | 1.00 |

kmaragon commented 1 year ago

I think that if I'm reading the numbers right, they're pretty much the same? Which is what I would sort of expect. The implementations actually seem roughly the same overall. The PHC version is just easier to read and link back to the reference implementation.

TimothyMakkison commented 1 year ago

> I think that if I'm reading the numbers right, they're pretty much the same? Which is what I would sort of expect. The implementations actually seem roughly the same overall.

The ratio seems to be lower for the older version. In my runs PHC came out at 0.55-0.60, with the older one at 0.40-0.55. I couldn't figure out why the two appeared to be different. After giving it some thought, it could be due to a combination of missing AggressiveInlining, loop unrolling and possible JIT funkiness with the rotr methods.
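To put numbers on the gap, here is a small C sketch that compares the intrinsics-enabled (`Empty` environment) mean times from the two tables earlier in the thread. The helper names are made up for illustration and the arrays are copied by hand from the Mean column, so treat the result as a rough reading rather than a new measurement. By this arithmetic the PHC version is roughly 9% to 33% slower than the hacky one. At 6 iterations the gaps are roughly 35 to 72 ms, well above the reported StdDev values, while the smallest 1-iteration gap (about 3.7 ms) is close to the noise.

```c
#include <assert.h>
#include <stddef.h>

#define RUNS 8

/* Mean times in ms for the intrinsics-enabled jobs, copied from the tables:
 * 1 iteration x {65536, 73728, 81920, 90112} KB, then 6 iterations. */
static const double hacky_ms[RUNS] = {42.65, 46.02, 52.65, 54.80,
                                      162.35, 184.10, 199.99, 218.81};
static const double phc_ms[RUNS]   = {46.31, 55.02, 61.48, 65.04,
                                      197.04, 231.74, 262.77, 290.81};

/* How much slower candidate is than baseline, as a percentage. */
static double percent_slower(double baseline_ms, double candidate_ms) {
    assert(baseline_ms > 0.0);
    return (candidate_ms / baseline_ms - 1.0) * 100.0;
}

/* Smallest PHC-vs-hacky slowdown across all eight configurations. */
static double min_percent_slower(void) {
    double best = percent_slower(hacky_ms[0], phc_ms[0]);
    for (size_t i = 1; i < RUNS; i++) {
        double p = percent_slower(hacky_ms[i], phc_ms[i]);
        if (p < best) best = p;
    }
    return best;
}

/* Largest PHC-vs-hacky slowdown across all eight configurations. */
static double max_percent_slower(void) {
    double worst = percent_slower(hacky_ms[0], phc_ms[0]);
    for (size_t i = 1; i < RUNS; i++) {
        double p = percent_slower(hacky_ms[i], phc_ms[i]);
        if (p > worst) worst = p;
    }
    return worst;
}
```

The two tables come from separate benchmark runs, so some run-to-run drift is mixed into these figures.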

> The PHC version is just easier to read and link back to the reference implementation.

100% agree, no idea why I didn't look for the official version sooner.

kmaragon commented 1 year ago

Squashed branch is at https://github.com/kmaragon/Konscious.Security.Cryptography/tree/feature/intrinsics. I'm also calling this 2.0 and getting rid of the ability to explicitly integrate the tasks and just pushing it all into Parallel.ForEach with no async contracts. From the issues it seems like no one is able to use the async contracts. Or maybe it's just that they are the loudest bunch. Either way, I'll remove them entirely.

I've implemented the modifiedblake2 stuff there for ARM. I'd like to get SSE4 in there too for good measure. It'll probably be similar to the ARM NEON implementation. I was looking at saucecontrol's blake2 work and it's strictly for x86. That said, their SSE4 implementation may serve as a reasonable base for AdvSimd support. .NET 8.0 is looking to be adding support for AVX512 but I see no word on ARMv9 SVE2 yet. I expect the latter to be the biggest bump in perf for users on the hardware. Maybe the next gen Apple chips?

saucecontrol commented 1 year ago

Very cool to see this work going on here 👍

I'm following the AVX-512 work in .NET 8 closely and will be using my blake2 project for testing once more of the instructions are available in the API. Keep an eye out for updates this year if you're interested.

Arm SVE in .NET is probably a ways off. They're not prioritizing it at the moment because hardware implementing it isn't widely available.