Open TimothyMakkison opened 1 year ago
Adapted Blake2bSimd to use intrinsics in d90f3a6. It's arguably more readable than the Blake2Fast version but is slightly slower; both are at least 4x faster than the current version.
Changed Blake2bNormal to use stackalloc for `v` and `m`, reducing memory usage to a flat 416 bytes.
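For anyone unfamiliar with the `stackalloc` change, here is a minimal sketch of the idea, not the code from the PR. The class and method names are made up for illustration, and copying `m` into `v` is only a stand-in for the real initialisation of the working vector.

```csharp
using System;
using System.Buffers.Binary;

public static class Blake2bScratchSketch
{
    // Hypothetical helper: loads one 128-byte block into stack-allocated
    // scratch buffers and returns a simple sum so the result is observable.
    public static ulong SumMessageWords(ReadOnlySpan<byte> block)
    {
        if (block.Length != 128)
            throw new ArgumentException("A Blake2b block is 128 bytes.", nameof(block));

        // 16 words of 8 bytes each. Both buffers live on the stack,
        // so this method makes no heap allocation.
        Span<ulong> v = stackalloc ulong[16]; // working vector
        Span<ulong> m = stackalloc ulong[16]; // message words

        for (int i = 0; i < 16; i++)
        {
            // Blake2b reads message words as little-endian 64-bit integers.
            m[i] = BinaryPrimitives.ReadUInt64LittleEndian(block.Slice(i * 8, 8));
        }

        // Stand-in only: the real compress function builds v from the
        // chaining state and the IV, not from the message.
        m.CopyTo(v);

        ulong sum = 0;
        foreach (ulong word in v)
            sum = unchecked(sum + word);
        return sum;
    }
}
```

Because the spans never escape the method, the buffers disappear when it returns and never show up in the Allocated column.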
I'm just getting some time to look at this repo for the first time in like years. I appreciate this PR so much. This is exactly what I wanted to do with this library from the start. But I was a C++ developer working on .NET Core 1.0 on Linux. I'm going to re-open this with your commits squashed while trying to add ARM64/SVE support and tidying up the CI / versioning stuff before publishing. Thank you so much for this!
Also adding a note here that .NET 8 is beginning to introduce support for AVX-512, which will require an update to use but should improve things even more, assuming newer-gen chips don't suffer from the power draw issues that they have historically had with AVX-512 extensions.
Glad you liked it! 😄
`ModifiedBlake2Intrinsics` was largely done at 2 AM; it passes the tests but I have no idea if it's safe. I'm sure the performance can be improved.
I had meant to compare this to proper SIMD Argon2id implementations in Rust, C or C++. It looks like they have two versions of the diagonalize and G functions, whereas I shuffled the vectors, repeatedly using the same G functions.
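For context on what the G function does, here is a rough sketch of Argon2's modified Blake2b mixing step in scalar form and as a four-lane vector form. It is not the `ModifiedBlake2Intrinsics` code from this PR. It assumes .NET 7 or later for the cross-platform `Vector256<T>` operators, the class name is hypothetical, and the 64-bit vector multiply may not map to a single instruction, so treat it as a correctness illustration rather than a tuned kernel.

```csharp
using System;
using System.Runtime.Intrinsics;

public static class BlaMkaSketch
{
    // Scalar mix: x + y + 2 * lo32(x) * lo32(y), wrapping modulo 2^64.
    public static ulong Mix(ulong x, ulong y) =>
        unchecked(x + y + 2UL * (x & 0xFFFFFFFFUL) * (y & 0xFFFFFFFFUL));

    public static ulong Rotr(ulong x, int n) => (x >> n) | (x << (64 - n));

    // One G application on a single (a, b, c, d) column.
    public static void G(ref ulong a, ref ulong b, ref ulong c, ref ulong d)
    {
        a = Mix(a, b); d = Rotr(d ^ a, 32);
        c = Mix(c, d); b = Rotr(b ^ c, 24);
        a = Mix(a, b); d = Rotr(d ^ a, 16);
        c = Mix(c, d); b = Rotr(b ^ c, 63);
    }

    // Vector mix: the same formula applied to four 64-bit lanes at once.
    public static Vector256<ulong> Mix(Vector256<ulong> x, Vector256<ulong> y)
    {
        Vector256<ulong> mask = Vector256.Create(0xFFFFFFFFUL);
        Vector256<ulong> lo = (x & mask) * (y & mask); // lo32(x) * lo32(y) per lane
        return x + y + lo + lo;
    }

    public static Vector256<ulong> Rotr(Vector256<ulong> x, int n) =>
        Vector256.ShiftRightLogical(x, n) | Vector256.ShiftLeft(x, 64 - n);

    // Same G, but each argument holds four independent columns.
    public static void G(ref Vector256<ulong> a, ref Vector256<ulong> b,
                         ref Vector256<ulong> c, ref Vector256<ulong> d)
    {
        a = Mix(a, b); d = Rotr(d ^ a, 32);
        c = Mix(c, d); b = Rotr(b ^ c, 24);
        a = Mix(a, b); d = Rotr(d ^ a, 16);
        c = Mix(c, d); b = Rotr(b ^ c, 63);
    }
}
```

With one G that works on whole vectors, the diagonal step can be handled by rotating the lanes of `b`, `c` and `d` before calling the same G again and rotating them back afterwards, which is the shuffle approach described above.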
`Blake2bSimd` should be solid as it's from SauceControl/Blake2Fast; the only modification is the use of spans.
Added the PHC compress function. It passes the tests but the old hacky version appears to be faster?
Old version:
Method | Job | EnvironmentVariables | Iterations | RamKilobytes | Mean | Error | StdDev | Median | Ratio | RatioSD | Gen0 | Gen1 | Gen2 | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
GetHashAsync | Job-VMNFSH | COMPlus_EnableSSE2=0 | 1 | 65536 | 80.01 ms | 1.597 ms | 3.226 ms | 79.12 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 64.58 MB | 1.00 |
GetHashAsync | Job-VWRWYY | Empty | 1 | 65536 | 42.65 ms | 1.023 ms | 2.967 ms | 42.10 ms | 0.55 | 0.05 | 1000.0000 | 1000.0000 | 1000.0000 | 64.59 MB | 1.00 |
GetHashAsync | Job-VMNFSH | COMPlus_EnableSSE2=0 | 1 | 73728 | 90.40 ms | 1.767 ms | 2.534 ms | 89.53 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 72.61 MB | 1.00 |
GetHashAsync | Job-VWRWYY | Empty | 1 | 73728 | 46.02 ms | 0.917 ms | 2.432 ms | 45.32 ms | 0.52 | 0.03 | 1000.0000 | 1000.0000 | 1000.0000 | 72.6 MB | 1.00 |
GetHashAsync | Job-VMNFSH | COMPlus_EnableSSE2=0 | 1 | 81920 | 101.67 ms | 2.030 ms | 3.336 ms | 100.79 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 80.64 MB | 1.00 |
GetHashAsync | Job-VWRWYY | Empty | 1 | 81920 | 52.65 ms | 1.482 ms | 4.229 ms | 51.23 ms | 0.54 | 0.06 | 1000.0000 | 1000.0000 | 1000.0000 | 80.64 MB | 1.00 |
GetHashAsync | Job-VMNFSH | COMPlus_EnableSSE2=0 | 1 | 90112 | 108.64 ms | 1.813 ms | 1.514 ms | 108.43 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 88.67 MB | 1.00 |
GetHashAsync | Job-VWRWYY | Empty | 1 | 90112 | 54.80 ms | 1.080 ms | 2.566 ms | 54.04 ms | 0.51 | 0.02 | 1000.0000 | 1000.0000 | 1000.0000 | 88.68 MB | 1.00 |
GetHashAsync | Job-VMNFSH | COMPlus_EnableSSE2=0 | 6 | 65536 | 381.93 ms | 2.475 ms | 1.933 ms | 382.09 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 64.61 MB | 1.00 |
GetHashAsync | Job-VWRWYY | Empty | 6 | 65536 | 162.35 ms | 2.922 ms | 4.374 ms | 162.10 ms | 0.42 | 0.01 | 1000.0000 | 1000.0000 | 1000.0000 | 64.61 MB | 1.00 |
GetHashAsync | Job-VMNFSH | COMPlus_EnableSSE2=0 | 6 | 73728 | 438.29 ms | 6.539 ms | 6.116 ms | 437.29 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 72.65 MB | 1.00 |
GetHashAsync | Job-VWRWYY | Empty | 6 | 73728 | 184.10 ms | 3.537 ms | 4.211 ms | 182.78 ms | 0.42 | 0.01 | 1000.0000 | 1000.0000 | 1000.0000 | 72.64 MB | 1.00 |
GetHashAsync | Job-VMNFSH | COMPlus_EnableSSE2=0 | 6 | 81920 | 483.11 ms | 4.978 ms | 4.657 ms | 483.11 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 80.67 MB | 1.00 |
GetHashAsync | Job-VWRWYY | Empty | 6 | 81920 | 199.99 ms | 3.996 ms | 4.602 ms | 199.00 ms | 0.42 | 0.01 | 1000.0000 | 1000.0000 | 1000.0000 | 80.67 MB | 1.00 |
GetHashAsync | Job-VMNFSH | COMPlus_EnableSSE2=0 | 6 | 90112 | 532.28 ms | 4.697 ms | 4.394 ms | 532.31 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 88.71 MB | 1.00 |
GetHashAsync | Job-VWRWYY | Empty | 6 | 90112 | 218.81 ms | 3.116 ms | 2.914 ms | 218.86 ms | 0.41 | 0.01 | 1000.0000 | 1000.0000 | 1000.0000 | 88.71 MB | 1.00 |

PHC version:

Method | Job | EnvironmentVariables | Iterations | RamKilobytes | Mean | Error | StdDev | Median | Ratio | RatioSD | Gen0 | Gen1 | Gen2 | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
GetHashAsync | Job-UTPDAE | COMPlus_EnableSSE2=0 | 1 | 65536 | 78.49 ms | 1.546 ms | 2.941 ms | 77.60 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 64.57 MB | 1.00 |
GetHashAsync | Job-MLBAUK | Empty | 1 | 65536 | 46.31 ms | 0.920 ms | 2.654 ms | 45.58 ms | 0.60 | 0.05 | 1000.0000 | 1000.0000 | 1000.0000 | 64.57 MB | 1.00 |
GetHashAsync | Job-UTPDAE | COMPlus_EnableSSE2=0 | 1 | 73728 | 89.21 ms | 1.768 ms | 2.478 ms | 88.82 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 72.61 MB | 1.00 |
GetHashAsync | Job-MLBAUK | Empty | 1 | 73728 | 55.02 ms | 1.091 ms | 2.614 ms | 54.52 ms | 0.62 | 0.03 | 1000.0000 | 1000.0000 | 1000.0000 | 72.62 MB | 1.00 |
GetHashAsync | Job-UTPDAE | COMPlus_EnableSSE2=0 | 1 | 81920 | 100.68 ms | 2.007 ms | 3.515 ms | 99.52 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 80.64 MB | 1.00 |
GetHashAsync | Job-MLBAUK | Empty | 1 | 81920 | 61.48 ms | 1.227 ms | 2.393 ms | 60.59 ms | 0.61 | 0.03 | 1000.0000 | 1000.0000 | 1000.0000 | 80.62 MB | 1.00 |
GetHashAsync | Job-UTPDAE | COMPlus_EnableSSE2=0 | 1 | 90112 | 109.50 ms | 2.186 ms | 3.530 ms | 107.99 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 88.67 MB | 1.00 |
GetHashAsync | Job-MLBAUK | Empty | 1 | 90112 | 65.04 ms | 1.291 ms | 2.361 ms | 64.42 ms | 0.60 | 0.03 | 1000.0000 | 1000.0000 | 1000.0000 | 88.65 MB | 1.00 |
GetHashAsync | Job-UTPDAE | COMPlus_EnableSSE2=0 | 6 | 65536 | 391.36 ms | 7.696 ms | 11.037 ms | 386.65 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 64.6 MB | 1.00 |
GetHashAsync | Job-MLBAUK | Empty | 6 | 65536 | 197.04 ms | 3.842 ms | 4.111 ms | 196.55 ms | 0.50 | 0.02 | 1000.0000 | 1000.0000 | 1000.0000 | 64.61 MB | 1.00 |
GetHashAsync | Job-UTPDAE | COMPlus_EnableSSE2=0 | 6 | 73728 | 428.48 ms | 5.384 ms | 4.773 ms | 428.75 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 72.64 MB | 1.00 |
GetHashAsync | Job-MLBAUK | Empty | 6 | 73728 | 231.74 ms | 3.331 ms | 3.420 ms | 231.15 ms | 0.54 | 0.01 | 1000.0000 | 1000.0000 | 1000.0000 | 72.64 MB | 1.00 |
GetHashAsync | Job-UTPDAE | COMPlus_EnableSSE2=0 | 6 | 81920 | 476.48 ms | 5.884 ms | 5.504 ms | 475.71 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 80.67 MB | 1.00 |
GetHashAsync | Job-MLBAUK | Empty | 6 | 81920 | 262.77 ms | 3.401 ms | 3.182 ms | 263.35 ms | 0.55 | 0.01 | 1000.0000 | 1000.0000 | 1000.0000 | 80.65 MB | 1.00 |
GetHashAsync | Job-UTPDAE | COMPlus_EnableSSE2=0 | 6 | 90112 | 522.82 ms | 3.090 ms | 2.580 ms | 523.01 ms | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 88.7 MB | 1.00 |
GetHashAsync | Job-MLBAUK | Empty | 6 | 90112 | 290.81 ms | 3.877 ms | 3.437 ms | 290.16 ms | 0.56 | 0.01 | 1000.0000 | 1000.0000 | 1000.0000 | 88.73 MB | 1.00 |
I think that if I'm reading the numbers right, they're pretty much the same? Which is what I would sort of expect. The implementations actually seem roughly the same overall. The PHC version is just easier to read and link back to the reference implementation.
> I think that if I'm reading the numbers right, they're pretty much the same? Which is what I would sort of expect. The implementations actually seem roughly the same overall.
The ratio seems to be lower for the older version. In my runs PHC came in at 0.55-0.60, with the older one at 0.40-0.55.
I couldn't figure out why the two appeared to be different. After giving it some thought, it could be due to a combination of missing `AggressiveInlining`, loop unrolling and possible JIT funkiness with the rotr methods.
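As a rough illustration of the inlining point (an assumption about what might help, not a confirmed cause of the gap), a rotr helper can be marked for aggressive inlining and delegate to `BitOperations.RotateRight`, which the JIT can typically turn into a rotate instruction. The class and method names here are hypothetical.

```csharp
using System;
using System.Numerics;
using System.Runtime.CompilerServices;

public static class RotrSketch
{
    // Asks the JIT to inline the call into the hot loop rather than
    // emitting a method call for every rotation.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static ulong Rotr(ulong x, int n) => BitOperations.RotateRight(x, n);

    // Hand-written shift/or form, kept here only for comparison.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static ulong RotrManual(ulong x, int n) => (x >> n) | (x << (64 - n));
}
```

Both forms return the same values for the rotation counts the compress function uses (32, 24, 16 and 63), so swapping one for the other in a benchmark run is a cheap way to check whether the rotr methods matter.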
> The PHC version is just easier to read and link back to the reference implementation.
100% agree, no idea why I didn't look for the official version sooner.
Squashed branch is at https://github.com/kmaragon/Konscious.Security.Cryptography/tree/feature/intrinsics. I'm also calling this 2.0 and getting rid of the ability to explicitly integrate the tasks and just pushing it all into Parallel.ForEach with no async contracts. From the issues it seems like no one is able to use the async contracts. Or maybe it's just that they are the loudest bunch. Either way, I'll remove them entirely.
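To make the shape of that change concrete, here is a minimal sketch of a synchronous, lane-parallel fill using `Parallel.ForEach`. It is not the code on the branch: `FillLanes` and the per-lane arithmetic are hypothetical placeholders for the real segment-filling work.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

public static class LaneFillSketch
{
    // Hypothetical stand-in for filling each lane. Returns one value
    // per lane so the result can be inspected.
    public static ulong[] FillLanes(int lanes, int blocksPerLane)
    {
        var results = new ulong[lanes];

        // Parallel.ForEach blocks the caller until every lane has finished,
        // so no Task or async method is exposed to the caller.
        Parallel.ForEach(Enumerable.Range(0, lanes), lane =>
        {
            ulong acc = (ulong)lane;
            for (int i = 0; i < blocksPerLane; i++)
            {
                // Placeholder arithmetic standing in for the compress call.
                acc = unchecked(acc * 6364136223846793005UL + 1442695040888963407UL);
            }
            results[lane] = acc; // each lane writes only its own slot
        });

        return results;
    }
}
```

Each lane touches only its own array slot, so the loop body needs no locking.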
I've implemented the modifiedblake2 stuff there for ARM. I'd like to get SSE4 in there too for good measure. It'll probably be similar to the ARM NEON implementation. I was looking at saucecontrol's blake2 work and it's strictly for x86. That said, their SSE4 implementation may serve as a reasonable base for AdvSimd support. .NET 8.0 is looking to be adding support for AVX512 but I see no word on ARMv9 SVE2 yet. I expect the latter to be the biggest bump in perf for users on the hardware. Maybe the next gen Apple chips?
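A sketch of how one code path per instruction set could be selected at runtime, assuming the usual `IsSupported` checks from `System.Runtime.Intrinsics`. The class, method and returned labels are hypothetical, and `Sse41` is my assumption for what "SSE4" would mean here.

```csharp
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;

public static class CompressDispatchSketch
{
    // Returns a label for the path that would be chosen on this machine.
    // A real implementation would call the matching compress routine.
    public static string SelectImplementation()
    {
        // IsSupported is a JIT-time constant, so branches that do not
        // apply to the current CPU are normally dropped from the output.
        if (Avx2.IsSupported) return "avx2";
        if (Sse41.IsSupported) return "sse4.1";
        if (AdvSimd.IsSupported) return "advsimd";
        return "scalar";
    }
}
```

Checking the widest instruction set first and falling through to scalar code keeps the library usable on hardware with none of these extensions.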
Very cool to see this work going on here 👍
I'm following the AVX-512 work in .NET 8 closely and will be using my blake2 project for testing once more of the instructions are available in the API. Keep an eye out for updates this year if you're interested.
Arm SVE in .NET is probably a ways off. They're not prioritizing it at the moment because hardware implementing it isn't widely available.
I've been playing around with intrinsics and thought that this project would benefit from parallelization. By adding ModifiedBlake2Intrinsics to parallelize the shuffle I experienced performance increases of 40-55%.
Argon
Without intrinsics
With Intrinsics
I've only changed Argon but from looking at Blake2Fast you could probably get some performance gains for normal blake usage.
I'm new to intrinsics and haven't added any tests but this should demonstrate the potential benefits.