Intrinsics experimental POC

TimothyMakkison commented 1 year ago

I've been playing around with intrinsics and thought that this project would benefit from parallelization. By adding ModifiedBlake2Intrinsics to parallelize the shuffle I experienced performance increases of 40-55%.

Argon

Without intrinsics

With Intrinsics

I've only changed Argon, but from looking at Blake2Fast you could probably get some performance gains for normal Blake2 usage too.

I'm new to intrinsics and haven't added any tests but this should demonstrate the potential benefits.

TimothyMakkison commented 1 year ago

Blake

Beforehand - non-intrinsics version (with reduced memory usage)

Intrinsics refactor d90f3a6

Using code from @saucecontrol's Blake2Fast

Adapted Blake2bSimd to use intrinsics in d90f3a6. It's arguably more readable than the Blake2Fast version but slightly slower; both are at least 4x faster than the current version.

Changed Blake2bNormal to use stackalloc for v and m, reducing memory usage to a flat 416 bytes.

Blake2bNormal before stackalloc

kmaragon commented 1 year ago

I'm just getting some time to look at this repo for the first time in, like, years. I appreciate this PR so much. This is exactly what I wanted to do with this library from the start, but I was a C++ developer working on .NET Core 1.0 on Linux. I'm going to re-open this with your commits squashed while trying to add ARM64/SVE support and tidying up the CI / versioning stuff before publishing. Thank you so much for this!

kmaragon commented 1 year ago

Also adding a note here that .NET 8 is beginning to introduce support for AVX-512. Taking advantage of it will require an update, but it should improve things even more, assuming newer-generation chips don't suffer from the power draw issues that AVX-512 extensions have historically had.

TimothyMakkison commented 1 year ago

Glad you liked it! 😄

ModifiedBlake2Intrinsics was largely done at 2 AM. It passes the tests, but I have no idea if it is safe, and I'm sure the performance can be improved. I had meant to compare this to a proper SIMD Argon2id implementation in Rust, C or C++. It looks like they have two versions of the diagonalize and G functions, whereas I shuffled the vectors, repeatedly using the same G functions.
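For readers following the "shuffle the vectors and reuse G" idea, here is a minimal scalar sketch in Python, not the PR's C# code. It assumes the G function and round structure of the Argon2 reference implementation (the BlaMka multiply-add variant of BLAKE2b's G); the names `round_direct` and `round_shuffled` are hypothetical and used only for illustration. One version applies G to the column and diagonal indices directly, the other applies a single column step, rotates the rows to bring the diagonals into columns, and reuses that same column step.

```python
MASK64 = (1 << 64) - 1


def rotr64(x, n):
    """Rotate a 64-bit word right by n bits."""
    return ((x >> n) | (x << (64 - n))) & MASK64


def f_blamka(x, y):
    """Argon2's multiply-add: x + y + 2 * lo32(x) * lo32(y), mod 2^64."""
    return (x + y + 2 * (x & 0xFFFFFFFF) * (y & 0xFFFFFFFF)) & MASK64


def g(a, b, c, d):
    """Quarter-round on four 64-bit words (Argon2's modified BLAKE2b G)."""
    a = f_blamka(a, b); d = rotr64(d ^ a, 32)
    c = f_blamka(c, d); b = rotr64(b ^ c, 24)
    a = f_blamka(a, b); d = rotr64(d ^ a, 16)
    c = f_blamka(c, d); b = rotr64(b ^ c, 63)
    return a, b, c, d


def round_direct(v):
    """Hypothetical helper: index columns, then diagonals, explicitly."""
    v = list(v)
    for a, b, c, d in [(0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15),
                       (0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14)]:
        v[a], v[b], v[c], v[d] = g(v[a], v[b], v[c], v[d])
    return v


def rotate_left(row, n):
    """Rotate a 4-element row left by n positions (stand-in for a SIMD shuffle)."""
    return row[n:] + row[:n]


def column_step(rows):
    """Apply g to each of the four columns of a 4x4 row-major state."""
    cols = [g(rows[0][i], rows[1][i], rows[2][i], rows[3][i]) for i in range(4)]
    return [[cols[i][r] for i in range(4)] for r in range(4)]


def round_shuffled(v):
    """Hypothetical helper: same round, using row rotations plus one column step."""
    rows = [list(v[i:i + 4]) for i in range(0, 16, 4)]
    rows = column_step(rows)
    rows = [rotate_left(rows[r], r) for r in range(4)]            # diagonalize
    rows = column_step(rows)
    rows = [rotate_left(rows[r], (4 - r) % 4) for r in range(4)]  # undiagonalize
    return [w for row in rows for w in row]


state = [(i * 0x9E3779B97F4A7C15 + 1) & MASK64 for i in range(16)]
print(round_direct(state) == round_shuffled(state))
```

Both forms compute the same round. In a vectorized version the trade-off is between the cost of the extra shuffles and the cost of keeping a second, diagonal-specific G; which one wins depends on the instruction set and the JIT, so it is worth measuring.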

Blake2bSimd should be solid as it's from SauceControl/Blake2Fast; the only modification is the use of spans.

TimothyMakkison commented 1 year ago

Added the PHC compress function. It passes the tests but the old hacky version appears to be faster?

Hacky version

Method	Job	EnvironmentVariables	Iterations	RamKilobytes	Mean	Error	StdDev	Median	Ratio	RatioSD	Gen0	Gen1	Gen2	Allocated	Alloc Ratio
GetHashAsync	Job-VMNFSH	COMPlus_EnableSSE2=0	1	65536	80.01 ms	1.597 ms	3.226 ms	79.12 ms	1.00	0.00	1000.0000	1000.0000	1000.0000	64.58 MB	1.00
GetHashAsync	Job-VWRWYY	Empty	1	65536	42.65 ms	1.023 ms	2.967 ms	42.10 ms	0.55	0.05	1000.0000	1000.0000	1000.0000	64.59 MB	1.00

GetHashAsync	Job-VMNFSH	COMPlus_EnableSSE2=0	1	73728	90.40 ms	1.767 ms	2.534 ms	89.53 ms	1.00	0.00	1000.0000	1000.0000	1000.0000	72.61 MB	1.00
GetHashAsync	Job-VWRWYY	Empty	1	73728	46.02 ms	0.917 ms	2.432 ms	45.32 ms	0.52	0.03	1000.0000	1000.0000	1000.0000	72.6 MB	1.00

GetHashAsync	Job-VMNFSH	COMPlus_EnableSSE2=0	1	81920	101.67 ms	2.030 ms	3.336 ms	100.79 ms	1.00	0.00	1000.0000	1000.0000	1000.0000	80.64 MB	1.00
GetHashAsync	Job-VWRWYY	Empty	1	81920	52.65 ms	1.482 ms	4.229 ms	51.23 ms	0.54	0.06	1000.0000	1000.0000	1000.0000	80.64 MB	1.00

GetHashAsync	Job-VMNFSH	COMPlus_EnableSSE2=0	1	90112	108.64 ms	1.813 ms	1.514 ms	108.43 ms	1.00	0.00	1000.0000	1000.0000	1000.0000	88.67 MB	1.00
GetHashAsync	Job-VWRWYY	Empty	1	90112	54.80 ms	1.080 ms	2.566 ms	54.04 ms	0.51	0.02	1000.0000	1000.0000	1000.0000	88.68 MB	1.00

GetHashAsync	Job-VMNFSH	COMPlus_EnableSSE2=0	6	65536	381.93 ms	2.475 ms	1.933 ms	382.09 ms	1.00	0.00	1000.0000	1000.0000	1000.0000	64.61 MB	1.00
GetHashAsync	Job-VWRWYY	Empty	6	65536	162.35 ms	2.922 ms	4.374 ms	162.10 ms	0.42	0.01	1000.0000	1000.0000	1000.0000	64.61 MB	1.00

GetHashAsync	Job-VMNFSH	COMPlus_EnableSSE2=0	6	73728	438.29 ms	6.539 ms	6.116 ms	437.29 ms	1.00	0.00	1000.0000	1000.0000	1000.0000	72.65 MB	1.00
GetHashAsync	Job-VWRWYY	Empty	6	73728	184.10 ms	3.537 ms	4.211 ms	182.78 ms	0.42	0.01	1000.0000	1000.0000	1000.0000	72.64 MB	1.00

GetHashAsync	Job-VMNFSH	COMPlus_EnableSSE2=0	6	81920	483.11 ms	4.978 ms	4.657 ms	483.11 ms	1.00	0.00	1000.0000	1000.0000	1000.0000	80.67 MB	1.00
GetHashAsync	Job-VWRWYY	Empty	6	81920	199.99 ms	3.996 ms	4.602 ms	199.00 ms	0.42	0.01	1000.0000	1000.0000	1000.0000	80.67 MB	1.00

GetHashAsync	Job-VMNFSH	COMPlus_EnableSSE2=0	6	90112	532.28 ms	4.697 ms	4.394 ms	532.31 ms	1.00	0.00	1000.0000	1000.0000	1000.0000	88.71 MB	1.00
GetHashAsync	Job-VWRWYY	Empty	6	90112	218.81 ms	3.116 ms	2.914 ms	218.86 ms	0.41	0.01	1000.0000	1000.0000	1000.0000	88.71 MB	1.00

PHC

Method	Job	EnvironmentVariables	Iterations	RamKilobytes	Mean	Error	StdDev	Median	Ratio	RatioSD	Gen0	Gen1	Gen2	Allocated	Alloc Ratio
GetHashAsync	Job-UTPDAE	COMPlus_EnableSSE2=0	1	65536	78.49 ms	1.546 ms	2.941 ms	77.60 ms	1.00	0.00	1000.0000	1000.0000	1000.0000	64.57 MB	1.00
GetHashAsync	Job-MLBAUK	Empty	1	65536	46.31 ms	0.920 ms	2.654 ms	45.58 ms	0.60	0.05	1000.0000	1000.0000	1000.0000	64.57 MB	1.00

GetHashAsync	Job-UTPDAE	COMPlus_EnableSSE2=0	1	73728	89.21 ms	1.768 ms	2.478 ms	88.82 ms	1.00	0.00	1000.0000	1000.0000	1000.0000	72.61 MB	1.00
GetHashAsync	Job-MLBAUK	Empty	1	73728	55.02 ms	1.091 ms	2.614 ms	54.52 ms	0.62	0.03	1000.0000	1000.0000	1000.0000	72.62 MB	1.00

GetHashAsync	Job-UTPDAE	COMPlus_EnableSSE2=0	1	81920	100.68 ms	2.007 ms	3.515 ms	99.52 ms	1.00	0.00	1000.0000	1000.0000	1000.0000	80.64 MB	1.00
GetHashAsync	Job-MLBAUK	Empty	1	81920	61.48 ms	1.227 ms	2.393 ms	60.59 ms	0.61	0.03	1000.0000	1000.0000	1000.0000	80.62 MB	1.00

GetHashAsync	Job-UTPDAE	COMPlus_EnableSSE2=0	1	90112	109.50 ms	2.186 ms	3.530 ms	107.99 ms	1.00	0.00	1000.0000	1000.0000	1000.0000	88.67 MB	1.00
GetHashAsync	Job-MLBAUK	Empty	1	90112	65.04 ms	1.291 ms	2.361 ms	64.42 ms	0.60	0.03	1000.0000	1000.0000	1000.0000	88.65 MB	1.00

GetHashAsync	Job-UTPDAE	COMPlus_EnableSSE2=0	6	65536	391.36 ms	7.696 ms	11.037 ms	386.65 ms	1.00	0.00	1000.0000	1000.0000	1000.0000	64.6 MB	1.00
GetHashAsync	Job-MLBAUK	Empty	6	65536	197.04 ms	3.842 ms	4.111 ms	196.55 ms	0.50	0.02	1000.0000	1000.0000	1000.0000	64.61 MB	1.00

GetHashAsync	Job-UTPDAE	COMPlus_EnableSSE2=0	6	73728	428.48 ms	5.384 ms	4.773 ms	428.75 ms	1.00	0.00	1000.0000	1000.0000	1000.0000	72.64 MB	1.00
GetHashAsync	Job-MLBAUK	Empty	6	73728	231.74 ms	3.331 ms	3.420 ms	231.15 ms	0.54	0.01	1000.0000	1000.0000	1000.0000	72.64 MB	1.00

GetHashAsync	Job-UTPDAE	COMPlus_EnableSSE2=0	6	81920	476.48 ms	5.884 ms	5.504 ms	475.71 ms	1.00	0.00	1000.0000	1000.0000	1000.0000	80.67 MB	1.00
GetHashAsync	Job-MLBAUK	Empty	6	81920	262.77 ms	3.401 ms	3.182 ms	263.35 ms	0.55	0.01	1000.0000	1000.0000	1000.0000	80.65 MB	1.00

GetHashAsync	Job-UTPDAE	COMPlus_EnableSSE2=0	6	90112	522.82 ms	3.090 ms	2.580 ms	523.01 ms	1.00	0.00	1000.0000	1000.0000	1000.0000	88.7 MB	1.00
GetHashAsync	Job-MLBAUK	Empty	6	90112	290.81 ms	3.877 ms	3.437 ms	290.16 ms	0.56	0.01	1000.0000	1000.0000	1000.0000	88.73 MB	1.00

kmaragon commented 1 year ago

I think that if I'm reading the numbers right, they're pretty much the same? Which is what I would sort of expect. The implementations actually seem roughly the same overall. The PHC version is just easier to read and link back to the reference implementation.

TimothyMakkison commented 1 year ago

> I think that if I'm reading the numbers right, they're pretty much the same? Which is what I would sort of expect. The implementations actually seem roughly the same overall.

Ratio seems to be lower for the older version. In my runs PHC seemed to be 0.55-0.60, with the older one at 0.40-0.55. I couldn't figure out why the two appeared to be different. After giving it some thought, it could be due to a combination of missing AggressiveInlining, loop unrolling and possible JIT funkiness with the rotr methods.
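Since the Ratio column only compares each version against its own baseline, a head-to-head comparison can help. The sketch below is a small Python calculation, not part of the PR. It copies the Mean values of the "Empty" job rows (no COMPlus_EnableSSE2=0 override, so assumed to be the intrinsics-enabled runs) from the two tables above and divides PHC by the hacky version for each configuration. The dictionary names are made up for this example.

```python
# (Iterations, RamKilobytes) -> Mean in ms for the "Empty" job rows,
# copied from the "Hacky version" and "PHC" tables above.
hacky_ms = {
    (1, 65536): 42.65, (1, 73728): 46.02, (1, 81920): 52.65, (1, 90112): 54.80,
    (6, 65536): 162.35, (6, 73728): 184.10, (6, 81920): 199.99, (6, 90112): 218.81,
}
phc_ms = {
    (1, 65536): 46.31, (1, 73728): 55.02, (1, 81920): 61.48, (1, 90112): 65.04,
    (6, 65536): 197.04, (6, 73728): 231.74, (6, 81920): 262.77, (6, 90112): 290.81,
}

# A factor above 1.0 means the PHC version took longer than the hacky one.
slowdowns = {key: phc_ms[key] / hacky_ms[key] for key in hacky_ms}

for (iterations, ram_kb), factor in sorted(slowdowns.items()):
    print(f"iterations={iterations} ram={ram_kb} KB: PHC / hacky = {factor:.2f}")
```

On these numbers the PHC mean is higher in every row, by roughly 9% to 33%, with the larger gaps at 6 iterations. The single-iteration rows have standard deviations of about 2 to 4 ms, so the smallest gaps there are close to the noise.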

> The PHC version is just easier to read and link back to the reference implementation.

100% agree, no idea why I didn't look for the official version sooner.

kmaragon commented 1 year ago

Squashed branch is at https://github.com/kmaragon/Konscious.Security.Cryptography/tree/feature/intrinsics. I'm also calling this 2.0 and getting rid of the ability to explicitly integrate the tasks and just pushing it all into Parallel.ForEach with no async contracts. From the issues it seems like no one is able to use the async contracts. Or maybe it's just that they are the loudest bunch. Either way, I'll remove them entirely.

I've implemented the modifiedblake2 stuff there for ARM. I'd like to get SSE4 in there too for good measure. It'll probably be similar to the ARM NEON implementation. I was looking at saucecontrol's blake2 work and it's strictly for x86. That said, their SSE4 implementation may serve as a reasonable base for AdvSimd support. .NET 8.0 is looking to be adding support for AVX512 but I see no word on ARMv9 SVE2 yet. I expect the latter to be the biggest bump in perf for users on the hardware. Maybe the next gen Apple chips?

saucecontrol commented 1 year ago

Very cool to see this work going on here 👍

I'm following the AVX-512 work in .NET 8 closely and will be using my blake2 project for testing once more of the instructions are available in the API. Keep an eye out for updates this year if you're interested.

Arm SVE in .NET is probably a ways off. They're not prioritizing it at the moment because hardware implementing it isn't widely available.
