Open munroesj52 opened 4 years ago
@munroesj52
Do you have questions, or is this just a ping?
OK, this looks bad:
size_t Rijndael_Enc_AdvancedProcessBlocks128_6x1_ALTIVEC(const word32 *subKeys, size_t rounds,
        const byte *inBlocks, const byte *xorBlocks, byte *outBlocks, size_t length, word32 flags)
{
    return AdvancedProcessBlocks128_6x1_ALTIVEC(POWER8_Enc_Block, POWER8_Enc_6_Blocks,
        subKeys, rounds, inBlocks, xorBlocks, outBlocks, length, flags);
}
etc., passing pointers to thunks, which are then called from:
template <typename F1, typename F6, typename W>
inline size_t AdvancedProcessBlocks128_6x1_ALTIVEC(F1 func1, F6 func6,
        const W *subKeys, size_t rounds, const byte *inBlocks,
        const byte *xorBlocks, byte *outBlocks, size_t length, word32 flags)
as
func1(block, subKeys, rounds);
and
func6(block0, block1, block2, block3, block4, block5, subKeys, rounds);
These should be inlined into AdvancedProcessBlocks128_6x1_ALTIVEC_PWR8
so that the compiler and the POWER8 superscalar, out-of-order processor can effectively queue up the data and keep its pipelines (there are 16 of them) filled.
Add __attribute__((flatten)) and __attribute__((optimize("unroll-loops")))
on top of that!
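A minimal sketch of what that could look like, not the Crypto++ code itself. ToyEncBlockFn, ToyProcessBlocks and SKETCH_FLATTEN_UNROLL are hypothetical names invented for illustration, and the XOR "round" is only a stand-in for the real AES kernel. It passes the kernel as a functor type rather than a function pointer, so the call target is known at compile time, and applies the two GCC attributes to the driver. The optimize attribute is GCC-specific, so the macro drops it for other compilers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper macro: flatten asks the compiler to inline every call
// made inside the function; optimize("unroll-loops") is GCC-only.
#if defined(__GNUC__) && !defined(__clang__)
# define SKETCH_FLATTEN_UNROLL __attribute__((flatten, optimize("unroll-loops")))
#elif defined(__clang__)
# define SKETCH_FLATTEN_UNROLL __attribute__((flatten))
#else
# define SKETCH_FLATTEN_UNROLL
#endif

// Toy stand-in for a per-block kernel such as POWER8_Enc_Block.
// A functor type gives the driver a direct, inlinable call target.
struct ToyEncBlockFn
{
    void operator()(std::uint32_t *block, const std::uint32_t *subKeys,
                    std::size_t rounds) const
    {
        for (std::size_t r = 0; r < rounds; ++r)
            for (std::size_t i = 0; i < 4; ++i)
                block[i] ^= subKeys[4 * r + i];
    }
};

// Driver loop: processes 4-word blocks in place and returns the number of
// words left over (0 when the input is a whole number of blocks).
template <typename F1>
SKETCH_FLATTEN_UNROLL
std::size_t ToyProcessBlocks(F1 func1, const std::uint32_t *subKeys,
                             std::size_t rounds, std::uint32_t *blocks,
                             std::size_t words)
{
    assert(words % 4 == 0);
    for (; words >= 4; words -= 4, blocks += 4)
        func1(blocks, subKeys, rounds);
    return words;
}
```

A call would look like `ToyProcessBlocks(ToyEncBlockFn(), keys, rounds, data, words)`. The attributes are requests, not guarantees, so it is worth checking the disassembly to confirm the kernel really was inlined into the driver.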
I compiled cryptopp for POWER8 (Ubuntu 18.04) and profiled (perf record) cryptest b.
Rijndael_Enc_AdvancedProcessBlocks128_6x1_ALTIVEC
is 2nd in the list at 4.36% (Baseline_Multiply16 is #1 at 5.74%).
The vcipher/vcipherlast instructions barely register at ~9.5% (of the 4.36%). The rest is data fumbling (load/store/permute), plus a lot of branchy code dealing with data alignment. A place to start is to pass the parameters in registers (the ABI allows up to 12 vector register parameters, including small arrays and structs of up to 8 registers each) and move the data handling into the driver loop. This might allow some loop unrolling and load look-ahead in the driver functions.
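As a rough, portable sketch of that restructuring (not the Crypto++ code): the kernel below takes and returns blocks by value, so it never touches memory, while the driver owns every load and store and is unrolled by two. Block128, ToyRound and ToyDriver are hypothetical names, Block128 is only a stand-in for a 128-bit vector register type, and the XOR is a placeholder for the real cipher round.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a 128-bit vector register type.
struct Block128 { std::uint32_t w[4]; };

// Kernel: values in, value out. No pointers, no alignment branches.
inline Block128 ToyRound(Block128 b, Block128 key)
{
    for (int i = 0; i < 4; ++i)
        b.w[i] ^= key.w[i];
    return b;
}

// Driver: all loads and stores live here. memcpy tolerates unaligned
// pointers, so the kernel needs no alignment handling of its own.
inline void ToyDriver(const unsigned char *in, unsigned char *out,
                      std::size_t blocks, Block128 key)
{
    std::size_t i = 0;
    // Two blocks per iteration: both loads are issued before either result
    // is needed, which gives the hardware independent work to overlap.
    for (; i + 2 <= blocks; i += 2)
    {
        Block128 b0, b1;
        std::memcpy(&b0, in + 16 * i, 16);
        std::memcpy(&b1, in + 16 * (i + 1), 16);
        b0 = ToyRound(b0, key);
        b1 = ToyRound(b1, key);
        std::memcpy(out + 16 * i, &b0, 16);
        std::memcpy(out + 16 * (i + 1), &b1, 16);
    }
    // Tail: at most one remaining block.
    for (; i < blocks; ++i)
    {
        Block128 b;
        std::memcpy(&b, in + 16 * i, 16);
        b = ToyRound(b, key);
        std::memcpy(out + 16 * i, &b, 16);
    }
}
```

On POWER8 the struct would presumably be replaced by a native vector type so the values travel in vector registers; the point of the sketch is only the split between a memory-free kernel and a driver that does the data handling.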
I also looked at Baseline_Multiply16. It's not the multiplies. The sums are taking the time, and there is only one carry bit in the XER. POWER9 adds a second carry, but it's a bit awkward to use (it has to be cleared before use).
Take a look at PVECLIB, especially the quadword multiplies and multiple quadword precision multiplies vec_muludq and vec_mul512x512.
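To illustrate why the sums serialize, here is a generic multi-word addition in portable C++. It is not the Baseline_Multiply16 code, and AddWords is a hypothetical name. Each iteration needs the carry produced by the previous one, so the additions form a single dependency chain, which is the pattern that competes for one carry bit.

```cpp
#include <cstddef>
#include <cstdint>

// Adds the n-word little-endian integers a and b into out and returns the
// final carry (0 or 1). The carry variable links every iteration to the
// one before it.
inline std::uint64_t AddWords(std::uint64_t *out, const std::uint64_t *a,
                              const std::uint64_t *b, std::size_t n)
{
    std::uint64_t carry = 0;
    for (std::size_t i = 0; i < n; ++i)
    {
        std::uint64_t s = a[i] + carry;
        std::uint64_t c1 = (s < carry);   // wrapped while adding the carry
        s += b[i];
        std::uint64_t c2 = (s < b[i]);    // wrapped while adding b[i]
        out[i] = s;
        carry = c1 | c2;                  // at most one of the two can be set
    }
    return carry;
}
```

Independent accumulations, such as separate columns of a product, can run in parallel only if their carries are tracked separately, which is where vector registers or a second carry bit help.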
Your performance problems may be related to load/store and not the crypto operations. The SHA ops are all listed in the User Manual as 2 cycles and AES as 6-7 cycles.
Make sure your compiler is actually inlining the ops. For GCC use __attribute__((flatten)). Also, you may need to unroll your loops a bit. For GCC use __attribute__((optimize("unroll-loops"))).
If you are still disappointed, you can use performance tools such as the performance simulator and PipeStat to find the bottlenecks.