Open mhaeuser opened 5 years ago
Curious @Download-Fritz have you run any speed tests to measure/monitor differentials from your optimizing upgrade?
Good day,
Sorry, not for this specifically, this was just one of many changes I had been performing to multiple origin codes simultaneously. I'm not even sure whether I didn't actually optimise more in the end, unfortunately my usage purpose required hefty modifications that didn't allow me to keep it in a shape where I could PR changes. If you're curious, you can find it here: https://github.com/acidanthera/OcSupportPkg/tree/master/Library/OcCryptoLib (the license is project convention, I'm fine with my changes being considered CC0) - thanks again for your work.
Generally though, I faced significant speed boosts from eliminating BIGNUM operations in other places such as Montgomery parameter calculation, where my code went from overly significant (I think about 0.3 s?) to insignificant (I think about 0.002 s? hot functions were all conhost). For sufficiently large numbers / number of operations, I think it can be worth it.
Also, I just realised using size_t assumes a flat memory model, you might want to use uintptr_t instead.
Sorry for the mess here, I'm a bit busy, but I'd still like to contribute something back from my changes here indirectly. I was looking to find an efficient modulo algorithm to replace the code I took from this project and ended up finding a vague description I then implemented. As it shaped, I realised it is basically your division algorithm. There are two points here:
1) Algorithmic optimization: I performed a couple of optimisations. One is replacing the "Current" BN with an integer as initially described, but I also sped up shifting by "skipping forward": https://github.com/acidanthera/OcSupportPkg/blob/3e9ef5ac412d94ff5999c881e64681bd8864bed2/Library/OcCryptoLib/BigNumPrimitives.c#L639 and https://github.com/acidanthera/OcSupportPkg/blob/3e9ef5ac412d94ff5999c881e64681bd8864bed2/Library/OcCryptoLib/BigNumPrimitives.c#L689. The optimisation is proven here: https://github.com/acidanthera/OcSupportPkg/blob/3e9ef5ac412d94ff5999c881e64681bd8864bed2/Library/OcCryptoLib/BigNumPrimitives.c#L632 (in context of the invariant: https://github.com/acidanthera/OcSupportPkg/blob/3e9ef5ac412d94ff5999c881e64681bd8864bed2/Library/OcCryptoLib/BigNumPrimitives.c#L624)
2) The modulus: Honestly, looking back, it is obvious, but I'd lie if I said this came to me while I worked with the existing algorithms, haha... Your existing division algorithm computes both the quotient and the modulus at the same time, while unfortunately latter is scrapped and re-computed from the quotient here: https://github.com/kokke/tiny-bignum-c/blob/master/bn.c#L414 The modulus is stored in bignum_div's "tmp" after the loop has terminated. I have proven this here: https://github.com/acidanthera/OcSupportPkg/blob/3e9ef5ac412d94ff5999c881e64681bd8864bed2/Library/OcCryptoLib/BigNumPrimitives.c#L743 This saves you a multiplication and a subtraction per modulo operation, sounds like a good deal.
Thank you for your project.
Hi @Download-Fritz and sorry for my blatant absence in this discussion.
Thank you for the contribution - I am really happy to see that the project is of use to others :) And thank you for the explanations of how to implement the proposed optimizations.
I have a busy schedule (don't we all?), so I will probably not get these changes implemented straight away - especially since I have to think a little first, and do some work myself - it's much easier to just merge a PR ;)
I will look into implementing the optimizations you suggest - they should provide a good performance boost.
The current division algorithm uses a BIGNUM structure for 'current' while its task can be completed with natural integer arithmetics. In my local fork (sorry, it's still a hot mess), I have changed the algorithm as follows, where several names and small things have been altered, but should still be recognizable fine:
Along with this newly introduced function:
BN_U_DLEN returns the array size of the current sturcture instance, which I added because I cannot affort having every BIGNUM at the maximum required size among all.
This new function is also a really important one performance-wise for different tasks as it allows super-fast construction of 2's potencies. For example, I also used it for the calculation of R², used for Montgomery arithmetics, where R is a 2's potency. The BoringSSL algorithm utilizes BIGNUM shift, pow and mul algorithms for its construction, when with this function, it can be achieved easily with just a preceeding BIGNUM init.