kokke / tiny-bignum-c

Small portable multiple-precision unsigned integer arithmetic in C
The Unlicense
424 stars 85 forks source link

Optimization proposal for divisions #12

Open mhaeuser opened 5 years ago

mhaeuser commented 5 years ago

The current division algorithm uses a BIGNUM structure for 'current' while its task can be completed with natural integer arithmetics. In my local fork (sorry, it's still a hot mess), I have changed the algorithm as follows, where several names and small things have been altered, but should still be recognizable fine:

  //
  // Use an integer of natural size to store the current Bit index as 'current'
  // is always a 2's potency. While a BIGNUM can theoretically hold a
  // 2's potency eight times larger than what can represent as Bit index with a
  // natural integer (Bytes vs Bits), this cannot happen within this function
  // as 'a' aligned to the next 2's potency would need to be just as big for
  // this to be the case. This cannot happen due to the address space
  // limitation.
  //
  size_t currentBitIndex = 0;                         // int current = 1;         
  bignum_u_assign_const(denom, b);                    // denom = b
  bignum_u_assign_const(tmp, a);                      // tmp   = a

  const DTYPE half_max = 1 + (DTYPE)(((1ULL << (8 * WORD_SIZE)) - 1U) / 2);
  bool overflow = false;
  while (bignum_u_cmp(denom, a) != LARGER)            // while (denom <= a) {
  {
    if (denom->d[BN_U_DLEN(denom) - 1] >= half_max)
    {
      overflow = true;
      break;
    }
    ++currentBitIndex;                                //   current <<= 1;                 
    _lshift_bit_const(denom, 1);                      //   denom <<= 1;
  }
  if (!overflow)
  {
    _rshift_bit_const(denom, 1);                      // denom >>= 1;
    --currentBitIndex;                                // current >>= 1;                 
  }
  bignum_u_init_const(c);                             // int answer = 0;
  //
  // currentBitIndex cannot add-wraparound to reach this value as reasoned in
  // the comment before.
  //
  while(currentBitIndex != (size_t)0U - 1U)           // while (current != 0)
  {
    if (bignum_u_cmp(tmp, denom) != SMALLER)          //   if (dividend >= denom)
    {
      bignum_u_sub(tmp, denom, tmp);                  //     dividend -= denom;            
      bignum_u_or_int (c, 1, currentBitIndex);        //     answer |= current;
    }
    --currentBitIndex;                                //   current >>= 1;
    _rshift_bit_const(denom, 1);                      //   denom >>= 1;
  } 

Along with this newly introduced function:

void bignum_u_or_int(ubn* a, DTYPE v, size_t nbits)
{
  require(a, "a is null");

  const uint8_t nbits_pr_word = (WORD_SIZE * 8);
  size_t nwords   = nbits / nbits_pr_word;
  uint8_t  nwordbits = nbits % nbits_pr_word;

  if (nwords >= BN_U_DLEN (a)) {
    return;
  }

  a->d[nwords] |= (v << nwordbits);
}

BN_U_DLEN returns the array size of the current sturcture instance, which I added because I cannot affort having every BIGNUM at the maximum required size among all.

This new function is also a really important one performance-wise for different tasks as it allows super-fast construction of 2's potencies. For example, I also used it for the calculation of R², used for Montgomery arithmetics, where R is a 2's potency. The BoringSSL algorithm utilizes BIGNUM shift, pow and mul algorithms for its construction, when with this function, it can be achieved easily with just a preceeding BIGNUM init.

ylluminate commented 5 years ago

Curious @Download-Fritz have you run any speed tests to measure/monitor differentials from your optimizing upgrade?

mhaeuser commented 5 years ago

Good day,

Sorry, not for this specifically, this was just one of many changes I had been performing to multiple origin codes simultaneously. I'm not even sure whether I didn't actually optimise more in the end, unfortunately my usage purpose required hefty modifications that didn't allow me to keep it in a shape where I could PR changes. If you're curious, you can find it here: https://github.com/acidanthera/OcSupportPkg/tree/master/Library/OcCryptoLib (the license is project convention, I'm fine with my changes being considered CC0) - thanks again for your work.

Generally though, I faced significant speed boosts from eliminating BIGNUM operations in other places such as Montgomery parameter calculation, where my code went from overly significant (I think about 0.3 s?) to insignificant (I think about 0.002 s? hot functions were all conhost). For sufficiently large numbers / number of operations, I think it can be worth it.

Also, I just realised using size_t assumes a flat memory model, you might want to use uintptr_t instead.

mhaeuser commented 4 years ago

Sorry for the mess here, I'm a bit busy, but I'd still like to contribute something back from my changes here indirectly. I was looking to find an efficient modulo algorithm to replace the code I took from this project and ended up finding a vague description I then implemented. As it shaped, I realised it is basically your division algorithm. There are two points here:

1) Algorithmic optimization: I performed a couple of optimisations. One is replacing the "Current" BN with an integer as initially described, but I also sped up shifting by "skipping forward": https://github.com/acidanthera/OcSupportPkg/blob/3e9ef5ac412d94ff5999c881e64681bd8864bed2/Library/OcCryptoLib/BigNumPrimitives.c#L639 and https://github.com/acidanthera/OcSupportPkg/blob/3e9ef5ac412d94ff5999c881e64681bd8864bed2/Library/OcCryptoLib/BigNumPrimitives.c#L689. The optimisation is proven here: https://github.com/acidanthera/OcSupportPkg/blob/3e9ef5ac412d94ff5999c881e64681bd8864bed2/Library/OcCryptoLib/BigNumPrimitives.c#L632 (in context of the invariant: https://github.com/acidanthera/OcSupportPkg/blob/3e9ef5ac412d94ff5999c881e64681bd8864bed2/Library/OcCryptoLib/BigNumPrimitives.c#L624)

2) The modulus: Honestly, looking back, it is obvious, but I'd lie if I said this came to me while I worked with the existing algorithms, haha... Your existing division algorithm computes both the quotient and the modulus at the same time, while unfortunately latter is scrapped and re-computed from the quotient here: https://github.com/kokke/tiny-bignum-c/blob/master/bn.c#L414 The modulus is stored in bignum_div's "tmp" after the loop has terminated. I have proven this here: https://github.com/acidanthera/OcSupportPkg/blob/3e9ef5ac412d94ff5999c881e64681bd8864bed2/Library/OcCryptoLib/BigNumPrimitives.c#L743 This saves you a multiplication and a subtraction per modulo operation, sounds like a good deal.

Thank you for your project.

kokke commented 4 years ago

Hi @Download-Fritz and sorry for my blatant absence in this discussion.

Thank you for the contribution - I am really happy to see that the project is of use to others :) And thank you for the explanations of how to implement the proposed optimizations.

I have a busy schedule (don't we all?), so I will probably not get these changes implemented straight away - especially since I have to think a little first, and do some work myself - it's much easier to just merge a PR ;)

I will look into implementing the optimizations you suggest - they should provide a good performance boost.