canonical_reordering_sign_euclidean() could be made faster

donhatch commented 2 months ago

It looks to me like canonical_reordering_sign_euclidean()'s running time is probably quadratic in the highest index of any bit set in bitmap_a (or maybe linear, if count_bits_set() is fast, i.e. if not _utils.DISABLE_JIT). I think this could be improved to be at most linear in the total number of set bits in the two input bitmasks (often a very small number like 1 or 2, I'm guessing), and in fact even smaller than that-- it can be made linear in the number of bits that end up participating in swaps.

(To be precise: it needs to iterate over only the bits that are set in the bitwise-OR (bitmask_a|bitmask_b) whose indices are strictly higher than the lowest bit set in bitmask_b, and strictly lower than the highest bit set in bitmask_a).

I haven't looked at the call graph very closely, nor have I done any benchmarks, but I suspect this function is called heavily during multiplications of multivectors, so it might be worth optimizing.

For reference, the current implementation is this:

def canonical_reordering_sign_euclidean(bitmap_a, bitmap_b):
    """ 
    Computes the sign for the product of bitmap_a and bitmap_b
    assuming a euclidean metric
    """
    a = bitmap_a >> 1
    sum_value = 0
    while a != 0:
        sum_value = sum_value + count_set_bits(a & bitmap_b)
        a = a >> 1
    if (sum_value & 1) == 0:
        return 1
    else:
        return -1

I think the following rewrite would work (sorry, I haven't tested it; I haven't learned how to build and test yet):

def canonical_reordering_sign_euclidean(bitmap_a, bitmap_b):
  """
  Computes the sign for the product of bitmap_a and bitmap_b
  assuming a euclidean metric
  """

  # Bits in bitmap_a at or lower than the lowest bit in bitmap_b don't matter,
  # so we start by zeroing them out to avoid wasting loop iterations on them.
  # (Note that if bitmap_b is 0, this just zeros out bitmap_a too, which is appropriate.)
  bitmap_a &= ~(((bitmap_b & -bitmap_b)<<1) - 1)

  sign = 1  # (-1)**(num needed swaps seen so far)
  bsign = 1  # (-1)**(num set bits seen in bitmap_b so far)
  remaining_union = bitmap_a | bitmap_b
  while (remaining_union & bitmap_a) != 0:
    lowest_remaining_bit = remaining_union & -remaining_union
    if (bitmap_a & lowest_remaining_bit) != 0:
      # Let i be the index of lowest_remaining_bit.
      # The number of set indices strictly lower than i in bitmap_b
      # has parity bsign, and the i in bitmap_a needs to move to the right
      # past each of them.
      sign *= bsign
    if (bitmap_b & lowest_remaining_bit) != 0:
      bsign *= -1
    remaining_union &= ~lowest_remaining_bit
  return sign

Also, I think the _utils.DISABLE_JIT implementation of the helper function count_set_bits() could be sped up pretty simply too (it's no longer needed in the above implementation of canonical_reordering_sign_euclidean(), but it's called by other things).

That's currently this:

def count_set_bits(bitmap: int) -> int:
    """ Counts the number of bits set to 1 in bitmap """
    count = 0 
    for i in set_bit_indices(bitmap):
        count += 1 
    return count

which could be rewritten as this:

def count_set_bits(bitmap: int) -> int:
  """ Counts the number of bits set to 1 in bitmap """
  # Kernighan's bit counting algorithm
  count = 0 
  while bitmap != 0:
    bitmap &= bitmap-1  # turn off lowest bit
    count += 1 
  return count

donhatch commented 2 months ago

Even if it turns out that canonical_reordering_sign_euclidean() is not typically a bottleneck in clifford programs that people use this library for, I think there would still be benefit to implementing this function in this way, so that it serves as a nice example of the optimal (as far as I know) algorithm for this interesting and important sub-problem.

donhatch commented 2 months ago

I wrote:

(sorry, I haven't tested it; I haven't learned how to build and test yet)

Ok, I seem to be up and running, using the sequence of workarounds described here: https://github.com/pygae/clifford/issues/430#issuecomment-2370177355

I'll be happy to make a PR for this if there's interest.

donhatch commented 2 months ago

I did some tracing and timing.

It turns out this function isn't called a lot during multivector multiplications like I thought it would be.

Instead, I see that it's called a lot on the first multiplication after the clifford algebra is created. At that time, the entire multiplication table gets created, in _numba_construct_gmt(). E.g. for n=10 dimensions, this function is called 2¹⁰ * 2¹⁰ = 1048576 times. And the average number of bits participating, over all those calls, is probably something like (n-2)*0.75 (not 1 or 2, which, I suspect, is typical for actual multiplications performed in actual programs, e.g. if they are manipulating only scalars and vectors and bivectors).

So, not a lot of savings, given how this function is currently called.

However, I see in issue https://github.com/pygae/clifford/issues/3 that there might be plans to change the initialization strategy so that it does not create the entire multiplication table. If/when that happens, I suspect optimization of canonical_reordering_sign_euclidean() may look like more of a win than it does now.

donhatch commented 2 months ago

One more observation...

I suspect typical real-life programs use grades 0,1,2, slightly less commonly n-2,n-1,n, and rarely anything in between.... does that sound right? I think my suggested rewrite of this function will be very fast when the operands have grades (i.e. numbers of bits set) 0,1,2, but pretty slow for n-2,n-1,n, when n is relatively large.

But, I bet, with a little thought, the slow n-2,n-1,n cases can be transformed into the fast 0,1,2 cases, by taking complements of the bit patterns, or something close to that. And I bet, with some more thought, it's even possible to make it fast when one of the two grades is in {0,1,2} and the other is in {n-2,n-1,n}.

donhatch commented 1 month ago

I thought of another possible implementation. This one wins when the bitmaps are big and dense on average (as they are during initialization of big layouts, currently). It's probably not as fast as my first proposal when the two input bitmaps have few bits set, though.

def xor_all_shifts(n):
  """ Returns n ^ (n>>1) ^ (n>>2) ^ (n>>3) ^ ... """
  shift = 1
  while (n_shifted := n >> shift) != 0:
    n ^= n_shifted
    shift *= 2
  return n

def canonical_reordering_sign_euclidean(bitmap_a, bitmap_b):
    """
    Computes the sign for the product of bitmap_a and bitmap_b
    assuming a euclidean metric
    """
    return ((-1) **
            (xor_all_shifts((xor_all_shifts(bitmap_a >> 1) & bitmap_b)) & 1))

hugohadfield commented 1 month ago

This is really interesting work @donhatch thanks for the suggestions. In this package we do indeed pre-calculate all the sign changes during algebra initialisation. As you have pointed out however it is definitely not the only way to do it, and indeed for very large algebras where one might want to do sparse calculations it might make more sense to do the canonical reordering in the actual multiplication function directly. One of the problems with sparse algebra stuff is that often numerical errors build up in the bits that "should be 0" and so calculations becomes progressively more dense over time anyway unless a bit of care is taken to trim/clean your multivectors. I'd be super keen to work on integrating this stuff into clifford and i'm sure some of the other packages which implement GA would be interested too.

I have a few days this week to work on clifford stuff :) Do you happen to be on the bivector discord, could discuss a bit there? (I'm hugohadfield there)

pygae / clifford

canonical_reordering_sign_euclidean() could be made faster #433