Open · donhatch opened this issue 2 months ago
Even if it turns out that `canonical_reordering_sign_euclidean()` is not typically a bottleneck in the programs people use this library for, I think there would still be benefit to implementing the function this way, so that it serves as a nice example of the optimal (as far as I know) algorithm for this interesting and important sub-problem.
I wrote:

> (sorry, I haven't tested it; I haven't learned how to build and test yet)
Ok, I seem to be up and running, using the sequence of workarounds described here: https://github.com/pygae/clifford/issues/430#issuecomment-2370177355
I'll be happy to make a PR for this if there's interest.
I did some tracing and timing.
It turns out this function isn't called as heavily during multivector multiplications as I thought it would be.
Instead, I see that it's called a lot on the first multiplication after the algebra is created. At that point the entire multiplication table gets built, in `_numba_construct_gmt()`. E.g. for n=10 dimensions, this function is called 2**10 * 2**10 = 1048576 times. And the average number of bits participating, over all those calls, is probably something like (n-2)*0.75 (not 1 or 2, which I suspect is typical of the actual multiplications performed in real programs, e.g. if they manipulate only scalars, vectors, and bivectors).
So, not a lot of savings, given how this function is currently called.
However, I see in issue https://github.com/pygae/clifford/issues/3 that there might be plans to change the initialization strategy so that it does not create the entire multiplication table. If/when that happens, I suspect optimization of canonical_reordering_sign_euclidean() may look like more of a win than it does now.
One more observation...
I suspect typical real-life programs use grades 0, 1, 2, slightly less commonly n-2, n-1, n, and rarely anything in between... does that sound right? I think my suggested rewrite of this function will be very fast when the operands have grades (i.e. numbers of bits set) 0, 1, 2, but pretty slow for n-2, n-1, n when n is relatively large.
But, I bet, with a little thought, the slow n-2,n-1,n cases can be transformed into the fast 0,1,2 cases, by taking complements of the bit patterns, or something close to that. And I bet, with some more thought, it's even possible to make it fast when one of the two grades is in {0,1,2} and the other is in {n-2,n-1,n}.
I thought of another possible implementation. This one wins when the bitmaps are big and dense on average (as they are during initialization of big layouts, currently). It's probably not as fast as my first proposal when the two input bitmaps have few bits set, though.
```python
def xor_all_shifts(n):
    """Returns n ^ (n>>1) ^ (n>>2) ^ (n>>3) ^ ...

    Bit k of the result is the parity of the bits of n at positions >= k.
    Doubling the shift each iteration keeps the loop logarithmic in the
    bit length of n.
    """
    shift = 1
    while (n_shifted := n >> shift) != 0:
        n ^= n_shifted
        shift *= 2
    return n


def canonical_reordering_sign_euclidean(bitmap_a, bitmap_b):
    """
    Computes the sign for the product of bitmap_a and bitmap_b
    assuming a euclidean metric
    """
    # Bit j of xor_all_shifts(bitmap_a >> 1) is the parity of the number of
    # bits of bitmap_a strictly above position j; masking with bitmap_b keeps
    # only the positions that actually occur in bitmap_b, and the final
    # "& 1" after the outer xor_all_shifts takes the parity of the resulting
    # popcount, i.e. the parity of the total number of swaps.
    return (-1) ** (xor_all_shifts(xor_all_shifts(bitmap_a >> 1) & bitmap_b) & 1)
```
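As a quick sanity check (a minimal test sketch, not part of the original proposal), the parity trick can be compared against a brute-force count of the swap pairs over all small bitmaps:

```python
def brute_force_sign(bitmap_a, bitmap_b):
    """Directly count pairs (i set in bitmap_a, j set in bitmap_b) with i > j."""
    swaps = sum(
        1
        for i in range(bitmap_a.bit_length())
        for j in range(bitmap_b.bit_length())
        if (bitmap_a >> i) & 1 and (bitmap_b >> j) & 1 and i > j
    )
    return -1 if swaps & 1 else 1


# Exhaustive agreement check over every pair of 6-bit bitmaps.
assert all(
    canonical_reordering_sign_euclidean(a, b) == brute_force_sign(a, b)
    for a in range(64)
    for b in range(64)
)
```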
This is really interesting work @donhatch, thanks for the suggestions. In this package we do indeed pre-calculate all the sign changes during algebra initialisation. As you have pointed out, however, it is definitely not the only way to do it, and indeed for very large algebras where one might want to do sparse calculations it might make more sense to do the canonical reordering in the actual multiplication function directly. One of the problems with sparse algebra stuff is that numerical errors often build up in the bits that "should be 0", and so calculations become progressively denser over time anyway unless a bit of care is taken to trim/clean your multivectors. I'd be super keen to work on integrating this stuff into clifford, and I'm sure some of the other packages which implement GA would be interested too.
I have a few days this week to work on clifford stuff :) Do you happen to be on the bivector discord? We could discuss a bit there (I'm hugohadfield there).
It looks to me like `canonical_reordering_sign_euclidean()`'s running time is probably quadratic in the highest index of any bit set in `bitmap_a` (or maybe linear, if `count_set_bits()` is fast, i.e. if not `_utils.DISABLE_JIT`). I think this could be improved to be at most linear in the total number of set bits in the two input bitmaps (often a very small number like 1 or 2, I'm guessing), and in fact even smaller than that: it can be made linear in the number of bits that end up participating in swaps. (To be precise, it needs to iterate over only the bits that are set in the bitwise OR `bitmap_a | bitmap_b` whose indices are strictly higher than the lowest bit set in `bitmap_b` and strictly lower than the highest bit set in `bitmap_a`.)
I haven't looked at the call graph very closely, nor have I done any benchmarks, but I suspect this function is called heavily during multiplications of multivectors, so it might be worth optimizing.
For reference, the current implementation is this:
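(For context, the current approach is, roughly, a loop that right-shifts `bitmap_a` one position at a time and popcounts its overlap with `bitmap_b`; the sketch below paraphrases that shape and is not the exact code from the repository.)

```python
def count_set_bits(bitmap):
    """Stand-in popcount; the library provides its own helper of this name."""
    return bin(bitmap).count("1")


def canonical_reordering_sign_euclidean(bitmap_a, bitmap_b):
    """
    Computes the sign for the product of bitmap_a and bitmap_b
    assuming a euclidean metric
    """
    a = bitmap_a >> 1
    total = 0
    while a != 0:
        # One iteration per bit position up to the highest set bit of
        # bitmap_a, which is where the quadratic (or linear, with a fast
        # popcount) behaviour comes from.
        total += count_set_bits(a & bitmap_b)
        a = a >> 1
    return 1 if (total & 1) == 0 else -1
```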
I think the following rewrite would work (sorry, I haven't tested it; I haven't learned how to build and test yet):
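(The proposed rewrite isn't quoted here. One sketch of the idea described above, which walks only the set bits of `bitmap_a | bitmap_b` so the cost is proportional to the number of participating bits rather than to the highest bit index, could look like this; it illustrates the approach and is not necessarily the author's exact code.)

```python
def canonical_reordering_sign_euclidean(bitmap_a, bitmap_b):
    """
    Computes the sign for the product of bitmap_a and bitmap_b
    assuming a euclidean metric, visiting only the set bits.
    """
    # Each bit of bitmap_b contributes one swap for every bit of bitmap_a
    # strictly above it.  Walk the set bits of bitmap_a | bitmap_b from
    # lowest to highest, tracking how many bits of bitmap_a remain above
    # the current position.
    a_bits_above = bin(bitmap_a).count("1")
    swaps = 0
    combined = bitmap_a | bitmap_b
    while combined != 0:
        lowest = combined & -combined  # isolate the lowest remaining set bit
        if lowest & bitmap_a:
            a_bits_above -= 1          # this bit of bitmap_a is no longer "above"
        if lowest & bitmap_b:
            swaps += a_bits_above      # all remaining bits of bitmap_a are above it
        combined ^= lowest
    return 1 if (swaps & 1) == 0 else -1
```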
Also, I think the `_utils.DISABLE_JIT` implementation of the helper function `count_set_bits()` could be sped up pretty simply too (it's no longer needed in the above implementation of `canonical_reordering_sign_euclidean()`, but it's called by other things). That's currently this:
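(Presumably the fallback is a loop over every bit position up to the highest set bit, something of the following shape — a guess at the structure, not the repository's exact code.)

```python
def count_set_bits(bitmap):
    """Counts the number of bits set to 1 in bitmap.

    Naive version: examines every bit position up to the highest set bit.
    """
    count = 0
    while bitmap != 0:
        count += bitmap & 1
        bitmap >>= 1
    return count
```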
which could be rewritten as this:
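(The rewrite would presumably use the classic trick of clearing the lowest set bit on each iteration, so the loop runs once per set bit rather than once per bit position — again a sketch, not the actual proposed patch.)

```python
def count_set_bits(bitmap):
    """Counts the number of bits set to 1 in bitmap.

    bitmap & (bitmap - 1) clears the lowest set bit, so the loop body
    executes once per set bit.
    """
    count = 0
    while bitmap != 0:
        bitmap &= bitmap - 1
        count += 1
    return count
```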