Open jbrockmendel opened 4 years ago
cc @seberg any idea why this is so much slower in the not-short-circuit case?
@jbrockmendel my guess is: we are using SIMD loops in this case (at least sometimes), and Cython probably does not manage to do that.
Are you sure what you are doing is a good idea, though? First, -0 and 0 normally compare equal. Second, there are many, many NaNs, and you are making some of them return True.
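To make the point about -0 and NaN concrete, here is a small sketch using only the Python standard library. It assumes the concern is about comparisons that look at raw bit patterns rather than values; the helper names `bits` and `from_bits` are made up for illustration and are not from NumPy, pandas, or the code under discussion.

```python
import math
import struct

def bits(x):
    """Return the 64-bit IEEE 754 pattern of a Python float as an integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def from_bits(pattern):
    """Build a Python float from a 64-bit IEEE 754 pattern (hypothetical helper)."""
    return struct.unpack("<d", struct.pack("<Q", pattern))[0]

# 0.0 and -0.0 compare equal by value, but their sign bits differ.
values_equal = 0.0 == -0.0
bits_equal = bits(0.0) == bits(-0.0)

# Two quiet NaNs built from different payloads: both are NaN, neither
# compares equal to the other (or to itself) by value.
nan_a = from_bits(0x7FF8000000000000)
nan_b = from_bits(0x7FF8000000000001)
```

So a value comparison and a bitwise comparison disagree on both cases: signed zeros are equal by value but not by bits, and NaNs are never equal by value while a bitwise check would only match NaNs that happen to share a payload.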
> We are using SIMD loops in this case (at least sometimes), and cython probably does not manage to do that.
@scoder thoughts?
> SIMD
Might make a difference here, yes. Make sure your CFLAGS allow for auto-vectorisation, and that your C compiler is modern enough to figure it out for you.
It sometimes helps the compiler to make your algorithm redundant, i.e. instead of just `a[i] != b[i]`, make sure the arrays are long enough, take two steps at once and write `a[i] != b[i] or a[i+1] != b[i+1]`, or even use the uglier `|` instead of `or` (try to avoid that if you don't need it). That way, the code makes it clear to the C compiler that it can safely read ahead far enough to use wider SIMD instructions.
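A rough sketch of the "two steps at once" loop shape, written as plain Python so it runs anywhere. In pure Python this gains nothing from SIMD; it only shows the structure one might write in a typed Cython loop. The function name `any_different_unrolled` is hypothetical.

```python
def any_different_unrolled(a, b):
    """Return True if a and b differ anywhere, checking two elements per step."""
    n = len(a)
    if n != len(b):
        raise ValueError("a and b must have the same length")

    i = 0
    # Main loop: both comparisons are always evaluated (| instead of or),
    # so each iteration reads a[i] and a[i + 1] unconditionally.
    while i + 1 < n:
        if (a[i] != b[i]) | (a[i + 1] != b[i + 1]):
            return True
        i += 2

    # Tail: one leftover element when the length is odd.
    if i < n and a[i] != b[i]:
        return True
    return False
```

Whether a C compiler actually vectorises the equivalent Cython loop depends on the compiler, the flags, and the element type, so it is worth measuring rather than assuming.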
In places like `equals` methods and `array_equivalent`, we do things like `(left == right).all()` or `((left == right) | (isna(left) & isna(right))).all()`. For large arrays that are not equal, we can do much better with something like:
Some profiling results:
So in cases that short-circuit early, we can get massive speedups, but this implementation is actually 2x slower in cases that don't short-circuit (for reasons that are not clear to me).
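For readers without the original snippet, here is a minimal pure-Python sketch of the short-circuit idea. It is not the implementation that was profiled; the names `equal_short_circuit` and `_both_nan` are assumptions, and the NaN handling mirrors the `(left == right) | (isna(left) & isna(right))` expression only for float NaN.

```python
import math

def _both_nan(x, y):
    """True when x and y are both float NaN (hypothetical helper)."""
    return (
        isinstance(x, float)
        and isinstance(y, float)
        and math.isnan(x)
        and math.isnan(y)
    )

def equal_short_circuit(left, right):
    """Element-wise equality that stops at the first mismatch.

    NaN in the same position on both sides counts as equal.
    """
    if len(left) != len(right):
        return False
    for x, y in zip(left, right):
        # Return as soon as one pair differs instead of building
        # a full boolean mask and reducing it with .all().
        if x != y and not _both_nan(x, y):
            return False
    return True
```

The trade-off matches the numbers reported above: an early mismatch ends the loop after a few elements, while a full pass pays for a branch on every element, which is one plausible reason a compiled version of such a loop could lose to a vectorised `==` followed by `.all()`.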