Closed heshpdx closed 3 months ago
I don't have a lot of context on the error here, but two points of context:
benchmarkPolygonToCellsExperimental
, which use our new polyfill algorithm, so it may not be worth a deep dive here.We see the same FP exception when running benchmarkPolygonToCellsExperimental
. Using a slightly modified patch from above...
+ if (exceptions & (FE_INEXACT))
+ printf("inexact j=%d\n", j);
+ else
+ printf("good j=%d\n", j);
Then looking at the results:
$ ./bin/benchmarkPolygonToCellsExperimental > output.log
$ wc -l output.log
35541009 output.log
$ sort output.log | grep good | uniq -c | sort -n
420 good j=-256
6120 good j=-8
9090 good j=-2
9090 good j=0
12120 good j=-1
27000 good j=-2048
30240 good j=-4
36360 good j=1
80560 good j=-4096
Powers of two look ok, everything else signals INEXACT. Maybe that provides a clue?
After some playing around, this patch makes the exceptions go away. But I have no idea if it remains precise enough. Casting to float j
does not fix it. Casting double i
is not part of the fix, but I added it for consistency.
diff --git a/src/h3lib/lib/coordijk.c b/src/h3lib/lib/coordijk.c
index 7f9ff7cb..698baf47 100644
--- a/src/h3lib/lib/coordijk.c
+++ b/src/h3lib/lib/coordijk.c
@@ -153,11 +154,22 @@ void _hex2dToCoordIJK(const Vec2d *v, CoordIJK *h) {
* @param v The 2D cartesian coordinates of the hex center point.
*/
void _ijkToHex2d(const CoordIJK *h, Vec2d *v) {
- int i = h->i - h->k;
- int j = h->j - h->k;
+ double i = h->i - h->k;
+ double j = h->j - h->k;
v->x = i - 0.5 * j;
- v->y = j * M_SQRT3_2;
+ v->y = j * (float)M_SQRT3_2;
I did some reading, and the reason why power-of-two is not raising inexact is because those just increment or decrement the exponent.
But also, this exception being raised is normal. According to this comment
_EM_INEXACT is raised too frequently to be of any practical use due to the nature of floating point arithmetic, although it can be informative if trying to track down imprecise results in some situations.
In that same response, it's implied that a _control_fp
call can let you fully disable the exception flag. I think setting _MCW_EM
to 0 will disable all error flags, but the documentation isn't super clear what will happen when you do that.
For some reason your second comment didn't show up while I was writing my response.
Casting M_SQRT3_2
to a float halves it's precision, so that's definitely not something we can do.
From what I have read INEXACT
is an expected result for floating point calculations, because the mantissa is being rounded in most operations (even the simple 0.1 + 0.2 = 0.300...04
example often shared).
If setting that flag in the kernel is taking so much time, it's probably better to just disable it during this code and then reset it back to normal after the code is finished running?
_control_fp
is Microsoft specific. I tried fedisableexcept(FE_ALL_EXCEPT)
which is the C standard version and can be found with #include <fenv.h>
. However this didn't change anything and I still see time spent in the kernel handler. It is interesting that of all the computations done in h3, this one line is triggering the exception. What is so special about it?
I'm wondering if anyone else can reproduce the high time spent in kernel. I ran perf record
on an Ampere ARM platform running Ubuntu 22.
So, what is going on here is, that soft FP routines are called, where the new FP flag states have to be calculated along with the result of the operation.
To get more insights, I've created the following C program:
#define M_SQRT3_2 0.8660254037844386467637231707529361834714L
typedef struct {
int i;
int j;
int k;
} CoordIJK;
typedef struct {
double x;
double y;
} Vec2d;
void _ijkToHex2d_orig(const CoordIJK *h, Vec2d *v) {
int i = h->i - h->k;
int j = h->j - h->k;
v->x = i - 0.5 * j;
v->y = j * M_SQRT3_2;
}
void _ijkToHex2d_fast(const CoordIJK *h, Vec2d *v) {
double i = h->i - h->k;
double j = h->j - h->k;
v->x = i - 0.5 * j;
v->y = j * (float)M_SQRT3_2;
}
Compiled with gcc -O3 -c test.c -S
(GCC 14.1) gives us the following:
_ijkToHex2d_orig:
stp x29, x30, [sp, -32]!
fmov d30, 5.0e-1
mov x29, sp
ldr w2, [x0]
ldr w3, [x0, 8]
str x19, [sp, 16]
mov x19, x1
ldr w1, [x0, 4]
sub w0, w1, w3
sub w1, w2, w3
scvtf d29, w0
scvtf d31, w1
fmsub d31, d29, d30, d31
str d31, [x19]
bl __floatsitf
adrp x0, .LC1
add x0, x0, :lo12:.LC1
ldr q1, [x0]
bl __multf3
bl __trunctfdf2
str d0, [x19, 8]
ldr x19, [sp, 16]
ldp x29, x30, [sp], 32
ret
_ijkToHex2d_fast:
adrp x2, .LC2
fmov d0, 5.0e-1
ldr d29, [x2, #:lo12:.LC2]
ldp w2, w3, [x0, 4]
ldr w0, [x0]
sub w2, w2, w3
sub w0, w0, w3
scvtf d31, w2
scvtf d30, w0
fmsub d0, d31, d0, d30
fmul d29, d31, d29
stp d0, d29, [x1]
ret
Above, we can see that GCC inserts calls to soft FP routines, which will result in calls to __sfp_handle_exceptions
in libgcc.
This behavior is not seen on x86-64, so it looks like a limitation of the AArch64 code generation because of no matching FP instructions. And indeed, if building with -mcpu=generic-armv9-a
, then the soft FP calls disappear:
_ijkToHex2d_orig:
adrp x2, .LC0
fmov d0, 5.0e-1
ldr d29, [x2, #:lo12:.LC0]
ldp w2, w3, [x0, 4]
ldr w0, [x0]
sub w2, w2, w3
sub w0, w0, w3
scvtf d31, w2
scvtf d30, w0
fmsub d0, d31, d0, d30
fmul d29, d31, d29
stp d0, d29, [x1]
ret
Given that this sequence is identical to the _ijkToHex2d_fast
sequence above, there might be a limitation in the GCC AArch64 backend, which prevents some optimization.
Is this still the behavior when M_SQRT3_2
does NOT have an L
at the end? We did just remove long double from the source, and I believe I am testing the new code:
https://github.com/uber/h3/blob/master/src/h3lib/include/constants.h#L44
Yes, the L
can be seen in my code at the top of my comment (I somehow missed #852 - sorry for that).
And yes, removing the L
from the FP literal of M_SQRT3_2
removes the soft FP calls as well.
So, you say that even without the L
, you can see the __sfp_handle_exceptions
calls?
And your changes to _ijkToHex2d
did remove them?
Or is this what you are currently testing?
Thanks @cmuellner , I think I had my wires crossed. The latest mainline looks much better for benchmarkPolygonToCells
:
Even though FP_INEXACT is signaled, it is not handled. x86 and aarch64 match, so this is great.
I noticed that
bin/benchmarkPolygonToCells
is taking up a lot of time in the OS kernel: about 15% using either GCC or LLVM. After some debug, I found that the floating point exception handler is being called. Here is a "perf report" from GCC where the __sfp_handle_exceptions call is highlighted. Check out the kernel function at the top.Can you share info on the intent of this benchmark? Do we want it exercising the denormal flows? Are the exceptions an artifact of the benchmark data, or is there an issue with the code? That is, do we see this behavior live in production?
The handler being triggered is FP_INEXACT, and I see it on both aarch64 and x86. The fenv(3) man page tells us:
Interestingly, overflow or underflow is not occurring here. Below is a quick patch to showcase which exception is being handled and where it occurs (
v->y = j * M_SQRT3_2
). From what I can tell, that is the only operation which throws an exception, and it always throws FP_INEXACT and nothing else.@rohitkprasad123