FP exception handler triggered on both x86 and aarch64

heshpdx commented 4 months ago

I noticed that bin/benchmarkPolygonToCells is taking up a lot of time in the OS kernel: about 15% using either GCC or LLVM. After some debug, I found that the floating point exception handler is being called. Here is a "perf report" from GCC where the __sfp_handle_exceptions call is highlighted. Check out the kernel function at the top.

Can you share info on the intent of this benchmark? Do we want it exercising the denormal flows? Are the exceptions an artifact of the benchmark data, or is there an issue with the code? That is, do we see this behavior live in production?

The handler being triggered is FP_INEXACT, and I see it on both aarch64 and x86. The fenv(3) man page tells us:

The inexact exception occurs when the rounded result of an operation is not equal to the infinite precision result. It may occur whenever overflow or underflow occurs.

Interestingly, overflow or underflow is not occurring here. Below is a quick patch to showcase which exception is being handled and where it occurs (v->y = j * M_SQRT3_2). From what I can tell, that is the only operation which throws an exception, and it always throws FP_INEXACT and nothing else.

diff --git a/src/h3lib/lib/coordijk.c b/src/h3lib/lib/coordijk.c
index 7f9ff7cb..34969c59 100644
--- a/src/h3lib/lib/coordijk.c
+++ b/src/h3lib/lib/coordijk.c
@@ -29,6 +29,7 @@
 #include "h3Assert.h"
 #include "latLng.h"
 #include "mathExtensions.h"
+#include <fenv.h>

 #define INT32_MAX_3 (INT32_MAX / 3)

@@ -157,7 +158,18 @@ void _ijkToHex2d(const CoordIJK *h, Vec2d *v) {
     int j = h->j - h->k;

     v->x = i - 0.5 * j;
+    feclearexcept(FE_ALL_EXCEPT);
     v->y = j * M_SQRT3_2;
+
+    int exceptions = fetestexcept(FE_ALL_EXCEPT);
+    if (FE_UNDERFLOW & exceptions)
+        printf("underflow\n");
+    if (FE_OVERFLOW & exceptions)
+        printf("overflow\n");
+    if (FE_INVALID & exceptions)
+        printf("invalid\n");
+    if (FE_INEXACT & exceptions)
+        printf("inexact\n");
 }

@rohitkprasad123

nrabinowitz commented 4 months ago

I don't have a lot of context on the error here, but two points of context:

This benchmark is not intended to have errors, and looking at the code I'm surprised that it would be subject to out-of-bounds cases - the polygons and resolutions we use are fairly straightforward.
We will be replacing this benchmark shortly (hopefully in the next release) with the benchmarks in benchmarkPolygonToCellsExperimental, which use our new polyfill algorithm, so it may not be worth a deep dive here.

heshpdx commented 4 months ago

We see the same FP exception when running benchmarkPolygonToCellsExperimental. Using a slightly modified patch from above...

+    if (exceptions & (FE_INEXACT))
+        printf("inexact j=%d\n", j);
+    else
+        printf("good    j=%d\n", j);

Then looking at the results:

$ ./bin/benchmarkPolygonToCellsExperimental > output.log

$ wc -l output.log
35541009 output.log

$ sort output.log | grep good | uniq -c | sort -n
    420 good    j=-256
   6120 good    j=-8
   9090 good    j=-2
   9090 good    j=0
  12120 good    j=-1
  27000 good    j=-2048
  30240 good    j=-4
  36360 good    j=1
  80560 good    j=-4096

Powers of two look ok, everything else signals INEXACT. Maybe that provides a clue?

heshpdx commented 4 months ago

After some playing around, this patch makes the exceptions go away. But I have no idea if it remains precise enough. Casting to float j does not fix it. Casting double i is not part of the fix, but I added it for consistency.

diff --git a/src/h3lib/lib/coordijk.c b/src/h3lib/lib/coordijk.c
index 7f9ff7cb..698baf47 100644
--- a/src/h3lib/lib/coordijk.c
+++ b/src/h3lib/lib/coordijk.c
@@ -153,11 +154,22 @@ void _hex2dToCoordIJK(const Vec2d *v, CoordIJK *h) {
  * @param v The 2D cartesian coordinates of the hex center point.
  */
 void _ijkToHex2d(const CoordIJK *h, Vec2d *v) {
-    int i = h->i - h->k;
-    int j = h->j - h->k;
+    double i = h->i - h->k;
+    double j = h->j - h->k;

     v->x = i - 0.5 * j;
-    v->y = j * M_SQRT3_2;
+    v->y = j * (float)M_SQRT3_2;

dfellis commented 4 months ago

I did some reading, and the reason why power-of-two is not raising inexact is because those just increment or decrement the exponent.

But also, this exception being raised is normal. According to this comment

_EM_INEXACT is raised too frequently to be of any practical use due to the nature of floating point arithmetic, although it can be informative if trying to track down imprecise results in some situations.

In that same response, it's implied that a _control_fp call can let you fully disable the exception flag. I think setting _MCW_EM to 0 will disable all error flags, but the documentation isn't super clear what will happen when you do that.

dfellis commented 4 months ago

For some reason your second comment didn't show up while I was writing my response.

Casting M_SQRT3_2 to a float halves it's precision, so that's definitely not something we can do.

From what I have read INEXACT is an expected result for floating point calculations, because the mantissa is being rounded in most operations (even the simple 0.1 + 0.2 = 0.300...04 example often shared).

If setting that flag in the kernel is taking so much time, it's probably better to just disable it during this code and then reset it back to normal after the code is finished running?

heshpdx commented 4 months ago

_control_fp is Microsoft specific. I tried fedisableexcept(FE_ALL_EXCEPT) which is the C standard version and can be found with #include <fenv.h>. However this didn't change anything and I still see time spent in the kernel handler. It is interesting that of all the computations done in h3, this one line is triggering the exception. What is so special about it?

I'm wondering if anyone else can reproduce the high time spent in kernel. I ran perf record on an Ampere ARM platform running Ubuntu 22.

cmuellner commented 3 months ago

So, what is going on here is, that soft FP routines are called, where the new FP flag states have to be calculated along with the result of the operation.

To get more insights, I've created the following C program:

#define M_SQRT3_2 0.8660254037844386467637231707529361834714L

typedef struct {
    int i;
    int j;
    int k;
} CoordIJK;

typedef struct {
    double x;
    double y;
} Vec2d;

void _ijkToHex2d_orig(const CoordIJK *h, Vec2d *v) {
    int i = h->i - h->k;
    int j = h->j - h->k;

    v->x = i - 0.5 * j;
    v->y = j * M_SQRT3_2;
}

void _ijkToHex2d_fast(const CoordIJK *h, Vec2d *v) {
    double i = h->i - h->k;
    double j = h->j - h->k;

    v->x = i - 0.5 * j;
    v->y = j * (float)M_SQRT3_2;
}

Compiled with gcc -O3 -c test.c -S (GCC 14.1) gives us the following:

_ijkToHex2d_orig:
        stp     x29, x30, [sp, -32]!
        fmov    d30, 5.0e-1
        mov     x29, sp
        ldr     w2, [x0]
        ldr     w3, [x0, 8]
        str     x19, [sp, 16]
        mov     x19, x1
        ldr     w1, [x0, 4]
        sub     w0, w1, w3
        sub     w1, w2, w3
        scvtf   d29, w0
        scvtf   d31, w1
        fmsub   d31, d29, d30, d31
        str     d31, [x19]
        bl      __floatsitf
        adrp    x0, .LC1
        add     x0, x0, :lo12:.LC1
        ldr     q1, [x0]
        bl      __multf3
        bl      __trunctfdf2
        str     d0, [x19, 8]
        ldr     x19, [sp, 16]
        ldp     x29, x30, [sp], 32
        ret

_ijkToHex2d_fast:
        adrp    x2, .LC2
        fmov    d0, 5.0e-1
        ldr     d29, [x2, #:lo12:.LC2]
        ldp     w2, w3, [x0, 4]
        ldr     w0, [x0]
        sub     w2, w2, w3
        sub     w0, w0, w3
        scvtf   d31, w2
        scvtf   d30, w0
        fmsub   d0, d31, d0, d30
        fmul    d29, d31, d29
        stp     d0, d29, [x1]
        ret

Above, we can see that GCC inserts calls to soft FP routines, which will result in calls to __sfp_handle_exceptions in libgcc.

This behavior is not seen on x86-64, so it looks like a limitation of the AArch64 code generation because of no matching FP instructions. And indeed, if building with -mcpu=generic-armv9-a, then the soft FP calls disappear:

_ijkToHex2d_orig:
        adrp    x2, .LC0
        fmov    d0, 5.0e-1
        ldr     d29, [x2, #:lo12:.LC0]
        ldp     w2, w3, [x0, 4]
        ldr     w0, [x0]
        sub     w2, w2, w3
        sub     w0, w0, w3
        scvtf   d31, w2
        scvtf   d30, w0
        fmsub   d0, d31, d0, d30
        fmul    d29, d31, d29
        stp     d0, d29, [x1]
        ret

Given that this sequence is identical to the _ijkToHex2d_fast sequence above, there might be a limitation in the GCC AArch64 backend, which prevents some optimization.

heshpdx commented 3 months ago

Is this still the behavior when M_SQRT3_2 does NOT have an L at the end? We did just remove long double from the source, and I believe I am testing the new code: https://github.com/uber/h3/blob/master/src/h3lib/include/constants.h#L44

cmuellner commented 3 months ago

Yes, the L can be seen in my code at the top of my comment (I somehow missed #852 - sorry for that). And yes, removing the L from the FP literal of M_SQRT3_2 removes the soft FP calls as well.

So, you say that even without the L, you can see the __sfp_handle_exceptions calls? And your changes to _ijkToHex2d did remove them? Or is this what you are currently testing?

heshpdx commented 3 months ago

Thanks @cmuellner , I think I had my wires crossed. The latest mainline looks much better for benchmarkPolygonToCells:

Even though FP_INEXACT is signaled, it is not handled. x86 and aarch64 match, so this is great.

uber / h3

FP exception handler triggered on both x86 and aarch64 #891