[LibOS,Pal] Move to GCC atomic built-ins and fix memory orders

dimakuv commented 4 years ago

Description of the problem

We started moving from musl-borrowed atomic.h primitives to GCC atomic built-ins (https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html). We need to complete this transition. As part of this transition, we must:

Also switch from C11 atomics to GCC built-ins (thankfully, all modern compilers support GCC-style atomics)
Improve memory orders, e.g. relax __ATOMIC_SEQ_CST to the appropriate ones.

stefanberger commented 4 years ago

FYI: After PR #1574 I have to work with this patch now on ppc64 to make the spinlock testcase work reliably.

diff --git a/Pal/src/host/Linux/db_events.c b/Pal/src/host/Linux/db_events.c
index d53d3585..16837ec7 100644
--- a/Pal/src/host/Linux/db_events.c
+++ b/Pal/src/host/Linux/db_events.c
@@ -37,8 +37,8 @@ int _DkEventSet(PAL_HANDLE event, int wakeup) {
     if (event->event.isnotification) {
         // Leave it signaled, wake all
         uint32_t t = 0;
-        if (__atomic_compare_exchange_n(&event->event.signaled, &t, 1, /*weak=*/true,
-                                        __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
+        if (__atomic_compare_exchange_n(&event->event.signaled, &t, 1, /*weak=*/false,
+                                        __ATOMIC_SEQ_CST, __ATOMIC_RELAXED)) {
             int nwaiters = atomic_read(&event->event.nwaiters);
             if (nwaiters) {
                 if (wakeup != -1 && nwaiters > wakeup)
diff --git a/Pal/src/host/Linux/db_mutex.c b/Pal/src/host/Linux/db_mutex.c
index facde2df..be22b100 100644
--- a/Pal/src/host/Linux/db_mutex.c
+++ b/Pal/src/host/Linux/db_mutex.c
@@ -81,8 +81,8 @@ int _DkMutexLockTimeout(struct mutex_handle* m, int64_t timeout_us) {
      * the timeout.*/
     for (i = 0; i < iterations; i++) {
         uint32_t t = MUTEX_UNLOCKED;
-        if (__atomic_compare_exchange_n(&m->locked, &t, MUTEX_LOCKED, /*weak=*/true,
-                                        __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
+        if (__atomic_compare_exchange_n(&m->locked, &t, MUTEX_LOCKED, /*weak=*/false,
+                                        __ATOMIC_SEQ_CST, __ATOMIC_RELAXED))
             goto success;
         CPU_RELAX();
     }
@@ -97,8 +97,8 @@ int _DkMutexLockTimeout(struct mutex_handle* m, int64_t timeout_us) {

     while (true) {
         uint32_t t = MUTEX_UNLOCKED;
-        if (__atomic_compare_exchange_n(&m->locked, &t, MUTEX_LOCKED, /*weak=*/true,
-                                        __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
+        if (__atomic_compare_exchange_n(&m->locked, &t, MUTEX_LOCKED, /*weak=*/false,
+                                        __ATOMIC_SEQ_CST, __ATOMIC_RELAXED))
             break;

         struct timespec waittime, *waittimep = NULL;
-- 
2.17.1

stefanberger commented 4 years ago

Actually only changing the 'weak' parameter also works, otherwise it does get stuck occasionally:

diff --git a/Pal/src/host/Linux/db_events.c b/Pal/src/host/Linux/db_events.c
index d53d3585..780d6731 100644
--- a/Pal/src/host/Linux/db_events.c
+++ b/Pal/src/host/Linux/db_events.c
@@ -37,7 +37,7 @@ int _DkEventSet(PAL_HANDLE event, int wakeup) {
     if (event->event.isnotification) {
         // Leave it signaled, wake all
         uint32_t t = 0;
-        if (__atomic_compare_exchange_n(&event->event.signaled, &t, 1, /*weak=*/true,
+        if (__atomic_compare_exchange_n(&event->event.signaled, &t, 1, /*weak=*/false,
                                         __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
             int nwaiters = atomic_read(&event->event.nwaiters);
             if (nwaiters) {
diff --git a/Pal/src/host/Linux/db_mutex.c b/Pal/src/host/Linux/db_mutex.c
index facde2df..9de4f6ab 100644
--- a/Pal/src/host/Linux/db_mutex.c
+++ b/Pal/src/host/Linux/db_mutex.c
@@ -81,7 +81,7 @@ int _DkMutexLockTimeout(struct mutex_handle* m, int64_t timeout_us) {
      * the timeout.*/
     for (i = 0; i < iterations; i++) {
         uint32_t t = MUTEX_UNLOCKED;
-        if (__atomic_compare_exchange_n(&m->locked, &t, MUTEX_LOCKED, /*weak=*/true,
+        if (__atomic_compare_exchange_n(&m->locked, &t, MUTEX_LOCKED, /*weak=*/false,
                                         __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
             goto success;
         CPU_RELAX();
@@ -97,7 +97,7 @@ int _DkMutexLockTimeout(struct mutex_handle* m, int64_t timeout_us) {

     while (true) {
         uint32_t t = MUTEX_UNLOCKED;
-        if (__atomic_compare_exchange_n(&m->locked, &t, MUTEX_LOCKED, /*weak=*/true,
+        if (__atomic_compare_exchange_n(&m->locked, &t, MUTEX_LOCKED, /*weak=*/false,
                                         __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
             break;

mkow commented 4 years ago

Also switch from C11 atomics to GCC built-ins (thankfully, all modern compilers support GCC-style atomics)

I'm still not convinced about using them in places where we aren't forced to (i.e. where we don't use underlying variables also for futex()). #1510 radicalized me on this, the code looks hardly readable in places where we use builtin atomics.

Improve memory orders, e.g. relax __ATOMIC_SEQ_CST to the appropriate ones.

I don't think we have to do this unless we have a benchmark showing that a particular component is noticeably slow. Doing this correctly is very hard and we're risking correctness issues, with most likely negligible speedups.

stefanberger commented 4 years ago

I don't think we have to do this unless we have a benchmark showing that a particular component is noticeably slow. Doing this correctly is very hard and we're risking correctness issues, with most likely negligible speedups.

... and this is why the series of patches I sent today takes a conservative approach with the memory orders. It's quite a few patches with possibly individual decisions about better memory order to make, so from this perspective PR #1601 could be a pain.

dimakuv commented 4 years ago

I don't think we have to do this unless we have a benchmark showing that a particular component is noticeably slow. Doing this correctly is very hard and we're risking correctness issues, with most likely negligible speedups.

True. Every time I think I understand memory orders, life proves me wrong... Agreed that in 99% of code snippets in Graphene the additional memory fences (inserted by SEQ_CST) do not harm performance.

dimakuv commented 4 years ago

Actually only changing the 'weak' parameter also works, otherwise it does get stuck occasionally

@stefanberger, do you know why this happens? I never understood this weak parameter, tbh. IIUC, on x86 it is meaningless. And I know nothing about other architectures...

stefanberger commented 4 years ago

The difference in the generated code is one bne at the right spot that is missing in the weak version (causing 'spurious faults'). I had such a bne also in the assembly code that loops over the reading and writing of the data and indicates that some memory access reservation didn't hold, so the op needs to be repeated.

        if (__atomic_compare_exchange_n(&event->event.signaled, &t, 1, /*weak=*/false,
   2d82c:       90 00 3f e9     ld      r9,144(r31)
   2d830:       08 00 09 39     addi    r8,r9,8
   2d834:       88 00 3f 39     addi    r9,r31,136
   2d838:       00 00 c9 80     lwz     r6,0(r9)
   2d83c:       01 00 e0 38     li      r7,1
   2d840:       28 40 40 7d     lwarx   r10,0,r8
   2d844:       00 30 8a 7f     cmpw    cr7,r10,r6
   2d848:       14 00 9e 40     bne     cr7,2d85c <_DkEventSet+0x78>
   2d84c:       2d 41 e0 7c     stwcx.  r7,0,r8
   2d850:       00 00 80 4f     mcrf    cr7,cr0
   2d854:       ec ff 9e 40     bne     cr7,2d840 <_DkEventSet+0x5c>  <----  THIS IS THE DIFFERENCE
   2d858:       2c 01 00 4c     isync
   2d85c:       26 10 10 7d     mfocrf  r8,1
   2d860:       fe ff 08 55     rlwinm  r8,r8,31,31,31
   2d864:       00 00 88 2f     cmpwi   cr7,r8,0
   2d868:       08 00 9e 40     bne     cr7,2d870 <_DkEventSet+0x8c>
   2d86c:       00 00 49 91     stw     r10,0(r9)
   2d870:       20 00 09 79     clrldi  r9,r8,32
   2d874:       00 00 a9 2f     cmpdi   cr7,r9,0
   2d878:       4c 01 9e 41     beq     cr7,2d9c4 <_DkEventSet+0x1e0>

        if (__atomic_compare_exchange_n(&event->event.signaled, &t, 1, /*weak=*/true,
   2d82c:       90 00 3f e9     ld      r9,144(r31)
   2d830:       08 00 09 39     addi    r8,r9,8
   2d834:       88 00 3f 39     addi    r9,r31,136
   2d838:       00 00 c9 80     lwz     r6,0(r9)
   2d83c:       01 00 e0 38     li      r7,1
   2d840:       28 40 40 7d     lwarx   r10,0,r8
   2d844:       00 30 8a 7f     cmpw    cr7,r10,r6
   2d848:       10 00 9e 40     bne     cr7,2d858 <_DkEventSet+0x74>
   2d84c:       2d 41 e0 7c     stwcx.  r7,0,r8
   2d850:       00 00 80 4f     mcrf    cr7,cr0
   2d854:       2c 01 00 4c     isync
   2d858:       26 10 10 7d     mfocrf  r8,1
   2d85c:       fe ff 08 55     rlwinm  r8,r8,31,31,31
   2d860:       00 00 88 2f     cmpwi   cr7,r8,0
   2d864:       08 00 9e 40     bne     cr7,2d86c <_DkEventSet+0x88>
   2d868:       00 00 49 91     stw     r10,0(r9)
   2d86c:       20 00 09 79     clrldi  r9,r8,32
   2d870:       00 00 a9 2f     cmpdi   cr7,r9,0
   2d874:       4c 01 9e 41     beq     cr7,2d9c0 <_DkEventSet+0x1dc>

mkow commented 4 years ago

Another reason why not to move to GCC atomic built-ins: introducing memory corruptions is very easy when using them, as we saw at least 2 times in PRs using them (#1601 and one older, don't remember the number now. ofc not blaming the author, this is GCC's insanity, or maybe even a bug).

The problem is that things like __atomic_compare_exchange_n are declared as __atomic_compare_exchange_n (type *ptr, type *expected, type desired, [...]), but the type is not actually checked, and if arguments have different types the code compiles with a memory corruption and without any warnings.

PoC: https://godbolt.org/z/TxZhYo - cmpxchg should succeed there, but is always failing because of this bug. Take a look at the generated assembly to see what I mean. Also, changing unrelated_variable to 0 changes the result.

stefanberger commented 4 years ago

@mkow This type thing is obviously misleading. What would be safer would be a static assert on the size of the types but we'd have to wrap the function in a macro.

stefanberger commented 4 years ago

#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#define __USE_ISOC11
#include <assert.h>

#define __ATOMIC_COMPARE_EXCHANGE_N(PTR, EXPECTED_PTR, DESIRED, WEAK, SUCC_MO, FAIL_MO) \
({\
    __typeof(*PTR) ret;\
    __typeof(*PTR) des = DESIRED; \
    static_assert(sizeof(*PTR) == sizeof(*EXPECTED_PTR), ""); \
    ret = __atomic_compare_exchange_n(PTR, EXPECTED_PTR, des, WEAK, SUCC_MO, FAIL_MO);\
    ret;\
})

int main()
{
    long int x = 0;
    long int t = 0;
    int unrelated_variable = 1;
    __atomic_load_n(&unrelated_variable, __ATOMIC_SEQ_CST); // just to stop optimizer from erasing `unrelated_variable`
    if (__ATOMIC_COMPARE_EXCHANGE_N(&x, &t, 1, /*weak=*/false, __ATOMIC_SEQ_CST, __ATOMIC_RELAXED)) {
        printf("x = %" PRIu64 "\n", x);
    } else {
        puts("FAIL?");
    }
    return 0;
}

dimakuv commented 4 years ago

@mkow Very good argument. I didn't expect GCC to be so dumb... So with C11 atomics, such things will be caught as compile-time errors?

mkow commented 4 years ago

@dimakuv yes. See https://godbolt.org/z/y8o-A_:

#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdatomic.h>

int main()
{
    uint64_t x = 0;
    uint32_t t = 0;
    int unrelated_variable = 0x1337;
    __atomic_load_n(&unrelated_variable, __ATOMIC_SEQ_CST); // just to stop optimizer from erasing `unrelated_variable`
    if (atomic_compare_exchange_strong_explicit(&x, &t, 1, memory_order_seq_cst, memory_order_relaxed)) {
        printf("x = %" PRIu64 "\n", x);
    } else {
        puts("FAIL?");
    }
    return 0;
}

Output:

<source>: In function 'main':

<source>:13:9: error: size mismatch in argument 2 of '__atomic_compare_exchange'

   13 |     if (atomic_compare_exchange_strong_explicit(&x, &t, 1, memory_order_seq_cst, memory_order_relaxed)) {

      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

btw. as you see, C11 atomics also work on native integer types, so, I think we could use them together with futexes (just using int + atomic_compare_exchange_strong_explicit in such cases instead of atomic_int).

mkow commented 4 years ago

btw. found that other PR with other occurrence of this bug, which even got to master: https://github.com/oscarlab/graphene/pull/1354

dimakuv commented 3 years ago

So apparently we need to use C11 atomics. Closing this issue.

mkow commented 3 years ago

Please don't close issues this way ;) We need to open a new issue, but about C11 atomics. So, I just opened #2576 for this.

gramineproject / graphene

[LibOS,Pal] Move to GCC atomic built-ins and fix memory orders #1593

Description of the problem