Closed dimakuv closed 3 years ago
FYI: After PR #1574 I have to work with this patch now on ppc64 to make the spinlock testcase work reliably.
diff --git a/Pal/src/host/Linux/db_events.c b/Pal/src/host/Linux/db_events.c
index d53d3585..16837ec7 100644
--- a/Pal/src/host/Linux/db_events.c
+++ b/Pal/src/host/Linux/db_events.c
@@ -37,8 +37,8 @@ int _DkEventSet(PAL_HANDLE event, int wakeup) {
if (event->event.isnotification) {
// Leave it signaled, wake all
uint32_t t = 0;
- if (__atomic_compare_exchange_n(&event->event.signaled, &t, 1, /*weak=*/true,
- __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
+ if (__atomic_compare_exchange_n(&event->event.signaled, &t, 1, /*weak=*/false,
+ __ATOMIC_SEQ_CST, __ATOMIC_RELAXED)) {
int nwaiters = atomic_read(&event->event.nwaiters);
if (nwaiters) {
if (wakeup != -1 && nwaiters > wakeup)
diff --git a/Pal/src/host/Linux/db_mutex.c b/Pal/src/host/Linux/db_mutex.c
index facde2df..be22b100 100644
--- a/Pal/src/host/Linux/db_mutex.c
+++ b/Pal/src/host/Linux/db_mutex.c
@@ -81,8 +81,8 @@ int _DkMutexLockTimeout(struct mutex_handle* m, int64_t timeout_us) {
* the timeout.*/
for (i = 0; i < iterations; i++) {
uint32_t t = MUTEX_UNLOCKED;
- if (__atomic_compare_exchange_n(&m->locked, &t, MUTEX_LOCKED, /*weak=*/true,
- __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
+ if (__atomic_compare_exchange_n(&m->locked, &t, MUTEX_LOCKED, /*weak=*/false,
+ __ATOMIC_SEQ_CST, __ATOMIC_RELAXED))
goto success;
CPU_RELAX();
}
@@ -97,8 +97,8 @@ int _DkMutexLockTimeout(struct mutex_handle* m, int64_t timeout_us) {
while (true) {
uint32_t t = MUTEX_UNLOCKED;
- if (__atomic_compare_exchange_n(&m->locked, &t, MUTEX_LOCKED, /*weak=*/true,
- __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
+ if (__atomic_compare_exchange_n(&m->locked, &t, MUTEX_LOCKED, /*weak=*/false,
+ __ATOMIC_SEQ_CST, __ATOMIC_RELAXED))
break;
struct timespec waittime, *waittimep = NULL;
--
2.17.1
Actually only changing the 'weak' parameter also works, otherwise it does get stuck occasionally:
diff --git a/Pal/src/host/Linux/db_events.c b/Pal/src/host/Linux/db_events.c
index d53d3585..780d6731 100644
--- a/Pal/src/host/Linux/db_events.c
+++ b/Pal/src/host/Linux/db_events.c
@@ -37,7 +37,7 @@ int _DkEventSet(PAL_HANDLE event, int wakeup) {
if (event->event.isnotification) {
// Leave it signaled, wake all
uint32_t t = 0;
- if (__atomic_compare_exchange_n(&event->event.signaled, &t, 1, /*weak=*/true,
+ if (__atomic_compare_exchange_n(&event->event.signaled, &t, 1, /*weak=*/false,
__ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
int nwaiters = atomic_read(&event->event.nwaiters);
if (nwaiters) {
diff --git a/Pal/src/host/Linux/db_mutex.c b/Pal/src/host/Linux/db_mutex.c
index facde2df..9de4f6ab 100644
--- a/Pal/src/host/Linux/db_mutex.c
+++ b/Pal/src/host/Linux/db_mutex.c
@@ -81,7 +81,7 @@ int _DkMutexLockTimeout(struct mutex_handle* m, int64_t timeout_us) {
* the timeout.*/
for (i = 0; i < iterations; i++) {
uint32_t t = MUTEX_UNLOCKED;
- if (__atomic_compare_exchange_n(&m->locked, &t, MUTEX_LOCKED, /*weak=*/true,
+ if (__atomic_compare_exchange_n(&m->locked, &t, MUTEX_LOCKED, /*weak=*/false,
__ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
goto success;
CPU_RELAX();
@@ -97,7 +97,7 @@ int _DkMutexLockTimeout(struct mutex_handle* m, int64_t timeout_us) {
while (true) {
uint32_t t = MUTEX_UNLOCKED;
- if (__atomic_compare_exchange_n(&m->locked, &t, MUTEX_LOCKED, /*weak=*/true,
+ if (__atomic_compare_exchange_n(&m->locked, &t, MUTEX_LOCKED, /*weak=*/false,
__ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
break;
- Also switch from C11 atomics to GCC built-ins (thankfully, all modern compilers support GCC-style atomics)
I'm still not convinced about using them in places where we aren't forced to (i.e. where we don't use underlying variables also for futex()
). #1510 radicalized me on this, the code looks hardly readable in places where we use builtin atomics.
- Improve memory orders, e.g. relax __ATOMIC_SEQ_CST to the appropriate ones.
I don't think we have to do this unless we have a benchmark showing that a particular component is noticeably slow. Doing this correctly is very hard and we're risking correctness issues, with most likely negligible speedups.
I don't think we have to do this unless we have a benchmark showing that a particular component is noticeably slow. Doing this correctly is very hard and we're risking correctness issues, with most likely negligible speedups.
... and this is why the series of patches I sent today takes a conservative approach with the memory orders. It's quite a few patches with possibly individual decisions about better memory order to make, so from this perspective PR #1601 could be a pain.
I don't think we have to do this unless we have a benchmark showing that a particular component is noticeably slow. Doing this correctly is very hard and we're risking correctness issues, with most likely negligible speedups.
True. Every time I think I understand memory orders, life proves me wrong... Agreed that in 99% of code snippets in Graphene the additional memory fences (inserted by SEQ_CST) do not harm performance.
Actually only changing the 'weak' parameter also works, otherwise it does get stuck occasionally
@stefanberger, do you know why this happens? I never understood this weak
parameter, tbh. IIUC, on x86 it is meaningless. And I know nothing about other architectures...
The difference in the generated code is one bne
at the right spot that is missing in the weak version (causing 'spurious faults'). I had such a bne
also in the assembly code that loops over the reading and writing of the data and indicates that some memory access reservation didn't hold, so the op needs to be repeated.
if (__atomic_compare_exchange_n(&event->event.signaled, &t, 1, /*weak=*/false,
2d82c: 90 00 3f e9 ld r9,144(r31)
2d830: 08 00 09 39 addi r8,r9,8
2d834: 88 00 3f 39 addi r9,r31,136
2d838: 00 00 c9 80 lwz r6,0(r9)
2d83c: 01 00 e0 38 li r7,1
2d840: 28 40 40 7d lwarx r10,0,r8
2d844: 00 30 8a 7f cmpw cr7,r10,r6
2d848: 14 00 9e 40 bne cr7,2d85c <_DkEventSet+0x78>
2d84c: 2d 41 e0 7c stwcx. r7,0,r8
2d850: 00 00 80 4f mcrf cr7,cr0
2d854: ec ff 9e 40 bne cr7,2d840 <_DkEventSet+0x5c> <---- THIS IS THE DIFFERENCE
2d858: 2c 01 00 4c isync
2d85c: 26 10 10 7d mfocrf r8,1
2d860: fe ff 08 55 rlwinm r8,r8,31,31,31
2d864: 00 00 88 2f cmpwi cr7,r8,0
2d868: 08 00 9e 40 bne cr7,2d870 <_DkEventSet+0x8c>
2d86c: 00 00 49 91 stw r10,0(r9)
2d870: 20 00 09 79 clrldi r9,r8,32
2d874: 00 00 a9 2f cmpdi cr7,r9,0
2d878: 4c 01 9e 41 beq cr7,2d9c4 <_DkEventSet+0x1e0>
if (__atomic_compare_exchange_n(&event->event.signaled, &t, 1, /*weak=*/true,
2d82c: 90 00 3f e9 ld r9,144(r31)
2d830: 08 00 09 39 addi r8,r9,8
2d834: 88 00 3f 39 addi r9,r31,136
2d838: 00 00 c9 80 lwz r6,0(r9)
2d83c: 01 00 e0 38 li r7,1
2d840: 28 40 40 7d lwarx r10,0,r8
2d844: 00 30 8a 7f cmpw cr7,r10,r6
2d848: 10 00 9e 40 bne cr7,2d858 <_DkEventSet+0x74>
2d84c: 2d 41 e0 7c stwcx. r7,0,r8
2d850: 00 00 80 4f mcrf cr7,cr0
2d854: 2c 01 00 4c isync
2d858: 26 10 10 7d mfocrf r8,1
2d85c: fe ff 08 55 rlwinm r8,r8,31,31,31
2d860: 00 00 88 2f cmpwi cr7,r8,0
2d864: 08 00 9e 40 bne cr7,2d86c <_DkEventSet+0x88>
2d868: 00 00 49 91 stw r10,0(r9)
2d86c: 20 00 09 79 clrldi r9,r8,32
2d870: 00 00 a9 2f cmpdi cr7,r9,0
2d874: 4c 01 9e 41 beq cr7,2d9c0 <_DkEventSet+0x1dc>
Another reason why not to move to GCC atomic built-ins: introducing memory corruptions is very easy when using them, as we saw at least 2 times in PRs using them (#1601 and one older, don't remember the number now. ofc not blaming the author, this is GCC's insanity, or maybe even a bug).
The problem is that things like __atomic_compare_exchange_n
are declared as __atomic_compare_exchange_n (type *ptr, type *expected, type desired, [...])
, but the type
is not actually checked, and if arguments have different types the code compiles with a memory corruption and without any warnings.
PoC: https://godbolt.org/z/TxZhYo - cmpxchg should succeed there, but is always failing because of this bug. Take a look at the generated assembly to see what I mean. Also, changing unrelated_variable
to 0
changes the result.
@mkow This type
thing is obviously misleading. What would be safer would be a static assert on the size of the types but we'd have to wrap the function in a macro.
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#define __USE_ISOC11
#include <assert.h>
#define __ATOMIC_COMPARE_EXCHANGE_N(PTR, EXPECTED_PTR, DESIRED, WEAK, SUCC_MO, FAIL_MO) \
({\
__typeof(*PTR) ret;\
__typeof(*PTR) des = DESIRED; \
static_assert(sizeof(*PTR) == sizeof(*EXPECTED_PTR), ""); \
ret = __atomic_compare_exchange_n(PTR, EXPECTED_PTR, des, WEAK, SUCC_MO, FAIL_MO);\
ret;\
})
int main()
{
long int x = 0;
long int t = 0;
int unrelated_variable = 1;
__atomic_load_n(&unrelated_variable, __ATOMIC_SEQ_CST); // just to stop optimizer from erasing `unrelated_variable`
if (__ATOMIC_COMPARE_EXCHANGE_N(&x, &t, 1, /*weak=*/false, __ATOMIC_SEQ_CST, __ATOMIC_RELAXED)) {
printf("x = %" PRIu64 "\n", x);
} else {
puts("FAIL?");
}
return 0;
}
@mkow Very good argument. I didn't expect GCC to be so dumb... So with C11 atomics, such things will be caught as compile-time errors?
@dimakuv yes. See https://godbolt.org/z/y8o-A_:
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdatomic.h>
int main()
{
uint64_t x = 0;
uint32_t t = 0;
int unrelated_variable = 0x1337;
__atomic_load_n(&unrelated_variable, __ATOMIC_SEQ_CST); // just to stop optimizer from erasing `unrelated_variable`
if (atomic_compare_exchange_strong_explicit(&x, &t, 1, memory_order_seq_cst, memory_order_relaxed)) {
printf("x = %" PRIu64 "\n", x);
} else {
puts("FAIL?");
}
return 0;
}
Output:
<source>: In function 'main':
<source>:13:9: error: size mismatch in argument 2 of '__atomic_compare_exchange'
13 | if (atomic_compare_exchange_strong_explicit(&x, &t, 1, memory_order_seq_cst, memory_order_relaxed)) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
btw. as you see, C11 atomics also work on native integer types, so, I think we could use them together with futexes (just using int
+ atomic_compare_exchange_strong_explicit
in such cases instead of atomic_int
).
btw. found that other PR with other occurrence of this bug, which even got to master: https://github.com/oscarlab/graphene/pull/1354
So apparently we need to use C11 atomics. Closing this issue.
Please don't close issues this way ;) We need to open a new issue, but about C11 atomics. So, I just opened #2576 for this.
Description of the problem
We started moving from musl-borrowed
atomic.h
primitives to GCC atomic built-ins (https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html). We need to complete this transition. As part of this transition, we must:__ATOMIC_SEQ_CST
to the appropriate ones.