redis / redis

Redis is an in-memory database that persists on disk. The data model is key-value, but many different kind of values are supported: Strings, Lists, Sets, Sorted Sets, Hashes, Streams, HyperLogLogs, Bitmaps.
http://redis.io

No memory barrier after reading io_threads_pending #11152

Open hnhyzz opened 2 years ago

hnhyzz commented 2 years ago

https://github.com/redis/redis/blob/5895d119b1c2825ff0394f30e246e036c3972bc5/src/networking.c#L3091

On a CPU with a relaxed memory model, the later read of the list may be reordered before this read of the counter, so a stale list may be observed.
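
For readers of the thread, a minimal sketch of the pattern being described, assuming a main thread that spins on the per-thread pending counters and then walks the shared lists (the function name and the plain array types here are illustrative, not the actual Redis code):

```c
/* Simplified sketch of the reported pattern (illustrative, not the exact
 * Redis source; wait_for_io_threads is a made-up name). */
#define IO_THREADS_MAX_NUM 128

static unsigned long io_threads_pending[IO_THREADS_MAX_NUM]; /* plain, non-atomic here */
static void *io_threads_list[IO_THREADS_MAX_NUM];            /* per-thread job lists */

void wait_for_io_threads(int num_threads) {
    for (;;) {
        unsigned long pending = 0;
        for (int j = 1; j < num_threads; j++)
            pending += io_threads_pending[j];  /* load of the counter, no barrier */
        if (pending == 0) break;
    }
    /* On a weakly ordered CPU, loads from io_threads_list[] issued after this
     * loop could be satisfied before the counter loads above, so the main
     * thread could observe a stale list. */
    (void)io_threads_list;
}
```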

madolson commented 2 years ago

It looks like you are linking an older version of the code; it now looks like we do an atomicGet on that value, https://github.com/redis/redis/blob/3a16ad30b7a7918f036b23c6f5cf079c04baab05/src/networking.c#L4047. This should not be re-ordered.

hnhyzz commented 2 years ago

> It looks like you are linking an older version of the code; it now looks like we do an atomicGet on that value, https://github.com/redis/redis/blob/3a16ad30b7a7918f036b23c6f5cf079c04baab05/src/networking.c#L4047. This should not be re-ordered.

https://github.com/redis/redis/blob/e6f67092f8d4d81761a60c46011d1ff1dc3c2628/src/networking.c#L3518 I have checked the new version and found that atomicGetWithSync() is used to get the value of the counter. atomicGetWithSync() uses atomic_load_explicit with memory_order_seq_cst to load the value.

However, there is still no memory barrier: I disassembled the binary and found no mfence.

(Image 1: screenshot of the disassembly)
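
For context, the seq_cst helpers referenced above live in src/atomicvar.h; in the C11 stdatomic branch they look roughly like the following (the exact macro bodies may differ between Redis versions):

```c
/* Rough shape of the _Atomic/stdatomic branch of src/atomicvar.h
 * (paraphrased; check the tree for the exact definitions). */
#include <stdatomic.h>

#define atomicGetWithSync(var, dstvar) do {                        \
    dstvar = atomic_load_explicit(&var, memory_order_seq_cst);     \
} while (0)

#define atomicSetWithSync(var, value)                              \
    atomic_store_explicit(&var, value, memory_order_seq_cst)
```
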
madolson commented 2 years ago

I see. I think what you are proposing is practically possible, some parts of the list checking could be re-ordered before the atomic counter fetch and potentially contain some corrupt information. Practically we should be atomically sending the list to the background threads, not atomically updating an indicator counter.
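
One hedged reading of that suggestion, sketched with C11 atomics (the names inbox, publish_jobs, and take_jobs are illustrative, not Redis APIs): publish the whole list pointer with a release store and take it with an acquire exchange, so the handoff itself orders the list contents instead of relying on a separate counter.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Illustrative handoff: publish the list pointer atomically instead of
 * signalling through a separate pending counter. Not Redis code. */
typedef struct job_list { struct job_list *next; void *client; } job_list;

static _Atomic(job_list *) inbox = NULL;

/* Main thread: hand the prepared list to an I/O thread. The release store
 * makes every write that built the list visible before the pointer is seen. */
void publish_jobs(job_list *jobs) {
    atomic_store_explicit(&inbox, jobs, memory_order_release);
}

/* I/O thread: take ownership of the list (or NULL if nothing is pending).
 * The acquire exchange orders all subsequent reads of the list after it. */
job_list *take_jobs(void) {
    return atomic_exchange_explicit(&inbox, NULL, memory_order_acquire);
}
```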

hnhyzz commented 2 years ago

> I see. I think what you are proposing is practically possible, some parts of the list checking could be re-ordered before the atomic counter fetch and potentially contain some corrupt information. Practically we should be atomically sending the list to the background threads, not atomically updating an indicator counter.

I think adding a memory fence before and after getIOPendingCount in both the I/O threads and the main thread can make access to the list synchronized. In this case, the counter acts as a mutex between the two threads.
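
A hedged sketch of that proposal using C11 fences (get_pending_with_fences and io_threads_pending_read are illustrative stand-ins, not a patch against the Redis tree):

```c
#include <stdatomic.h>

/* Stand-in for reading one entry of io_threads_pending (illustrative). */
extern unsigned long io_threads_pending_read(int id);

unsigned long get_pending_with_fences(int id) {
    atomic_thread_fence(memory_order_seq_cst);  /* earlier list accesses stay before the read */
    unsigned long n = io_threads_pending_read(id);
    atomic_thread_fence(memory_order_seq_cst);  /* later list accesses stay after the read */
    return n;
}
```

At the hardware level the fences insert the desired barriers; at the C-language level the counter access itself would still need to be atomic for the program to be data-race free.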

madolson commented 2 years ago

> I think adding a memory fence before and after getIOPendingCount in both the I/O threads and the main thread can make access to the list synchronized. In this case, the counter acts as a mutex between the two threads.

Also works.

hnhyzz commented 2 years ago

BTW, the list is shared across different threads, but it is not referenced through volatile pointers. The compiler may reorder accesses to it.

uvletter commented 2 years ago

As far as I know, atomicGetWithSync and atomicSetWithSync already provide the strongest ordering, sequential consistency, which is the same as std::memory_order_seq_cst in C++11. I also tried disassembling the code with the command gcc -O0 -I../deps/hiredis -I../deps/linenoise -I../deps/lua/src -I../deps/hdr_histogram -S networking.c -std=c99, and got something like this:

getIOPendingCount:
.LFB169:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        movl    %edi, -20(%rbp)
        movq    $0, -8(%rbp)
        movl    -20(%rbp), %eax
        cltq
        salq    $3, %rax
        addq    $io_threads_pending, %rax
        movq    (%rax), %rax
        movq    %rax, -8(%rbp)
        movq    -8(%rbp), %rax
        popq    %rbp
        .cfi_def_cfa 7, 8
        ret
        .cfi_endproc
.LFE169:
        .size   getIOPendingCount, .-getIOPendingCount
        .type   setIOPendingCount, @function
setIOPendingCount:
.LFB170:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        movl    %edi, -4(%rbp)
        movq    %rsi, -16(%rbp)
        movl    -4(%rbp), %eax
        cltq
        salq    $3, %rax
        leaq    io_threads_pending(%rax), %rdx
        movq    -16(%rbp), %rax
        movq    %rax, (%rdx)
        mfence
        popq    %rbp
        .cfi_def_cfa 7, 8
        ret
        .cfi_endproc

There is an mfence in setIOPendingCount and nothing in getIOPendingCount. After a search I found that gcc prefers STORE+MFENCE to guarantee the ordering (see https://stackoverflow.com/questions/19047327/why-gcc-does-not-use-loadwithout-fence-and-storesfence-for-sequential-consist), so I think maybe we should just trust the compiler; there is no need for an extra memory barrier.
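
That mapping is easy to reproduce outside the Redis tree with a small test file (a sketch, not from the repository); on x86-64, gcc typically compiles the seq_cst store to mov + mfence (or a locked xchg) and the seq_cst load to a plain mov:

```c
/* seqcst.c - compile with: gcc -O2 -S seqcst.c */
#include <stdatomic.h>

_Atomic unsigned long counter;

void store_seq_cst(unsigned long v) {
    /* Typically mov + mfence (or xchg) on x86-64. */
    atomic_store_explicit(&counter, v, memory_order_seq_cst);
}

unsigned long load_seq_cst(void) {
    /* Typically a plain mov: x86 loads already have acquire semantics under TSO. */
    return atomic_load_explicit(&counter, memory_order_seq_cst);
}
```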

hnhyzz commented 2 years ago

@uvletter @madolson I see. As x86 is TSO, a write followed by an mfence is enough. I guess it would be different on ARM. How about making the list volatile? If it is not marked as volatile, the compiler may not consider it observable by another thread, and may introduce over-optimization.

uvletter commented 2 years ago

@hnhyzz I still believe the language standard defines the semantics of these interfaces, the memory model, and so on, and the compiler obeys the standard and hides the variation between hardware and OSes from us. Maybe the compiler doesn't fully support the standard on some specific platforms, or there are other issues; if that occurs in reality we can fix it at that time. As discussed in https://github.com/redis/redis/pull/11057, I have little experience with volatile, but I guess maybe you are mixing up volatile in C and in Java. In Java volatile makes the variable visible between threads, but in C it only forces a flush from the register to memory, if I remember correctly.

hnhyzz commented 2 years ago

@uvletter I agree that volatile in C only forces the compiler to flush variables from registers to memory. If the list entry is not referenced through a volatile pointer, the latest version may stay in a register, so other threads may observe stale data in memory. Admittedly, the chances are negligible, as the compiler may not be that aggressive in keeping the variable in a register, but the possibility still exists in theory.
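
A small example of the kind of over-optimization being described (illustrative, not Redis code): with a plain flag the compiler is allowed to hoist the load out of the loop and keep it in a register, while volatile forces a reload on every iteration.

```c
/* Illustrative only. At -O2, the plain version may read `stop` once and
 * spin forever; the volatile version reloads it from memory each time.
 * Note that volatile adds no atomicity or hardware-level ordering. */
static int stop;
static volatile int stop_v;

void spin_plain(void)    { while (!stop)   { } }
void spin_volatile(void) { while (!stop_v) { } }
```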

uvletter commented 2 years ago

@hnhyzz Semantically, memory_order_seq_cst makes all loads/stores around it sequentially ordered; for instance, after a store, a subsequent load will observe the newly stored value even in other threads. In the implementation, the list entry doesn't stay in a register but in cache/memory; only temporary state/variables of a function live in registers. movq %rax, (%rdx) in setIOPendingCount shows this, where %rdx holds the address of the io_threads_pending element and %rax is the new value. A stale value may be observed in the L1 cache without extra synchronization, so an mfence is added there.

hnhyzz commented 2 years ago

@uvletter My understanding (according to the cppreference page below) of memory_order_seq_cst in an atomic load scenario is that it only applies a load-acquire operation to the memory being loaded, plus a single total modification order guarantee with respect to other atomic operations.

As x86 is TSO, load-acquire is already guaranteed (no later read or write can be reordered before a read), so the compiler emits nothing more than a normal load, which is also what we have seen in the disassembled getIOPendingCount.

The single total modification order is only guaranteed relative to other atomic operations. This is implemented by inserting an mfence after atomic stores, which we have also seen in setIOPendingCount. Therefore, this rule only governs cross-thread ordering of operations on the atomic variable itself, i.e., the counter.

https://en.cppreference.com/w/cpp/atomic/memory_order#Sequentially-consistent_ordering

If my understanding is correct, getIOPendingCount does not guarantee that non-atomic variables, including the list, are flushed to memory. According to the disassembly the list is indeed flushed to memory, but my understanding is that this is not guaranteed.
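
A classic store-buffering sketch (illustrative, using C11 atomics) shows that division of labor: with seq_cst on both variables the outcome r1 == 0 && r2 == 0 is forbidden, and on x86 it is the mfence after each store that enforces this, while the loads compile to plain movs.

```c
#include <stdatomic.h>

_Atomic int x, y;
int r1, r2;

void thread_a(void) {
    atomic_store_explicit(&x, 1, memory_order_seq_cst);   /* mov + mfence on x86-64 */
    r1 = atomic_load_explicit(&y, memory_order_seq_cst);  /* plain mov */
}

void thread_b(void) {
    atomic_store_explicit(&y, 1, memory_order_seq_cst);
    r2 = atomic_load_explicit(&x, memory_order_seq_cst);
}
```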

uvletter commented 2 years ago

As far as I know, there is no instruction to explicitly control fetching or flushing a variable from/to memory; the cache mapping is managed by the hardware and is transparent to software. For movq %rax, (%rdx), (%rdx) is the variable whose address is stored in %rdx. But that doesn't mean the CPU moves the register value %rax to memory at (%rdx); it moves it to the cache, and the CPU writes the value back from cache to memory at a proper time. That is the hardware's design, not visible to software. From software's perspective, it only operates on registers and memory.