Open hnhyzz opened 2 years ago
It looks like you are linking an older version of the code, it now looks like we do an atomicGet on that value, https://github.com/redis/redis/blob/3a16ad30b7a7918f036b23c6f5cf079c04baab05/src/networking.c#L4047. This should not be re-ordered.
https://github.com/redis/redis/blob/e6f67092f8d4d81761a60c46011d1ff1dc3c2628/src/networking.c#L3518 I have checked the new version, and found atomicGetWithSync() is used to get the value of the counter. atomicGetWithSync() uses atomic_load_explicit with memory_order_seq_cst to fetch memory.
However, still there is no memory barrier added, as I disassembled the binary and found no mfence.
I see. I think what you are proposing is practically possible, some parts of the list checking could be re-ordered before the atomic counter fetch and potentially contain some corrupt information. Practically we should be atomically sending the list to the background threads, not atomically updating an indicator counter.
I think adding a memory fence before and after getIOPendingCount, in both the IO thread and the main thread, would make accesses to the list synchronized. In that case the counter acts as a mutex between the two threads.
Also works.
BTW, the list is shared across different threads, but it is not referenced through volatile pointers. The compiler may reorder accesses to it.
As far as I know, atomicGetWithSync and atomicSetWithSync already provide the strongest ordering, sequential consistency, which is the same as std::memory_order_seq_cst in C++11.
I also tried disassembling the code with the command gcc -O0 -I../deps/hiredis -I../deps/linenoise -I../deps/lua/src -I../deps/hdr_histogram -S networking.c -std=c99, and got something like this:
getIOPendingCount:
.LFB169:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl %edi, -20(%rbp)
movq $0, -8(%rbp)
movl -20(%rbp), %eax
cltq
salq $3, %rax
addq $io_threads_pending, %rax
movq (%rax), %rax
movq %rax, -8(%rbp)
movq -8(%rbp), %rax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE169:
.size getIOPendingCount, .-getIOPendingCount
.type setIOPendingCount, @function
setIOPendingCount:
.LFB170:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl %edi, -4(%rbp)
movq %rsi, -16(%rbp)
movl -4(%rbp), %eax
cltq
salq $3, %rax
leaq io_threads_pending(%rax), %rdx
movq -16(%rbp), %rax
movq %rax, (%rdx)
mfence
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
There is an mfence in setIOPendingCount and nothing in getIOPendingCount. After a search I found gcc prefers STORE+MFENCE to guarantee the ordering (see https://stackoverflow.com/questions/19047327/why-gcc-does-not-use-loadwithout-fence-and-storesfence-for-sequential-consist), so I think we can just trust the compiler; there is no need for an extra memory barrier.
@uvletter @madolson I see. As x86 is TSO, a write followed by an mfence is enough. I guess it would be different on ARM. How about making the list volatile? If it is not marked volatile, the compiler may not consider it observable by another thread and may over-optimize.
@hnhyzz I still believe the language standard defines the semantics of these interfaces, the memory model, and so on, and the compiler obeys the standard and hides the differences between hardware and OSes from us. Maybe a compiler doesn't fully support the standard on some specific platform, or has other issues; if that occurs in reality we can fix it at that time. As discussed in https://github.com/redis/redis/pull/11057, I have little experience with volatile, but I guess maybe you are mixing up volatile in C and in Java. In Java, volatile makes the variable visible between threads, but in C it only forces the value to be flushed from registers to memory, if I remember correctly.
@uvletter I agree that volatile in C only forces the compiler to flush variables from registers to memory. If the list entry is not referred to by a volatile pointer, the latest version may stay in a register, so other threads may observe stale data in memory. Admittedly, the chances are negligible, as the compiler may not be that aggressive in keeping variables in registers, but there is still a theoretical chance.
@hnhyzz In semantics, memory_order_seq_cst makes all loads/stores around it sequentially consistent; for instance, after a store, a following load will observe the newly stored value, even from other threads. In implementation, the list entry wouldn't stay in a register but in cache/memory; only the temporary state/variables of a function live in registers. movq %rax, (%rdx) in setIOPendingCount shows this: %rdx holds the address of the io_threads_pending element, and %rax is the new value. A stale value might be observed in the L1 cache without extra synchronization, which is why an mfence is added here.
@uvletter My understanding (according to the cppreference page below) of memory_order_seq_cst in an atomic-get scenario is that it only applies a load-acquire operation to the memory being loaded, plus a single-total-modification-ordering guarantee relative to other atomic operations.
As x86 is TSO, load-acquire is already guaranteed (no later read or write can be reordered before a read), so the compiler does nothing more than a normal load, which is also what we saw in the disassembled getIOPendingCount.
The single total modification ordering is only guaranteed relative to other atomic operations. This is implemented by inserting mfence into atomic stores, which we also saw in setIOPendingCount. Therefore, this rule only guarantees cross-thread ordering of operations on the atomic variable itself, i.e., the counter.
https://en.cppreference.com/w/cpp/atomic/memory_order#Sequentially-consistent_ordering
If my understanding is correct, getIOPendingCount does not guarantee that non-atomic variables, including the list, are flushed to memory. According to the disassembly the value is indeed flushed to memory, but my understanding is that this is not guaranteed.
As far as I know there are no instructions to explicitly control fetching or flushing a variable from/to memory; the cache mapping is managed by hardware, transparently to software. For movq %rax, (%rdx), (%rdx) is the variable whose address is stored in %rdx. But that doesn't mean the CPU moves the register value %rax directly to memory at (%rdx); it moves it to the cache, and the CPU writes the value back from cache to memory at a proper time. That is the hardware's design, invisible to software. From software's perspective, it only operates on registers and memory.
https://github.com/redis/redis/blob/5895d119b1c2825ff0394f30e246e036c3972bc5/src/networking.c#L3091
A CPU with a relaxed memory model may reorder the later read of the list before the counter read, so a stale list may be read.