katsusan commented 3 years ago

1. summary

When instructions refer to different memory locations, there are 4 possible reorderings of loads/reads from memory and stores/writes to memory:

writes can be reordered ahead of other reads
writes can be reordered ahead of other writes
reads can be reordered ahead of other reads
reads can be reordered ahead of other writes

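On x86 only the last case can occur, which is why x86 is said to have a strong memory model. For example, assume x and y are both initially 0 and two threads each execute one of the instruction sequences below: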

// thread 1
mov [x],1   ; store 1 to address x
mov eax,[y] ; load from address y into eax

// thread 2
mov [y],1   ; store 1 to address y
mov ebx,[x] ; load from address x into ebx

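On x86, eax = 0 and ebx = 0 is one of the possible outcomes. Under any plain interleaving of the two threads, at least one of the loads runs after the other thread's store, so this outcome means a reorder must have happened. In each thread the store and the load target different addresses, so the CPU sees no dependency between them that would stop it from reordering them.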

On architectures such as ARM and PowerPC all four of the cases above can occur; this is called a weak memory model.
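The two-thread example above can be checked exhaustively with a small model. The following is a minimal sketch, not a hardware-accurate simulator: `outcomes` is a hypothetical helper that explores every interleaving of a tiny program, either with stores going straight to memory or with a per-thread FIFO store buffer, and collects the final register values.

```python
def outcomes(threads, store_buffers=True):
    """Enumerate every reachable final register state of a tiny litmus program.

    threads: list of instruction lists; each instruction is
             ("st", addr, value) or ("ld", reg, addr). Memory starts at 0.
    """
    results, seen = set(), set()

    def put(frozen, key, value):
        # Copy a frozen mapping (sorted tuple of pairs) with one key updated.
        d = dict(frozen)
        d[key] = value
        return tuple(sorted(d.items()))

    def step(pcs, bufs, mem, regs):
        state = (pcs, bufs, mem, regs)
        if state in seen:
            return
        seen.add(state)
        if all(pc == len(t) for pc, t in zip(pcs, threads)):
            results.add(regs)  # all loads done; later drains cannot change regs
            return
        for i, prog in enumerate(threads):
            if bufs[i]:  # drain the oldest buffered store to memory (FIFO)
                addr, val = bufs[i][0]
                step(pcs, bufs[:i] + (bufs[i][1:],) + bufs[i + 1:],
                     put(mem, addr, val), regs)
            if pcs[i] == len(prog):
                continue
            op, a, b = prog[pcs[i]]
            nxt = pcs[:i] + (pcs[i] + 1,) + pcs[i + 1:]
            if op == "st" and store_buffers:
                step(nxt, bufs[:i] + (bufs[i] + ((a, b),),) + bufs[i + 1:],
                     mem, regs)
            elif op == "st":
                step(nxt, bufs, put(mem, a, b), regs)
            else:  # load: newest matching buffered store wins, else memory
                fwd = [v for addr, v in bufs[i] if addr == b]
                val = fwd[-1] if fwd else dict(mem).get(b, 0)
                step(nxt, bufs, mem, put(regs, a, val))

    n = len(threads)
    step((0,) * n, ((),) * n, (), ())
    return results


# The example above: each thread stores to one location, then loads the other.
sb = [[("st", "x", 1), ("ld", "eax", "y")],
      [("st", "y", 1), ("ld", "ebx", "x")]]
both_zero = (("eax", 0), ("ebx", 0))
print(both_zero in outcomes(sb, store_buffers=False))
print(both_zero in outcomes(sb, store_buffers=True))
```

In this model eax = ebx = 0 is unreachable when stores hit memory immediately and becomes reachable once each thread buffers its stores, which matches the reordering described above.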

2. TSO model

Another way to predict x86 processor behaviour is a model in which each hardware thread has a FIFO buffer for writes, while reads are immediate and not buffered. When reading a memory location, the thread first looks it up in its own FIFO write buffer.

This model is called total store ordering (TSO). Mutually exclusive memory access additionally needs a global lock: for example, a lock-prefixed instruction flushes the write buffer to memory.
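As a rough illustration of that description, here is a minimal sketch of one hardware thread with a store buffer. The class `HardwareThread` and its method names are hypothetical, and `locked_op` only models the flushing effect of a lock-prefixed instruction, not the global lock itself.

```python
from collections import deque


class HardwareThread:
    """One hardware thread with a private FIFO store buffer (hypothetical model)."""

    def __init__(self, memory):
        self.memory = memory          # shared dict: address -> value
        self.store_buffer = deque()   # pending (address, value), oldest first

    def store(self, addr, value):
        # Stores are queued, not written to shared memory immediately.
        self.store_buffer.append((addr, value))

    def load(self, addr):
        # A load checks this thread's own buffer first (newest entry wins) ...
        for a, v in reversed(self.store_buffer):
            if a == addr:
                return v
        # ... and otherwise reads shared memory directly.
        return self.memory.get(addr, 0)

    def drain_one(self):
        # The hardware may retire the oldest buffered store at any time.
        if self.store_buffer:
            addr, value = self.store_buffer.popleft()
            self.memory[addr] = value

    def locked_op(self):
        # Modeled effect of a lock-prefixed instruction: flush the whole buffer.
        while self.store_buffer:
            self.drain_one()


memory = {}
t1, t2 = HardwareThread(memory), HardwareThread(memory)
t1.store("x", 1)
own_view = t1.load("x")      # t1 sees its own buffered store
other_view = t2.load("x")    # t2 still reads the old value from memory
t1.locked_op()
after_flush = t2.load("x")   # visible to t2 once the buffer is flushed
```

The gap between `own_view` and `other_view` is the window in which another thread can still observe the old value.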

refer: http://bajamircea.github.io/coding/cpp/2019/10/25/cpu-memory-model.html

katsusan commented 3 years ago

https://preshing.com/20120913/acquire-and-release-semantics/
http://blog.forecode.com/2010/01/29/barriers-to-understanding-memory-barriers/

Types of Memory Barrier: LoadLoad, LoadStore, StoreLoad, StoreStore.

Acquire semantics is a property that can only apply to operations that read from shared memory, whether they are read-modify-write operations or plain loads. The operation is then considered a read-acquire. Acquire semantics prevent memory reordering of the read-acquire with any read or write operation that follows it in program order.

LoadLoad+LoadStore prevents loads from moving down, and anything from moving up. Acquire semantics can be implemented with a combination of LoadLoad and LoadStore barriers.

Release semantics is a property that can only apply to operations that write to shared memory, whether they are read-modify-write operations or plain stores. The operation is then considered a write-release. Release semantics prevent memory reordering of the write-release with any read or write operation that precedes it in program order.

StoreStore+LoadStore prevents stores from moving up, and anything from moving down. Release semantics can be implemented with a combination of LoadStore and StoreStore barriers.
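The barrier combinations can be written down as a small table. This is a minimal sketch with hypothetical names (`ACQUIRE`, `RELEASE`, `X86_ORDERED`, `may_reorder`): each barrier type is a pair (earlier operation, later operation) that it keeps in order, and `X86_ORDERED` encodes the summary above that x86 only lets a load move ahead of an earlier store.

```python
# A barrier type "XY" keeps an earlier X from being reordered with a later Y.
LOADLOAD = ("load", "load")
LOADSTORE = ("load", "store")
STORELOAD = ("store", "load")
STORESTORE = ("store", "store")

ACQUIRE = {LOADLOAD, LOADSTORE}     # placed after the read-acquire
RELEASE = {LOADSTORE, STORESTORE}   # placed before the write-release

# Orderings x86 already keeps for ordinary loads and stores, per the summary:
# everything except a load moving ahead of an earlier store.
X86_ORDERED = {LOADLOAD, LOADSTORE, STORESTORE}


def may_reorder(earlier, later, barriers):
    """True if `earlier` and `later` (each "load" or "store") may still swap
    across a fence built from the given barrier types."""
    return (earlier, later) not in barriers


# Neither acquire nor release stops a later load from passing an earlier store.
print(may_reorder("store", "load", ACQUIRE))
print(may_reorder("store", "load", RELEASE))
# Both sets are contained in what x86 orders anyway.
print(ACQUIRE <= X86_ORDERED, RELEASE <= X86_ORDERED)
```

Only StoreLoad is missing from both sets, which is the one reordering the x86 summary allows.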

katsusan commented 3 years ago

note 1: // why are x86/amd64 said to have a strong memory model?

katsusan commented 3 years ago

note 2: // 64-ia-32-architectures-software-developer-manual // 8.2.2 Memory Ordering in P6 and More Recent Processor Families

In a single-processor system for memory regions defined as write-back cacheable, the memory-ordering model respects the following principles (some exceptions are omitted):

Reads are not reordered with other reads.
Writes are not reordered with older reads.
Writes to memory are not reordered with other writes.
Reads may be reordered with older writes to different locations but not with older writes to the same location.
Reads or writes cannot be reordered with I/O instructions, locked instructions, or serializing instructions.
Reads cannot pass earlier LFENCE and MFENCE instructions.
Writes cannot pass earlier LFENCE, SFENCE, and MFENCE instructions.
LFENCE instructions cannot pass earlier reads.
SFENCE instructions cannot pass earlier writes.
MFENCE instructions cannot pass earlier reads or writes.

In a multiple-processor system, the following ordering principles apply:

Individual processors use the same ordering principles as in a single-processor system.
Writes by a single processor are observed in the same order by all processors.
Writes from an individual processor are NOT ordered with respect to the writes from other processors.
Memory ordering obeys causality (memory ordering respects transitive visibility).
Any two stores are seen in a consistent order by processors other than those performing the stores.
Locked instructions have a total order.

katsusan commented 3 years ago

note 3:

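Neither Loads Nor Stores Are Reordered with Like Operations

The Intel-64 memory-ordering model does not allow a load or a store to be reordered with another operation of the same type. For example, with two threads P1 and P2:

P1	P2
mov [_x], 1	mov r1, [_y]
mov [_y], 1	mov r2, [_x]

With _x and _y initially 0, the outcome r1 = 1 and r2 = 0 cannot occur. It would require either P1's two stores or P2's two loads to be reordered, which violates the memory model described above.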
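This example can be checked against the store-buffer (TSO) model from the earlier section. The following is a minimal sketch under that assumption; `tso_outcomes` is a hypothetical helper that explores every interleaving with one FIFO store buffer per thread and in-order loads.

```python
def tso_outcomes(threads):
    """Final register states reachable with one FIFO store buffer per thread.

    Instructions are ("st", addr, value) or ("ld", reg, addr); memory starts at 0.
    """
    results, seen = set(), set()

    def put(frozen, key, value):
        # Copy a frozen mapping (sorted tuple of pairs) with one key updated.
        d = dict(frozen)
        d[key] = value
        return tuple(sorted(d.items()))

    def step(pcs, bufs, mem, regs):
        state = (pcs, bufs, mem, regs)
        if state in seen:
            return
        seen.add(state)
        if all(pc == len(t) for pc, t in zip(pcs, threads)):
            results.add(regs)
            return
        for i, prog in enumerate(threads):
            if bufs[i]:  # drain the oldest buffered store (keeps store order)
                addr, val = bufs[i][0]
                step(pcs, bufs[:i] + (bufs[i][1:],) + bufs[i + 1:],
                     put(mem, addr, val), regs)
            if pcs[i] == len(prog):
                continue
            op, a, b = prog[pcs[i]]
            nxt = pcs[:i] + (pcs[i] + 1,) + pcs[i + 1:]
            if op == "st":
                step(nxt, bufs[:i] + (bufs[i] + ((a, b),),) + bufs[i + 1:],
                     mem, regs)
            else:  # loads execute in program order
                fwd = [v for addr, v in bufs[i] if addr == b]
                val = fwd[-1] if fwd else dict(mem).get(b, 0)
                step(nxt, bufs, mem, put(regs, a, val))

    n = len(threads)
    step((0,) * n, ((),) * n, (), ())
    return results


p1 = [("st", "_x", 1), ("st", "_y", 1)]
p2 = [("ld", "r1", "_y"), ("ld", "r2", "_x")]
results = tso_outcomes([p1, p2])
forbidden = (("r1", 1), ("r2", 0))
print(forbidden in results)
```

Because the buffer is FIFO, _x always reaches memory before _y, so a reader that sees the new _y also sees the new _x.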

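Stores Are Not Reordered With Earlier Loads

The Intel-64 memory-ordering model guarantees that a store is never reordered ahead of an earlier load. For example, with two threads P1 and P2:

P1	P2
mov r1, [_x]	mov r2, [_y]
mov [_y], 1	mov [_x], 1

With _x and _y initially 0, the outcome r1 = 1 and r2 = 1 cannot occur. The reasoning:

Assume r1 = 1.
Then P2's mov [_x], 1 has already executed.
Because the x86-64 memory model does not allow a store to be reordered ahead of an earlier load, P2's mov r2, [_y] happened before its mov [_x], 1.
Likewise, P1's mov r1, [_x] happened before its mov [_y], 1.
Therefore P2's mov r2, [_y] happened before P1's mov [_y], 1, so r2 cannot be 1.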
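The argument above amounts to finding a cycle in a "happened before" graph. As a minimal sketch (the helper `has_cycle` and the edge lists are illustrative, not part of the manual), the ordering constraints can be written as edges and checked mechanically:

```python
def has_cycle(edges):
    """Detect a cycle in a directed graph given as (before, after) pairs."""
    graph = {}
    for before, after in edges:
        graph.setdefault(before, []).append(after)
        graph.setdefault(after, [])
    state = {}  # node -> "visiting" or "done"

    def visit(node):
        if state.get(node) == "visiting":
            return True           # back edge: the ordering is circular
        if state.get(node) == "done":
            return False
        state[node] = "visiting"
        if any(visit(nxt) for nxt in graph[node]):
            return True
        state[node] = "done"
        return False

    return any(visit(node) for node in graph)


# A store is never moved ahead of an earlier load in the same thread.
program_order = [
    ("P1: mov r1, [_x]", "P1: mov [_y], 1"),
    ("P2: mov r2, [_y]", "P2: mov [_x], 1"),
]
# A load that returns 1 must come after the store that wrote the 1.
observed = [
    ("P2: mov [_x], 1", "P1: mov r1, [_x]"),  # r1 = 1
    ("P1: mov [_y], 1", "P2: mov r2, [_y]"),  # r2 = 1
]
both_ones = has_cycle(program_order + observed)
only_r1 = has_cycle(program_order + observed[:1])
```

Requiring both r1 = 1 and r2 = 1 closes the cycle, while either one alone leaves the graph acyclic.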

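Loads May Be Reordered with Earlier Stores to Different Locations

The Intel-64 memory-ordering model allows a load to be reordered ahead of an earlier store to a different address, but not ahead of an earlier store to the same address. For example, with two threads P1 and P2:

P1	P2
mov [_x], 1	mov [_y], 1
mov r1, [_y]	mov r2, [_x]

With _x and _y initially 0, the outcome r1 = 0 and r2 = 0 is allowed. In each thread the two instructions access different addresses, so the load may be reordered ahead of the store and read the old value into r1/r2 before the stores to _x and _y take effect.

If the instructions are instead mov [_x], 1 followed by mov r1, [_x], then with _x initially 0, r1 cannot be 0, because both instructions access the same address _x.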

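Intra-Processor Forwarding Is Allowed

The memory-ordering model allows concurrent stores by two processors to be seen in different orders by those two processors. For example, with two threads P1 and P2:

P1	P2
mov [_x], 1	mov [_y], 1
mov r1, [_x]	mov r3, [_y]
mov r2, [_y]	mov r4, [_x]

With _x and _y initially 0, the outcome r2 = 0 and r4 = 0 is allowed. The memory model imposes no constraints on the order in which the two stores appear to execute as seen by the two processors performing them: each processor may see its own store, forwarded from its store buffer, before it sees the other processor's store.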
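Store-buffer forwarding is enough to produce this outcome. The following is a minimal sketch, assuming the FIFO store-buffer model from the TSO section; `outcomes` is a hypothetical helper that enumerates all interleavings with or without store buffers.

```python
def outcomes(threads, store_buffers=True):
    """Final register states of a litmus program; memory starts at 0.

    Instructions are ("st", addr, value) or ("ld", reg, addr).
    """
    results, seen = set(), set()

    def put(frozen, key, value):
        # Copy a frozen mapping (sorted tuple of pairs) with one key updated.
        d = dict(frozen)
        d[key] = value
        return tuple(sorted(d.items()))

    def step(pcs, bufs, mem, regs):
        state = (pcs, bufs, mem, regs)
        if state in seen:
            return
        seen.add(state)
        if all(pc == len(t) for pc, t in zip(pcs, threads)):
            results.add(regs)
            return
        for i, prog in enumerate(threads):
            if bufs[i]:  # drain the oldest buffered store to memory
                addr, val = bufs[i][0]
                step(pcs, bufs[:i] + (bufs[i][1:],) + bufs[i + 1:],
                     put(mem, addr, val), regs)
            if pcs[i] == len(prog):
                continue
            op, a, b = prog[pcs[i]]
            nxt = pcs[:i] + (pcs[i] + 1,) + pcs[i + 1:]
            if op == "st" and store_buffers:
                step(nxt, bufs[:i] + (bufs[i] + ((a, b),),) + bufs[i + 1:],
                     mem, regs)
            elif op == "st":
                step(nxt, bufs, put(mem, a, b), regs)
            else:  # forwarding: the thread's own newest buffered store wins
                fwd = [v for addr, v in bufs[i] if addr == b]
                val = fwd[-1] if fwd else dict(mem).get(b, 0)
                step(nxt, bufs, mem, put(regs, a, val))

    n = len(threads)
    step((0,) * n, ((),) * n, (), ())
    return results


p1 = [("st", "_x", 1), ("ld", "r1", "_x"), ("ld", "r2", "_y")]
p2 = [("st", "_y", 1), ("ld", "r3", "_y"), ("ld", "r4", "_x")]
target = (("r1", 1), ("r2", 0), ("r3", 1), ("r4", 0))
with_buffers = outcomes([p1, p2], store_buffers=True)
without_buffers = outcomes([p1, p2], store_buffers=False)
print(target in with_buffers, target in without_buffers)
```

In the buffered run each thread reads its own store back from its buffer (r1 = r3 = 1) while the other thread's store has not reached memory yet, so r2 = r4 = 0 is reachable.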

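Stores Are Transitively Visible

The memory-ordering model ensures transitive visibility of stores; stores that are causally related appear to all processors to occur in an order consistent with the causality relation. For example, with three threads P1, P2 and P3:

P1	P2	P3
mov [_x], 1	mov r1, [_x]	mov r2, [_y]
	mov [_y], 1	mov r3, [_x]

With _x and _y initially 0, the outcome r1 = 1, r2 = 1, r3 = 0 cannot occur.

Assume r1 = 1 and r2 = 1.
Since r1 = 1, P1's store happened before P2's load.
By the rule that stores are not reordered with earlier loads, P2's load happened before P2's store, so P1's store happened before P2's store (it causally precedes it).
Because P1's store causally precedes P2's store, the memory model guarantees that all processors observe the two stores in that order.
Since r2 = 1, P2's store happened before P3's load of _y.
The x86 memory model does not reorder loads with other loads, so P3's two loads execute in program order.
In summary, P1's store precedes P2's store, which precedes both of P3's loads, so r3 cannot be 0.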
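The same conclusion can be reached by brute force. The following is a minimal sketch, assuming the FIFO store-buffer (TSO) model with a single shared memory; `tso_outcomes` is a hypothetical helper, not something taken from the manual.

```python
def tso_outcomes(threads):
    """Final register states reachable with one FIFO store buffer per thread.

    Instructions are ("st", addr, value) or ("ld", reg, addr); memory starts at 0.
    """
    results, seen = set(), set()

    def put(frozen, key, value):
        # Copy a frozen mapping (sorted tuple of pairs) with one key updated.
        d = dict(frozen)
        d[key] = value
        return tuple(sorted(d.items()))

    def step(pcs, bufs, mem, regs):
        state = (pcs, bufs, mem, regs)
        if state in seen:
            return
        seen.add(state)
        if all(pc == len(t) for pc, t in zip(pcs, threads)):
            results.add(regs)
            return
        for i, prog in enumerate(threads):
            if bufs[i]:  # drain the oldest buffered store to shared memory
                addr, val = bufs[i][0]
                step(pcs, bufs[:i] + (bufs[i][1:],) + bufs[i + 1:],
                     put(mem, addr, val), regs)
            if pcs[i] == len(prog):
                continue
            op, a, b = prog[pcs[i]]
            nxt = pcs[:i] + (pcs[i] + 1,) + pcs[i + 1:]
            if op == "st":
                step(nxt, bufs[:i] + (bufs[i] + ((a, b),),) + bufs[i + 1:],
                     mem, regs)
            else:  # a thread can only forward from its OWN buffer
                fwd = [v for addr, v in bufs[i] if addr == b]
                val = fwd[-1] if fwd else dict(mem).get(b, 0)
                step(nxt, bufs, mem, put(regs, a, val))

    n = len(threads)
    step((0,) * n, ((),) * n, (), ())
    return results


p1 = [("st", "_x", 1)]
p2 = [("ld", "r1", "_x"), ("st", "_y", 1)]
p3 = [("ld", "r2", "_y"), ("ld", "r3", "_x")]
results = tso_outcomes([p1, p2, p3])
forbidden = (("r1", 1), ("r2", 1), ("r3", 0))
print(forbidden in results)
```

In this model P2 can only read _x = 1 from shared memory, so by the time its own store to _y is visible to P3, the store to _x is visible to P3 as well.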

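Stores Are Seen in a Consistent Order by Other Processors

As noted in Section 8.2.3.5, the memory-ordering model allows stores by two processors to be seen in different orders by those two processors. However, any two stores must appear to execute in the same order to all processors other than those performing the stores. Section 8.2.3.5 is the intra-processor forwarding rule above (?TODO).

The manual's example for this rule uses four processors, with _x and _y initially 0:

Processor 0	Processor 1	Processor 2	Processor 3
mov [_x], 1	mov [_y], 1	mov r1, [_x]	mov r3, [_y]
		mov r2, [_y]	mov r4, [_x]

The outcome r1 = 1, r2 = 0, r3 = 1, r4 = 0 cannot occur:

- Loads are not reordered with other loads, so neither Processor 2's nor Processor 3's loads are reordered.
- If r1 = 1 and r2 = 0, then from Processor 2's point of view Processor 0's store precedes Processor 1's store.
- Likewise, if r3 = 1 and r4 = 0, then from Processor 3's point of view Processor 1's store precedes Processor 0's store.
- As stated above, all processors other than those performing the stores must observe the same store order, so this outcome cannot occur.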
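A four-processor test of this rule (two writers, two readers that read _x and _y in opposite orders) can be enumerated in the store-buffer model. This is a minimal sketch: `tso_outcomes` is a hypothetical helper, and the four-processor program is written out here as an assumption about the example the bullets refer to, with r3 and r4 as the second reader's registers.

```python
def tso_outcomes(threads):
    """Final register states reachable with one FIFO store buffer per thread.

    Instructions are ("st", addr, value) or ("ld", reg, addr); memory starts at 0.
    """
    results, seen = set(), set()

    def put(frozen, key, value):
        # Copy a frozen mapping (sorted tuple of pairs) with one key updated.
        d = dict(frozen)
        d[key] = value
        return tuple(sorted(d.items()))

    def step(pcs, bufs, mem, regs):
        state = (pcs, bufs, mem, regs)
        if state in seen:
            return
        seen.add(state)
        if all(pc == len(t) for pc, t in zip(pcs, threads)):
            results.add(regs)
            return
        for i, prog in enumerate(threads):
            if bufs[i]:  # stores reach the single shared memory one at a time
                addr, val = bufs[i][0]
                step(pcs, bufs[:i] + (bufs[i][1:],) + bufs[i + 1:],
                     put(mem, addr, val), regs)
            if pcs[i] == len(prog):
                continue
            op, a, b = prog[pcs[i]]
            nxt = pcs[:i] + (pcs[i] + 1,) + pcs[i + 1:]
            if op == "st":
                step(nxt, bufs[:i] + (bufs[i] + ((a, b),),) + bufs[i + 1:],
                     mem, regs)
            else:
                fwd = [v for addr, v in bufs[i] if addr == b]
                val = fwd[-1] if fwd else dict(mem).get(b, 0)
                step(nxt, bufs, mem, put(regs, a, val))

    n = len(threads)
    step((0,) * n, ((),) * n, (), ())
    return results


proc0 = [("st", "_x", 1)]
proc1 = [("st", "_y", 1)]
proc2 = [("ld", "r1", "_x"), ("ld", "r2", "_y")]
proc3 = [("ld", "r3", "_y"), ("ld", "r4", "_x")]
results = tso_outcomes([proc0, proc1, proc2, proc3])
forbidden = (("r1", 1), ("r2", 0), ("r3", 1), ("r4", 0))
print(forbidden in results)
```

Both readers only ever read the one shared memory here, so they cannot disagree about which store arrived first.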

Locked Instructions Have a Total Order

The memory-ordering model ensures that all processors agree on a single execution order of all locked instructions, including those that are larger than 8 bytes or are not naturally aligned.

Loads and Stores Are Not Reordered with Locked Instructions

The memory-ordering model prevents loads and stores from being reordered with locked instructions that execute earlier or later. // lock-free programming

katsusan / gowiki

cpu-memory-model #16
