loongson-community / discussions

Cross-community issue tracker & discussions

[GCC] 16-byte atomic (for GCC 15) #16

Open xry111 opened 7 months ago

xry111 commented 7 months ago

As GCC 14 stage 1 has ended, this is deferred to GCC 15.

My current understanding (maybe incorrect):

Some basic tests show vst and vld seem to be atomic for aligned accesses (TODO: still needs more testing and/or confirmation), so we can use them together with dbar for atomic load and store.
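
For illustration, the load/store side might look like this (a sketch only: it assumes the vld/vst atomicity that is still TODO above, and using 0b10010 as the release hint is an assumption based on the hint scheme discussed below, not something confirmed in this thread):

```asm
# Hypothetical 16-byte atomic load/store, assuming aligned vld/vst
# are single-copy atomic (still unconfirmed).

# atomic_load_16(a0: *i128) -> vr0, acquire ordering
vld   vr0, a0, 0
dbar  0b10100        # acquire: keep later accesses after the load

# atomic_store_16(a0: *i128, vr0: i128), release ordering
dbar  0b10010        # release: keep earlier accesses before the store
vst   vr0, a0, 0
```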

16-byte RMW operations:

```asm
// atomic_add_1(a0_ptr: *i128) -> i128;
1:
ll.acq.d t0, a0, 0    # "acq" to prevent a reorder with the next load operation?
ldptr.d  a1, a0, 8
addi.d   t0, t0, 1    # for example, atomically adding 1
sltui    t2, t0, 1    # carry out of the low half iff the sum wrapped to 0
add.d    a1, a1, t2   # propagate the carry into the high half
move     t2, t0       # back up t0 because it'll be clobbered by sc.q
sc.q     t0, a1, a0
beqz     t0, 1b
move     a0, t2       # return the new value in {a1:a0}
ret
```

16-byte CAS operation:

```asm
// atomic_cas(a0_ptr: *i128, a1a2_exp: i128, a3a4_newval: i128) -> bool;
move     t0, a0       # keep the pointer; a0 will hold the result
1:
ll.acq.d t1, t0, 0
ldptr.d  t2, t0, 8
bne      t1, a1, 2f
bne      t2, a2, 2f
move     a0, a3       # stage the new low half; sc.q clobbers it
sc.q     a0, a4, t0
beqz     a0, 1b
b        3f
2:
move     a0, zero
dbar     0b10100      # only needed when memorder_fail requires acquire
3:
ret
```
xry111 commented 7 months ago

> Some basic tests show vst and vld seem to be atomic for aligned accesses (TODO: still needs more testing and/or confirmation), so we can use them together with dbar for atomic load and store.

To do this we need to implement the movti pattern for reloading TImode into FP_REGS first.
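
At the machine level that reload is just moving a TImode value between a GPR pair and an LSX register, roughly like this (a sketch; register choices are arbitrary):

```asm
# GPR pair {a1:a0} -> vr0, so a single vst can store all 128 bits
vinsgr2vr.d  vr0, a0, 0    # low 64 bits into element 0
vinsgr2vr.d  vr0, a1, 1    # high 64 bits into element 1

# ...and back out of vr0 after a vld
vpickve2gr.d a0, vr0, 0
vpickve2gr.d a1, vr0, 1
```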

jiegec commented 7 months ago

> Some basic tests show vst and vld seem to be atomic for aligned accesses (TODO: still needs more testing and/or confirmation), so we can use them together with dbar for atomic load and store.

Yes, see https://lists.gnu.org/archive/html/qemu-devel/2023-09/msg00439.html.

Also, the expected usage pattern for sc.q: https://lists.gnu.org/archive/html/qemu-devel/2023-10/msg09201.html

xry111 commented 7 months ago

> > Some basic tests show vst and vld seem to be atomic for aligned accesses (TODO: still needs more testing and/or confirmation), so we can use them together with dbar for atomic load and store.
>
> Yes, see https://lists.gnu.org/archive/html/qemu-devel/2023-09/msg00439.html

Thanks!

xry111 commented 7 months ago

> Also, the expected usage pattern for sc.q: https://lists.gnu.org/archive/html/qemu-devel/2023-10/msg09201.html

I don't think it's correct. In https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg10341.html:

> I think dbar 0x7000 [sic] ensures that the 2 loads in 'll.q' [sic] are a 128bit atomic operation.

This just makes no sense even if I change 0x7000 to 0x700 and ll.q to ll.d. The atomicity is guaranteed by the ll-sc loop, not by the ll instruction alone.

CPUCFG word 3 bit 5 is defined as "the ll instruction includes a dbar semantic", and it's 1 on all real LoongArch hardware. @heiher once defined it as:

ll = <memory barrier> + <linked load>

And I'm not sure if there is another <memory barrier> implied after the <linked load>. It's likely "no", or why would we have added ll.acq at all?!

If my deduction is correct, dbar 0x700 won't suffice, because it only guarantees that loads from the same address are sequenced, but here we are loading from two different (though adjacent) addresses. So without an acquire barrier we may end up with

ll.d + ld.d = <memory barrier> + <normal load reordered here> + <linked load>

This will blow up. So IMO we need

ll.acq.d + ld.d = <memory barrier> + <linked load> + <acquire barrier> + <normal load>

full stop.

heiher commented 7 months ago

Thanks for pointing it out; it helps the community write good code.

Strict:

```asm
ll.d   lo, addr, 0
dbar   0b10101        # load/load
ld.d   hi, addr, 8
sc.q   lo, hi, addr
```

Using dbar 0x700 instead of dbar load/load creates optimization opportunities for the microarchitecture. This is safe on existing hardware and will be clarified in the specification in the future. (We might be able to think of "the same address" as "the same cache line". :smile:)
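
Wrapping the strict sequence in the usual retry loop, a full 16-byte atomic exchange might look like the following (a sketch only, untested; any extra memory-order barriers are omitted, as in the fragment above):

```asm
// atomic_exchange_16(a0_ptr: *i128, a1a2_new: i128) -> i128 (old value)
move    t0, a0       # keep the pointer; a0/a1 will carry the return value
1:
ll.d    t1, t0, 0    # old low half; opens the LL/SC pair
dbar    0b10101      # load/load, per the strict pattern
ld.d    t2, t0, 8    # old high half
move    t3, a1       # stage the new low half; sc.q clobbers rd
sc.q    t3, a2, t0   # try to store {a2:a1}
beqz    t3, 1b       # reservation lost: redo the whole read
move    a0, t1       # return the old value in {a1:a0}
move    a1, t2
ret
```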

xry111 commented 7 months ago

> Using dbar 0x700 instead of dbar load/load creates optimization opportunities for the microarchitecture. This is safe on existing hardware and will be clarified in the specification in the future. (We might be able to think of "the same address" as "the same cache line". 😄)

So IIUC, on an LA664 (or any CPU enumerating CPUCFG word 3 bit 23), two loads from the same cache line won't be reordered?
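
For reference, probing that bit at runtime might look like this (a sketch, untested):

```asm
# Read CPUCFG word 3 and extract bit 23 (LD_SEQ_SA)
li.w        a0, 3            # CPUCFG word index
cpucfg      a0, a0
bstrpick.w  a0, a0, 23, 23   # a0 = 1 if same-address loads are sequenced
```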

heiher commented 7 months ago

> > Using dbar 0x700 instead of dbar load/load creates optimization opportunities for the microarchitecture. This is safe on existing hardware and will be clarified in the specification in the future. (We might be able to think of "the same address" as "the same cache line". 😄)
>
> So IIUC, on an LA664 (or any CPU enumerating CPUCFG word 3 bit 23), two loads from the same cache line won't be reordered?

For hardware LD_SEQ_SA, I can't confirm that right now.