Open xry111 opened 7 months ago
Some basic tests show vst and vld seem atomic for aligned accesses (TODO: still needs more testing and/or confirmation), so we can use them with dbar for atomic load and store.
To do this we need to implement the movti pattern for reloading TImode into FP_REGS first.
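If vld/vst do turn out to be single-copy atomic for aligned 16-byte accesses, the resulting load/store sequences might look roughly like this (a sketch only, in the thread's informal notation; the dbar 0 full barriers are the conservative choice and could likely be weakened to acquire/release hints):

```asm
# hypothetical seq_cst 16-byte atomic load, assuming vld is
# single-copy atomic for aligned addresses:
    vld   vr0, addr + 0     # 128-bit load into an LSX register
    dbar  0                 # full barrier: no later access moves before the load

# hypothetical seq_cst 16-byte atomic store:
    dbar  0                 # full barrier: no earlier access moves past this point
    vst   vr0, addr + 0     # 128-bit store from an LSX register
    dbar  0                 # full barrier: make the store visible before later accesses
```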
> Some basic tests show vst and vld seem atomic for aligned accesses (TODO: still needs more testing and/or confirmation), so we can use them with dbar for atomic load and store.
Yes, see https://lists.gnu.org/archive/html/qemu-devel/2023-09/msg00439.html.
Also, the expected usage pattern for sc.q: https://lists.gnu.org/archive/html/qemu-devel/2023-10/msg09201.html
> Yes, see https://lists.gnu.org/archive/html/qemu-devel/2023-09/msg00439.html
Thanks!
> Also, the expected usage pattern for sc.q: https://lists.gnu.org/archive/html/qemu-devel/2023-10/msg09201.html
I don't think it's correct. In https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg10341.html:
> I think dbar 0x7000 [sic] ensures that the 2 loads in 'll.q' [sic] are a 128bit atomic operation.
This just makes no sense even if I change 0x7000 to 0x700 and ll.q to ll.d. The atomicity is guaranteed by the ll-sc loop, not by the ll instruction alone.
The CPUCFG word 3 bit 5 is defined as "the ll instruction includes a dbar semantic", and it's 1 for all real LoongArch hardware. @heiher once defined it as:
ll = <memory barrier> + <linked load>
And I'm not sure if there is another <memory barrier> implied after the <linked load>. It's likely "no", or why did we add ll.acq at all?! If my deduction is correct, dbar 0x700 won't suffice because it only guarantees that loads to the same address are sequenced, but here we are loading from two different (though adjacent) addresses. So without an acquire barrier we may end up with
ll.d + ld.d = <memory barrier> + <normal load reordered here> + <linked load>
This will blow up. So IMO we need
ll.acq.d + ld.d = <memory barrier> + <linked load> + <acquire barrier> + <normal load>
to make the reordering stop.
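Spelled out as an instruction sequence, the read half of the proposed 128-bit ll/sc loop would then look something like this (a sketch in the thread's notation; I'm assuming sc.q reports success or failure in its first operand, as sc.d does in its rd):

```asm
retry:
    ll.acq.d  lo, addr + 0   # <memory barrier> + <linked load> + <acquire barrier>
    ld.d      hi, addr + 8   # normal load, ordered after the ll by the acquire
    # ... compute the new lo/hi pair ...
    sc.q      lo, hi, addr   # conditional 128-bit store; lo becomes 0 on failure
    beqz      lo, retry      # retry if the reservation was lost
```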
Thanks for pointing it out, it helps the community write good code.
Strict:

```asm
ll.d  lo, addr + 0
dbar  load/load (0b10101)
ld.d  hi, addr + 8
sc.q  lo, hi, addr
```
Using dbar 0x700 instead of dbar load/load creates optimization possibilities for the microarchitecture. This is safe for existing hardware and will be clarified in the specification in the future. (We might be able to treat the same address as the same cache line. :smile:)
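With that relaxation, the strict sequence above would presumably become (a sketch, same notation):

```asm
    ll.d  lo, addr + 0
    dbar  0x700              # load-load ordering, same-address hint
    ld.d  hi, addr + 8
    sc.q  lo, hi, addr
```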
So IIUC for a LA664 (or any CPU enumerating CPUCFG word 3 bit 23), two loads against the same cache line won't be reordered?
For hardware LD_SEQ_SA, I can't confirm it now.
As GCC 14 stage 1 has ended now, deferring this to GCC 15.
My current understanding (maybe incorrect):
- 16-byte atomic load and store: vld / vst (seemingly atomic for aligned accesses; still needs more testing and/or confirmation) plus dbar.
- 16-byte RMW operations:
- 16-byte CAS operation:
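For the CAS case, one way to combine the expected sc.q usage pattern with the compare step might be the following (a sketch only: the operand conventions and the success flag in sc.q's first operand are assumptions carried over from the sequences above, t0 and the labels are illustrative, and the barrier needed on the failure path is omitted):

```asm
# hypothetical 16-byte compare-and-swap
# inputs: addr, exp_lo/exp_hi (expected), new_lo/new_hi (desired)
retry:
    ll.d   lo, addr + 0
    dbar   0x700             # or dbar load/load where same-address ordering is unconfirmed
    ld.d   hi, addr + 8
    bne    lo, exp_lo, fail  # compare the low half
    bne    hi, exp_hi, fail  # compare the high half
    move   t0, new_lo        # copy: sc.q overwrites its first operand with the status
    sc.q   t0, new_hi, addr  # conditional 128-bit store; t0 becomes 0 on failure
    beqz   t0, retry
fail:
```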