riscvarchive / riscv-zacas

Memory ordering for failed AMOCAS #20

Closed: hboehm closed this issue 1 year ago

hboehm commented 1 year ago

I'm not sure this needs to be, or should be, addressed, but I wanted to make sure people are aware of the issue:

In C++ a compare_exchange that fails is treated as a load for memory model purposes. "If the operation returns true, these operations are atomic read-modify-write operations (6.9.2) on the memory pointed to by this. Otherwise, these operations are atomic load operations on that memory." C++ load operations don't provide release ordering (which wouldn't make sense in that model). Thus there is no requirement that prior ordinary data operations be ordered with respect to even a seq_cst compare_exchange that fails.
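
To make this concrete, a minimal C++ sketch (the function name is illustrative): even with a seq_cst success ordering, the standard permits the failure path to be as weak as relaxed.

    #include <atomic>

    std::atomic<int> x{0};

    bool try_update(int& expected, int desired) {
        // On failure this is an atomic load: the failure ordering
        // (memory_order_relaxed here) applies, and no release ordering
        // is implied, even though the success ordering is seq_cst.
        return x.compare_exchange_strong(expected, desired,
                                         std::memory_order_seq_cst,
                                         std::memory_order_relaxed);
    }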

However, IIUC, with the Zacas proposal, any release ordering specified for AMOCAS also applies to the load in the failing case, when there is no store. So a much stronger ordering is enforced for the failing case than C++ requires.

In most use cases we would hope that AMOCAS succeeds nearly all of the time. And on failures, we would commonly end up waiting to retry anyway, making this less important. But I suspect there are occasional exceptions, where the CAS is used to replace a fixed value, and failure is more common. There may also be cases in which we can do other useful work after a failure before retrying, so failure performance matters.

We believe that ARM handles this differently, so RISC-V is somewhat breaking new ground here, though x86 presumably has even stronger semantics. Was the difference here considered/intentional? 

I'm not sure this is currently wrong. The current wording certainly seems more consistent with the rest of the architecture. And I think it successfully avoids touching the Ztso spec. But it just seemed a bit risky to me, and I wanted to make sure your group is aware of the issue.

ved-rivos commented 1 year ago

The current wording follows the rest of the architecture, where the load has release semantics when aq=rl=1. I am not sure this is necessarily wrong though. For cases where the operation is expected to almost always fail, it may be okay to set rl=0 and use a fence rw, w in the rare case where the operation happens to succeed. @aswaterman @gfavor - what is your opinion?

aswaterman commented 1 year ago

My sense is that we should not be unduly strict here, and that suggests that we should offer a CAS that has SC semantics when successful and only acquire semantics when unsuccessful. (Assuming my interpretation of C++ is correct, that is.) I suppose that means that, if we define the aq=rl=1 versions to have this behavior, then anyone who really wants a fully SC version can obtain it by executing a FENCE on the failure path.
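
Roughly, in C++ terms (a sketch of the idea, not a proposed mapping):

    #include <atomic>

    std::atomic<int> x{0};

    bool cas_fully_sc(int& expected, int desired) {
        bool ok = x.compare_exchange_strong(expected, desired,
                                            std::memory_order_seq_cst,
                                            std::memory_order_acquire);
        if (!ok) {
            // Failure path: a full fence recovers the ordering that the
            // rl bit would have provided had the CAS succeeded.
            std::atomic_thread_fence(std::memory_order_seq_cst);
        }
        return ok;
    }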

ved-rivos commented 1 year ago

Suggest we define the AMOCAS release consistency semantics as follows:

aswaterman commented 1 year ago

Sounds right to me... @hboehm?

hboehm commented 1 year ago

In principle, that sounds right to me. It may be a bit tricky to get the wording right, and it would be good to get one or more of the original RVWMO authors to chime in.

I believe the RVWMO intent is that acquire-release annotations are associated with "memory operations". At least PPO clauses 5-7 treat them that way. And "An aligned AMO gives rise to a single memory operation that is both a load operation and a store operation simultaneously." So I don't think acquire/release is specifically associated with a load or store. So I think the actual formulation should be more like:

If the aq bit is set then the memory operation performed by AMOCAS has acquire semantics. (Or, more accurately, has an RCsc acquire annotation.)

If the rl bit is set, and if the compare operation succeeds, then the memory operation performed by AMOCAS has release semantics. (Or, more accurately, has an RCsc release annotation.)

And I think that's actually it. I don't think RVWMO deals with "sequentially consistent" in this form. I would make any such statement non-normative, though I realize we say that elsewhere.

I think there's generally a subtle distinction between AMO.. semantics and LR/SC semantics which makes statements about sequential consistency less informative than people might think: I believe the two memory operations in

mem-op1; AMO...aqrl; mem-op2

are ordered (and hence AMO...aqrl seems to act as a fence), while those in

mem-op1; LR.aq; SC.rl; mem-op2

are not. Either version suffices for C++ memory_order_seq_cst, and to ensure that programs using only sequentially consistent operations appear sequentially consistent. But they're not the same; the semantics differ once you mix in non-sequentially-consistent operations. And I think the Linux kernel sometimes wants the stronger semantics. There are some issues with the stronger semantics in a C++ context, but I think at the architecture level we're basically trading off additional ordering properties against an (unknown to me, and possibly small) amount of performance.

aswaterman commented 1 year ago

An aligned AMO gives rise to a single memory operation that is both a load operation and a store operation simultaneously

Even setting aside the aq/rl matter, this raises the question of whether we want AMOCAS to be treated in the same way as other AMOs. For purposes of other PPO rules, should we not say that a failed AMOCAS (with aq=rl=0) is only guaranteed to be ordered as a load? Or does that open a can of worms I'm not thinking of?

ved-rivos commented 1 year ago

I think the annotations are the aq and rl bits. Acquire semantics can be enforced using the acquire annotation and release semantics using the release annotation. So the wordings "acquire semantics" and "release semantics" are right. Let me know if you agree.

The A-extension extensively discusses sequential consistency, but I agree that neither the RVWMO section nor its explanatory section does. I also agree we do not need it for this discussion.

While the RVWMO section states "is a load and a store simultaneously", I did not take that to mean that there is a new type of memory operation that is both a load and a store, because an AMO by definition is a load-op-store. I interpret that wording as referring to the instruction and not the operations performed by the instruction. As an instruction, it performs a load and a store simultaneously. In the case of AMOCAS, the instruction performs a load and a store simultaneously when successful and just a load when not successful.

If this interpretation is right then we can reword as follows:

  • The memory operation performed by AMOCAS, when successful, has acquire semantics if aq bit is 1 and has release semantics if rl bit is 1.
  • The memory operation performed by AMOCAS, when not successful, has acquire semantics if aq bit is 1.

Is this better?

gfavor commented 1 year ago

On Tue, Jul 25, 2023, 5:51 PM ved-rivos wrote:

If this interpretation is right then we can reword as follows:

  • The memory operation performed by AMOCAS, when successful, has acquire semantics if aq bit is 1 and has release semantics if rl bit is 1.
  • The memory operation performed by AMOCAS, when not successful, has acquire semantics if aq bit is 1.

Is this better?

Looks good to me. And I agree with the interpretation.

Greg

gfavor commented 1 year ago

On Tue, Jul 25, 2023, 5:39 PM Andrew Waterman wrote:

Even setting aside the aq/rl matter, this raises the question of whether we want AMOCAS to be treated in the same way as other AMOs. For purposes of other PPO rules, should we not say that a failed AMOCAS (with aq=rl=0) is only guaranteed to be ordered as a load? Or does that open a can of worms I'm not thinking of.

This seems especially important to specify for the sake of interactions with FENCE instructions, i.e., does AMOCAS always get ordered by fences as a store (as well as a load), or only when it succeeds?

Off-hand maybe the latter is more appropriate given that we're already differentiating wrt the rl semantic. So, should we not be consistent in this regard? Or is there a software use case that would want store ordering wrt a FENCE even if the cas fails?

Greg

ved-rivos commented 1 year ago

I think of a fence as ordering the memory accesses produced by instructions, not the instructions themselves. If a fence orders memory reads and writes - including the memory read and the possible memory write generated by an amocas - then it does not need to be aware of whether the amocas produced a memory write or not.

ved-rivos commented 1 year ago

Perhaps a more explicit statement? An amocas generates a memory read access and, if successful, generates a memory write access. This should also address a later load that reads the location that may be written by the amocas.

aswaterman commented 1 year ago

Off-hand maybe the latter is more appropriate given that we're already differentiating wrt the rl semantic.

I was thinking along the same lines.

An amocas generates a memory read access and, if successful, generates a memory write access.

Yeah, something like that.

Separate from memory ordering, I think we should make two points clear. The amocas may give rise to a write (of the old value) even if the comparison fails; this permits implementations to perform store side effects or kill load reservations even if the comparison fails. Also, write permissions are required regardless of success.

ved-rivos commented 1 year ago

How about expanding as follows:

An amocas generates a memory read access and, if successful, generates a memory write access to store the swap value. An unsuccessful amocas may either not perform a memory write or may write back the old value loaded from memory. An amocas instruction always requires write permissions.

The amocas optionally provides release consistency semantics, using the aq and rl bits, to help implement multiprocessor synchronization. The memory operation performed by an amocas, when successful, has acquire semantics if aq bit is 1 and has release semantics if rl bit is 1. The memory operation performed by an amocas, when not successful, has acquire semantics if aq bit is 1.
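
As a sanity check of this wording, a rough GCC-style inline-assembly sketch of a fully annotated 32-bit CAS (the operand constraints are illustrative assumptions, not a proposed ABI):

    #include <cstdint>

    // amocas.w takes the compare value in rd, the swap value in rs2, and
    // the address in rs1; rd receives the atomically loaded value.
    static inline bool cas32(std::uint32_t* addr, std::uint32_t expected,
                             std::uint32_t desired) {
        std::uint32_t prev = expected;
        __asm__ volatile("amocas.w.aqrl %0, %2, (%1)"
                         : "+&r"(prev)
                         : "r"(addr), "r"(desired)
                         : "memory");
        // Per the wording above, aq orders both outcomes; rl takes effect
        // only when the comparison succeeds and the write is performed.
        return prev == expected;
    }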

aswaterman commented 1 year ago

Seems like it's shaping up. Maybe add something like "but does not have release semantics, regardless of rl" to the last sentence for additional clarity.

ved-rivos commented 1 year ago

Please see PR #21 - https://github.com/riscv/riscv-zacas/pull/21/commits/27bd1a52130c04daf6d9f5181a7d7a5c61504c5c

aswaterman commented 1 year ago

We should wait for @hboehm to opine, too.

hboehm commented 1 year ago

I mostly like the last formulation, though I'm nervous about "always requires write permissions".

Are there other architectures with that requirement? This is unnatural for the compare_exchange definition in the C++ standard. It doesn't clearly violate the standard, since the standard doesn't talk about access permissions. It's a bit of a corner case, but it's not hard to think of cases in which it would be exercised. (E.g. one thread updates obsolete data in a memory region with a CAS to check for the obsolete data, while a concurrent thread is reading the memory and using CAS to check for and update individual obsolete entries on demand.) If ARM or x86 do this, we're probably fairly safe. If not, I wouldn't be surprised if this breaks C or C++ code somewhere, and it may be fairly hard to defend. So I'm strongly in favor of following established practice here, but I'm not positive what it is.

Are there other respects in which allowing a failed CAS to perform a write is programmer-observable?

We're still having an ongoing internal discussion about the interaction of AMOCAS with the syntactic dependency definitions in the RVWMO. At a minimum, the use of rd as an input register probably needs to be explicitly reflected there. I'm not yet sure whether there may be other issues.

aswaterman commented 1 year ago

@ved-rivos I had thought that x86 enforces write perms, but I also figured you’d know for sure.

gfavor commented 1 year ago

On Wed, Jul 26, 2023, 11:29 PM ved-rivos wrote:

I think of a fence as ordering memory accesses produced by instructions and not instructions.

I agree. I was simply being terse and less precise.

If a fence orders memory reads and writes - including the memory read and any memory writes generated by amocas - then it does not need to be aware of whether the amocas produced a memory write or not.

This argues for one answer to my question, but this is essentially the argument implicit in our choice for there to be release semantics only if the AMOCAS succeeds and hence officially does a memory write. Which then feeds into the leaning to be consistent here (for FENCEs) with that choice (for .aq/.rl).

From following posts it sounds like we are all converging on that "consistent" choice for FENCE interactions.

Greg

gfavor commented 1 year ago

On Wed, Jul 26, 2023 at 1:23 PM ved-rivos wrote:

An amocas generates a memory read access and, if successful, generates a memory write access. An unsuccessful amocas may either not perform a memory write or may write back the old value loaded from memory. An amocas instruction always requires write permissions.

The amocas.w/d/q optionally provide release consistency semantics, using the aq and rl bits, to help implement multiprocessor synchronization. The memory operation performed by amocas, when successful, has acquire semantics if aq bit is 1 and has release semantics if rl bit is 1. The memory operation performed by amocas, when not successful, has acquire semantics if aq bit is 1.

One concern is that the first paragraph talks about AMOCAS generating one or two memory accesses, and then the second paragraph talks about AMOCAS performing one memory operation. And in an earlier post we were talking about FENCE interactions, saying that those should be conceived in terms of the semantic memory accesses and not the memory operation.

This is probably going to confuse some people. We should talk about .aq/.rl semantics and fence-related semantics in a consistent way.

Maybe, after the release consistency paragraph, say something like:

Similarly, when the memory operation performed by amocas is successful, it is considered to atomically perform both memory read and memory write accesses for ordering purposes with respect to FENCE instructions. And when the memory operation is not successful, it is considered to only perform a memory read access.

Then, following that, one can state the non-normative comment Andrew suggested about an implementation being allowed to always do a memory write access - with the old memory value if the amocas operation did not succeed. (I think this needs to be an explicit non-normative note given all the normative text about a failing amocas not doing a memory write.)

Greg

gfavor commented 1 year ago

On Wed, Jul 26, 2023 at 6:29 PM Hans Boehm wrote:

I mostly like the last formulation, though I'm nervous about "always requires write permissions".

I tried looking at the ARMv8 ARM, but it doesn't make it easy to figure out when the CAS instructions require write permission. The x86 CMPXCHG16B description is vague about the MMU permission check, but takes a GP exception "If the destination is located in a non-writable segment." Which makes me suspect that it also always checks for MMU write permission.

Implementation-wise it would seem pretty nasty to have to first go perform the amocas memory operation to find out if it succeeds or fails before then being able to know whether to take an exception on a write permission violation. Especially the further out in the cache/memory hierarchy that the operation is actually performed, and especially if (in ARM terms) the point of coherency is the L2 cache and not the L1's. So I would strongly argue for always requiring write permission (which I'm suspecting is what both x86 and ARM do).

Greg

ved-rivos commented 1 year ago

I had thought that x86 enforces write perms, but I also figured you’d know for sure.

That's right. CMPXCHG* on x86 always produces a read and a write. The destination operand is written back if the comparison fails; otherwise the source operand is written back. The processor never produces a locked read without also producing a locked write.

CAS (and all AMOs) in ARM check for write permissions - see here

ved-rivos commented 1 year ago

Maybe, after the release consistency paragraph, say something like:

Could I suggest a slightly modified statement: A FENCE instruction may be used to order the memory read access and the memory write access, if produced, by an AMOCAS.W/D/Q instruction.

Followed by this non-normative note: An unsuccessful AMOCAS.W/D/Q may either not perform a memory write or may write back the old value loaded from memory. The memory write, if produced, does not have release semantics, regardless of rl.

Updated the PR #21 with https://github.com/riscv/riscv-zacas/commit/b8ae7f4ea6bedd7a6e1b9b7706537df1ff02e1b1

ved-rivos commented 1 year ago

At a minimum, the use of rd as an input register probably needs to be explicitly reflected there. I'm not yet sure whether there may be other issues.

The register listings in section 17.3 should be updated:

RV32:
  • AMOCAS.D - sources: rs1 (A), rs2 (D), rs2+1 (D) (unless rs2=x0), rd (D), rd+1 (D) (unless rd=x0); destinations: rd, rd+1 (unless rd=x0)
  • AMOCAS.W - sources: rs1 (A), rs2 (D), rd (D); destination: rd

RV64:
  • AMOCAS.W - sources: rs1 (A), rs2 (D), rd (D); destination: rd
  • AMOCAS.D - sources: rs1 (A), rs2 (D), rd (D); destination: rd
  • AMOCAS.Q - sources: rs1 (A), rs2 (D), rs2+1 (D) (unless rs2=x0), rd (D), rd+1 (D) (unless rd=x0); destinations: rd, rd+1 (unless rd=x0)

These instructions do not carry a dependency from any source register to any destination register.

If we agree to this I will update the text to document this as a table.

hboehm commented 1 year ago

I mentioned this issue on the WG21/SG1 C++ concurrency mailing list. Opinions there seem to be similar to here, with hardware architects in favor of requiring write permission, and the software folks not so sure.

The discussion there reminded me of a related issue that probably doesn't change the outcome here, but that we should be aware of. IIUC, this proposal introduces a 128-bit CAS while we don't have a 128-bit atomic load, as is (unfortunately?) common on other architectures. This means that a C or C++ compiler implementing 128-bit atomics has two choices: it can turn all operations, including loads, into CAS operations, or it can ignore the 128-bit CAS. The latter appears to be more popular, also because it dodges some ABI compatibility issues. For the former, the implementation has to decide whether such atomics are "lock_free". Both answers are somewhat defensible, but claiming "lock_free" now results in a lock-free atomic with a load() operation that requires write permission, which does seem bad.
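
For concreteness, a sketch of the type in question; whether this should report lock-free is exactly the judgment call:

    #include <atomic>
    #include <cstdio>

    // A 16-byte type a compiler might map onto the 128-bit CAS.
    struct alignas(16) Pair { void* lo; void* hi; };

    int main() {
        std::atomic<Pair> p{};
        // If the implementation claims lock-free status and compiles load()
        // as a 128-bit CAS, even a pure read needs write permission.
        std::printf("lock_free: %d\n", p.is_lock_free() ? 1 : 0);
    }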

ved-rivos commented 1 year ago

So the instruction should be considered a compare-and-swap and will require write permission. I am not sure it is right to consider it an atomic-load-pair instruction - which it is not intended to be.

Not having a 2*XLEN-wide load may not be an issue for the canonical use of performing a compare-and-swap. Some algorithms may load the previous data value of a memory location into the register used as the compare-data-value source by a Zacas instruction using two individual loads. The two individual loads may read an inconsistent pair of values, but that is not an issue, since the AMOCAS operation itself uses an atomic load-pair from memory to obtain the data value for its comparison. That atomically loaded value is provided in the destination register and, in case of a retry, can be used without needing individual loads for the retry.
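
A rough sketch of that pattern using the GCC __atomic builtins (assuming lock-free 128-bit support; the increment is just a placeholder update):

    // The initial plain read may be torn, but a failing CAS deposits the
    // atomically loaded 128-bit value into `old`, so the retry needs no
    // separate atomic load.
    void increment128(unsigned __int128* p) {
        unsigned __int128 old = *p;  // two individual loads; may be inconsistent
        while (!__atomic_compare_exchange_n(p, &old, old + 1, /*weak=*/true,
                                            __ATOMIC_SEQ_CST, __ATOMIC_ACQUIRE)) {
            // `old` now holds the value the CAS loaded atomically; retry.
        }
    }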

gfavor commented 1 year ago

Although use of AMOCAS doesn't need an atomic load-pair, from Hans' last post it sounds likely that RISC-V C and C++ compilers won't support 128-bit atomics.

We should at least recognize if that will be the case and accept the consequence (or motivate an atomic load-pair extension).

Hans, would atomic load-pair be the only gap standing in the way of RISC-V 128-bit atomics support? Also, how big a deal would the lack of 128-bit atomics support be?

Greg

aswaterman commented 1 year ago

Ideally, we’d get away without adding atomic load pair. But if we did decide to do so, it would behoove us to at least contemplate adding atomic store pair, too. Although the latter can be synthesized with a CAS-pair loop, it’s not terribly efficient.
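
For reference, the CAS-loop synthesis of a 128-bit atomic store would look roughly like this (GCC builtins again, as a sketch):

    // Synthesizing an atomic 128-bit store from a CAS loop. Every failed
    // iteration pays a full CAS, hence the efficiency concern.
    void store128(unsigned __int128* p, unsigned __int128 v) {
        unsigned __int128 old = *p;  // initial guess; may be torn
        while (!__atomic_compare_exchange_n(p, &old, v, /*weak=*/true,
                                            __ATOMIC_RELEASE, __ATOMIC_RELAXED)) {
            // `old` was refreshed with the current value; retry.
        }
    }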

ved-rivos commented 1 year ago

If the atomic-load-pair is to feed the amocas source then it can be accomplished as:

    li a0, 0
    li a1, 0
    amocas.q a0, x0, (a3)

This will compare 0 to the memory location and, if successful, store 0 to that memory location, else not; in all cases it returns the atomically loaded old value. However, this does not avoid the write-permission check, which should be acceptable if the operation is to feed a subsequent amocas that intends to do a swap. I thought Hans's comment was about an atomic-load-pair by itself.

ved-rivos commented 1 year ago

Here is the C++ code I see generated for 128-bit CAS on x86 and ARM: https://gcc.godbolt.org/z/r5M743W6z

Is there a reason C++ may not similarly support 128-bit CAS with amocas.q on RISC-V?

gfavor commented 1 year ago

Maybe a question to explicitly ask: Hans' reference to compiler support for 128-bit atomics encompasses what range of operations? (I've been assuming more than just cas and loads. Hence why I asked about any other ISA gaps that would hinder the above compiler support - which sounds problematic without atomic load-pairs.)

Greg

gfavor commented 1 year ago

And to be clear (about my understanding of Hans' post), cas doesn't need load-pair, but compiler support for 128-bit atomics does.

Greg

hboehm commented 1 year ago

Yes. Clean compiler support for 128-bit atomics requires a 128-bit load.

But I'm actually not certain that's sufficient at this point. We still have the problem that code using such a translation scheme is not compatible with code compiled for older versions of the architecture, without the 128-bit CAS, which has to use a lock. This means it is no longer safe to link code compiled for earlier architectures against code compiled for the current version. That's very different from the 64-bit case, where earlier LR/SC code should interoperate with newer CAS code. I have to check again, but I think current compilers disagree about whether they use the 128-bit CAS at all, as a result of this issue.

In the absence of this ABI issue, I would think that a 128-bit load is far more important than anything else. Stores would be nice, but compiling a store as a CAS doesn't introduce unexpected cache conflicts, as it does for loads. If lots of threads write to the same 128-bit atomic, coherence misses are to be expected. If lots of threads read from the same location, they're not. Write permission is similar. From what I've seen, on smaller systems at least, emulating other RMW operations with CAS often doesn't involve a huge overhead.

ved-rivos commented 1 year ago

clang seems to produce 128-bit CAS for x86 and ARM. icc does for x86 as well. Both require -mcx16. gcc used to produce it until gcc 7.1, but not since; many have complained about this breaking their code, but gcc seems to have taken a more conservative stand. gcc does still produce it for the __sync_bool_compare_and_swap builtin. Frameworks such as DPDK, boost, libatomic_ops, and the Linux kernel have provided their own implementations of 128-bit CAS. So the gcc situation seems somewhat unfortunate.

Any further comments on the PR about memory ordering?

ved-rivos commented 1 year ago

Thanks all. I will merge the PR now.