sync/atomic: add OR/AND operators for unsigned types

aktau commented 1 year ago

Update, Aug 16 2023: Current proposal at https://github.com/golang/go/issues/61395#issuecomment-1671759854.

Original related proposal: https://github.com/golang/go/issues/31748

Use case: we have types with methods that set a value. These methods manipulate a bitset indicating that the value was set, which is used (e.g.) for data serialization. Users of this API know to use a lock to manage concurrently reading/writing the same field, but they are allowed to concurrently write to different fields. Given that the bitsets are logically shared between different fields, we must manipulate them atomically. Currently that takes the form of a CAS loop:

type MyStruct struct{
  x int32
  y int32
  // ...

  present uint32 // Contains "present" bits for 32 fields.
}

func (m *MyStruct) SetX(v int32) {
  m.x = v
  setPresent(&m.present[0], 1)
}

func (m *MyStruct) SetY(v int32) {
  m.y = v
  setPresent(&m.present[0], 2)
}

func setPresent(part *uint32, num uint32) {
    for {
        old := atomic.LoadUint32(part)
        swapped := atomic.CompareAndSwapUint32(part, old, old|(1<<(num%32)))
        if swapped {
            return
        }
        // Yield and then try the swap again.
        runtime.Gosched()
    }
}

// similar for clearPresent, but with AND.

But, on x86-64, there are atomic versions of AND/OR that do this in one go, as mentioned in https://github.com/golang/go/issues/31748. Using this would not only make the setters faster, it would likely also allow inlining them: setPresent is too complex to inline.

cc @aclements @ianlancetaylor @randall77

rsc commented 1 year ago

But, on x86-64, there are atomic versions of AND/OR that do this in one go, as mentioned in https://github.com/golang/go/issues/31748. Using this would not only make the setters faster, it would likely also allow inlining them: setPresent is too complex to inline.

setPresent should not be calling runtime.Gosched. There's nothing another goroutine is going to do to help move things along (unlike, say, waiting for a mutex to be unlocked).

If you remove the Gosched, it should be inlinable. At least, -m says this is inlinable:

func orcas(x *uint32, bit uint32) {
    for {
        old := atomic.Load(x)
        if atomic.Cas(x, old, old|bit) {
            break
        }
    }
}

I wrote this in runtime/atomic_test.go:

package runtime_test

import (
    "runtime/internal/atomic"
    "testing"
)

func BenchmarkCas(b *testing.B) {
    var x uint32
    for i := 0; i < b.N; i++ {
        for {
            old := atomic.Load(&x)
            if atomic.Cas(&x, old, old|1) {
                break
            }
        }
    }
}

func BenchmarkOr(b *testing.B) {
    var x uint32
    for i := 0; i < b.N; i++ {
        atomic.Or(&x, 1)
    }
}

The result is:

BenchmarkCas-16                     154634635            7.429 ns/op
BenchmarkOr-16                      270709120            4.462 ns/op

So this would be an x86-only optimization that wins less than 2X. I'm not sure this really makes sense given how unusual it typically is. The main argument I can see is inlinability, but it seems to be inlinable already if written efficiently.

aktau commented 1 year ago

I was wondering about runtime.Gosched() as well, I couldn't figure out why it would help. Thanks for highlighting that it is indeed unnecessary.

So this would be an x86-only optimization that wins less than 2X. I'm not sure this really makes sense given how unusual it typically is. The main argument I can see is inlinability, but it seems to be inlinable already if written efficiently.

A 2x win may still be significant, but I'll know more when I add a better benchmarking setup. But also I found LDSET/LDSETA/LDSETAL/LDSETL for ARM, quoting:

Atomic bit set on word or doubleword in memory atomically loads a 32-bit word or 64-bit doubleword from memory, performs a bitwise OR with the value held in a register on it, and stores the result back to memory. The value initially loaded from memory is returned in the destination register.

I think the "atomically" may be referring to the entire operation. To be honest the mnemonic doesn't make it sound like this is a bitwise or. Perhaps an ARM expert could chime in.

rsc commented 1 year ago

Thanks for the link to LDSETAL. That looks like it might be good enough, although we would have to think carefully about whether the acquire-release semantics it guarantees matches the sequentially consistent semantics we want for sync/atomic when limited to the OR operation. Perhaps it does but perhaps not.

aclements commented 1 year ago

But also I found LDSET/LDSETA/LDSETAL/LDSETL for ARM

That's correct. ARMV8.1 added instructions for atomic And (also, Min/Max, and, for some reason, Xor). LDCLR, LDCLRA, LDCLRAL, LDCLRL is the equivalent for atomic And.

aclements commented 1 year ago

A 2x win may still be significant

I'll note that there was also no contention in @rsc's benchmark. Typically, CAS loops collapse far worse under contention than direct atomic operations.

rsc commented 1 year ago

This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings. — rsc for the proposal review group

zephyrtronium commented 1 year ago

This might be a stretch, but another option could be to rewrite the simplest form of the CAS loop to the corresponding atomic operation on platforms that support it, similar to how "intrinsics" for encoding/binary work. That way old code using the thing that works now benefits as well. New API would still be convenient, but I think atomic flag sets that want it are rare, and if it's intrinsified then it can live in any package. I am given to understand this would be harder than existing rewrite rules, though, because it spans more than a single basic block. It may also be hard to formulate proofs that the rewrite preserves semantics.

aclements commented 1 year ago

This might be a stretch, but another option could be to rewrite the simplest form of the CAS loop to the corresponding atomic operation on platforms that support it

@randall77 actually prototyped exactly this. I'm not necessarily opposed to having these rewrite rules, but I don't think they're a substitute for just giving people the API that they need. Often, when writing low-level code like this, the developer wants to say what they mean and have a good sense of what's actually going on, so having to express this operation through a rather opaque rewrite rule seems too subtle.

rsc commented 1 year ago

Based on the discussion above, this proposal seems like a likely accept. — rsc for the proposal review group

rsc commented 1 year ago

No change in consensus, so accepted. 🎉 This issue now tracks the work of implementing the proposal. — rsc for the proposal review group

mateusz834 commented 1 year ago

What new API additions were proposed here?

rsc commented 1 year ago

I guess we all assumed we knew what the API was but forgot to write it down. Oops! I think the new API is take anything that says Add in the current sync/atomic package and s/Add/And/ and also s/Add/Or/. So there would be both new top-level functions and new methods on Int32, Int64, Uint32, Uint64, and Uintptr.

mauri870 commented 1 year ago

Anyone currently working on this one? I've made some progress already and would like to finish the work. If that is ok please feel free to assign the issue to me. Thanks.

mauri870 commented 1 year ago

I see that the current internal/atomic Or/And don't return a new value as opposed to the Add apis.

If we want to keep the same api style as Add we need to change Or/And to return a new value instead of applying the bitwise operation inplace.

Just want to confirm that this is indeed an expected outcome, I see that it is used in a couple places in runtime and atomic tests.

randall77 commented 1 year ago

We should make the new API for sync/atomic consistent with what is already in sync/atomic. So And/Or should return the new value.

If that makes it inconsistent with runtime/internal/atomic, so be it. We can always clean up runtime/internal/atomic later. (Or maybe you are saying we should make them consistent as part of this change so we can share SSA ops like ssa.AtomicOr8? That would be fine.)

One thing that bugs me a bit: there is a choice to return the new value or the old value. With Add the choice doesn't really matter, as you can always derive the other choice by subtracting/adding the arg you passed to Add. But that doesn't work with And/Or as they are not reversible. You can get the new value from the old value, but not vice versa. I guess we should think a bit about what that return value might be used for. (I guess if you really needed to know the old value you could use a CAS loop instead.)

aclements commented 1 year ago

It seems really unfortunate to have these operations destroy useful information by returning the new value. In isolation, clearly we would want them to return the old value, but that would be confusingly inconsistent with the rest of sync/atomic.

These operations could return both the old and new value, like:

func AndUint32(addr *uint32, mask uint32) (old, new uint32)
func (x *Uint32) And(mask uint32) (old, new uint32)

This would allow us to return the old value, while making the return signature unmistakably different from other functions in sync/atomic to avoid confusion. I'm sure most uses would assign one result or the other (or possibly both) to _, and the compiler could easily optimize away the unnecessary parts of the operation. In general it can be easy to mistake the order of multiple results of the same type, but in this case the result order is consistent with many other norms like left-to-right timelines and LTR writing.

This reminds me a lot of GCC's atomic intrinsics, where several have both an __atomic_OP_fetch and an __atomic_fetch_OP variant that differ in whether they return the old value or the new value. In Go we don't have the limitation of returning just one or the other.

The one downside I see to returning both old and new is that these functions couldn't be used as part of a larger expression. Users could of course wrap it in their own operation that returned just one result or the other, and use that in expression contexts.

mauri870 commented 1 year ago

If that makes it inconsistent with runtime/internal/atomic, so be it. We can always clean up runtime/internal/atomic later. (Or maybe you are saying we should make them consistent as part of this change so we can share SSA ops like ssa.AtomicOr8? That would be fine.)

I haven't consider touching the SSA. I'm not really familiar with it to pull this off. My idea is to implement Or32/Or64 and And32/And64 in internal/atomic to serve as basis for the sync/atomic work. In order to not mess up the current Or/And and variants I will leave them untouched for now.

These operations could return both the old and new value, like:

I like this approach, it is unfortunate that it prevents its use in larger expressions, at least this approach is explicit about the returns so people don't accidentally use old for new and vice versa. Another option is just to stick with the api we currently have for add, returning the new value only

aclements commented 1 year ago

I like this approach, it is unfortunate that it prevents its use in larger expressions

Looking at the handful of uses in the runtime, in fact, none of them use the result at all. So I'm not too worried about the inability to use these operations in a larger expression.

at least this approach is explicit about the returns so people don't accidentally use old for new and vice versa.

+1

mauri870 commented 1 year ago

@randall77 Thoughts on returning old and new as suggested by @aclements? I would like to agree on a return value(s) in order to start working on the other architectures. Also /cc @rsc

randall77 commented 1 year ago

Sounds fine to me.

randall77 commented 1 year ago

As far as implementation goes:

On x86, for Add we use LOCK+XADD (eXchange and add) to both get the old value and do the add. There is no corresponding X* instruction for And and Or, so if the user wants any return value (either old or new) we'll need a CAS loop to get it. If the user doesn't need either return value we can use LOCK+AND/OR on a memory location.

On arm64, our strategy for Add depends on a runtime check of whether we have advanced atomics (required starting in arm spec 8.4). Without them, we use a LDAXR/STLXR loop. With them, we use the atomic instruction LDADDAL which returns the old value from memory. There are equivalent instructions for AND and OR. So arm64 seems more straightforward.

ianlancetaylor commented 1 year ago

To spell it out, since the whole point of these is efficiency on real hardware, and since we care about x86, that suggests that these functions should not return any value at all.

People who need the value have other options.

bcmills commented 1 year ago

FWIW, the CAS loop in cmd/go/internal/modload does use the “old” value from an OR operation.

(But the fact that that loop already exists supports Ian's statement that “People who need the value have other options.” 🙃)

rsc commented 1 year ago

I am not entirely convinced that "no return values" is the right answer. A blind OR and AND does not let provide any way to understand the specific action that happened. If you are doing something like "race to set a bit and the first to set it does extra actions", you need the result. In fact most uses I can imagine of atomic OR and AND would need that result.

Even C++, which typically overfits to hardware far more than Go does, returns an atomically obtained value: https://en.cppreference.com/w/cpp/atomic/atomic/fetch_and https://en.cppreference.com/w/cpp/atomic/atomic_fetch_and (I don't understand the difference between the two, but they definitely both return the value involved.)

I think we'd want to be very sure of ourselves before prioritizing efficiency over API usability more than C++ does.

And and Or would have to deviate from Add by returning the old value, which carries more information than the new value.

rsc commented 1 year ago

Also, what does C++ do with these on x86? I tried Godbolt here but it calls a library routine for both calls. One option would be to define that they return the old value and call a library routine by default (which would have to CAS), but if we see code not using the return value, we can use the x86 AND/OR atomic instruction directly. Essentially all calls to these routines will be direct calls, so the optimization will be easy to do.

For concreteness, I think the APIs we should be considering are:

func AndInt32(addr *int32, bits int32) (old int32)
func AndInt64(addr *int64, bits int64) (old int64)
func AndUint32(addr *uint32, bits uint32) (old uint32)
func AndUint64(addr *uint64, bits uint64) (old uint64)
func AndUintptr(addr *uintptr, bits uintptr) (old uintptr)

func (*Int32) And(bits int32) (old int32)
func (*Int64) And(bits int64) (old int64)
func (*Uint32) And(bits uint32) (old uint32)
func (*Uint64) And(bits uint64) (old uint64)
func (*Uintptr) And(bits uintptr) (old uintptr)

func OrInt32(addr *int32, bits int32) (old int32)
func OrInt64(addr *int64, bits int64) (old int64)
func OrUint32(addr *uint32, bits uint32) (old uint32)
func OrUint64(addr *uint64, bits uint64) (old uint64)
func OrUintptr(addr *uintptr, bits uintptr) (old uintptr)

func (*Int32) Or(bits int32) (old int32)
func (*Int64) Or(bits int64) (old int64)
func (*Uint32) Or(bits uint32) (old uint32)
func (*Uint64) Or(bits uint64) (old uint64)
func (*Uintptr) Or(bits uintptr) (old uintptr)

rsc commented 1 year ago

Moving back to active discussion.

aclements commented 1 year ago

In fact most uses I can imagine of atomic OR and AND would need that result.

I agree that we should return the old value because there are uses for that, but I will say that none of the uses in the runtime use the result.

For concreteness, I think the APIs we should be considering are:

@rsc, what did you think of my suggestion that we return both old and new, to avoid confusion with other operations in sync/atomic that return just new?

bcmills commented 1 year ago

what did you think of https://github.com/golang/go/issues/61395#issuecomment-1666647948 that we return both old and new, to avoid confusion with other operations in sync/atomic that return just new?

FWIW, the existing Swap methods do return only old. 🤔

rsc commented 1 year ago

I had missed that suggestion, but I like @bcmills's observation that Swap only returns old. If you sit and think for a second, Swap can only possibly return old, of course. The same is true for And and Or. I think we made a mistake with Add, but oh well. I think one return is still probably best.

rsc commented 1 year ago

This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings. — rsc for the proposal review group

iBug commented 1 year ago

Returning a single value is definitely the right way to go for obvious reasons. And I believe returning the old value provides the best versatility: Whoever needs the new value can take the operand and do a non-atomic operation locally.

Worth noting, that Add is different from And/Or in a significant way: With a given operand, Add is invertible, while And/Or are not. This makes the choice between returning the new or the old value less of a problem for Add.

aktau commented 1 year ago

As long as users that don't need the return value get the efficient (single instruction on x86) encoding, this seems fine.

mauri870 commented 1 year ago

From what I understood this needs to be implemented with a CAS loop if we are going to return the old value.

Sorry about my bad assembly skills, but I came up with this:

// func And64(addr *uint64, v uint64) uint64
TEXT ·And64(SB), NOSPLIT, $0-24
    MOVQ    ptr+0(FP), BX
    MOVQ    val+8(FP), CX
casloop:
    MOVQ    CX, DX
    MOVQ    (BX), AX
    ANDQ    AX, DX
    LOCK
    CMPXCHGQ    DX, (BX)
    JNZ casloop
    MOVQ    AX, ret+16(FP)
    RET

Maybe if performance is really a concern we should expose the current And Or apis as is, ie not return any value.

randall77 commented 1 year ago

From what I understood this needs to be implemented with a CAS loop if we are going to return the old value.

Correct.

Maybe if performance is really a concern we should expose the current And Or apis as is, ie not return any value.

We can certainly detect in the compiler that the result is not being used, and convert to LOCK AND on memory directly.

rsc commented 1 year ago

Sounds like people are happy with https://github.com/golang/go/issues/61395#issuecomment-1671759854.

rsc commented 1 year ago

Based on the discussion above, this proposal seems like a likely accept. — rsc for the proposal review group

rsc commented 1 year ago

No change in consensus, so accepted. 🎉 This issue now tracks the work of implementing the proposal. — rsc for the proposal review group

mauri870 commented 1 year ago

Upstream patch on llvm tsan to support the race variants for the and/or operators: ~https://reviews.llvm.org/D159370~ https://github.com/llvm/llvm-project/pull/65695

I was able to compile the linux syso myself with this patch and it seems to work fine when used alongside -race.

I'll try to start pushing some patches related to this ticket in the upcoming days.

gopherbot commented 1 year ago

Change https://go.dev/cl/526656 mentions this issue: internal/atomic: add wasm And/Or operators

gopherbot commented 1 year ago

Change https://go.dev/cl/528315 mentions this issue: runtime/internal/atomic: add 386/amd64 And/Or operators

gopherbot commented 1 year ago

Change https://go.dev/cl/528797 mentions this issue: runtime/internal/atomic: add arm/arm64 operators for And/Or

gopherbot commented 1 year ago

Change https://go.dev/cl/531575 mentions this issue: runtime/internal/atomic: add riscv64 operators for And/Or

gopherbot commented 1 year ago

Change https://go.dev/cl/531716 mentions this issue: runtime/internal/atomic: add And/Or operators for ppc64x

gopherbot commented 1 year ago

Change https://go.dev/cl/531835 mentions this issue: runtime/internal/atomic: add And/Or operators for mips

cherrymui commented 1 year ago

Sorry for being late to the issue. It seems CL https://go.dev/cl/528797 etc. adds And, Or operations that returns the old value. This is inconsistent to what Add does (which returns the new value), and doesn't seem to be explicitly called out on the proposal. Why does it return the old value?

cherrymui commented 1 year ago

Okay, it is indeed discussed above. Sorry for the noise.

gopherbot commented 1 year ago

Change https://go.dev/cl/531895 mentions this issue: runtime/internal/atomic: add loong64 operators for And/Or

gopherbot commented 1 year ago

Change https://go.dev/cl/531678 mentions this issue: runtime/internal/atomic: add s390x operators for And/Or

mauri870 commented 1 year ago

Just to give an update on the progress of this task, I do have most of the code ready but since I'm sending individual CLs it is taking a slower pace to get them merged.

I got upstream LLVM's tsan updated with the new and/or ops and will proceed to generate the updated syso files for the race detector.

After completing these tasks and merging all the architecture-specific internal/atomic CLs that I have open, I will be able to proceed with the public sync/atomic APIs

gopherbot commented 12 months ago

Change https://go.dev/cl/543035 mentions this issue: runtime/race: update race syso files to support atomic And, Or

golang / go

sync/atomic: add OR/AND operators for unsigned types #61395