runtime: maps behave inconsistently on arm64 and amd64 architectures when the -race detection is not enabled

Anthony-Dong commented 1 month ago

Go version

go version go1.20 linux/amd64 go version go1.20 darwin/arm64

Output of `go env` in your module/workspace:

1. Mac-arm64(Apple M3 Pro)

GO111MODULE="on"
GOARCH="arm64"
GOBIN=""
GOCACHE="/Users/bytedance/Library/Caches/go-build"
GOENV="/Users/bytedance/Library/Application Support/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/bytedance/go/pkg/mod"
GOOS="darwin"
GOPATH="/Users/bytedance/go"
GOROOT="/Users/bytedance/software/go1.20"
GOSUMDB="sum.golang.google.cn"
GOTMPDIR=""
GOTOOLDIR="/Users/bytedance/software/go1.20/pkg/tool/darwin_arm64"
GOVCS=""
GOVERSION="go1.20"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="0"
GOMOD="/Users/bytedance/go/src/github.com/anthony-dong/golang/example/go.mod"
GOWORK=""
CGO_CFLAGS="-O2 -g"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-O2 -g"
CGO_FFLAGS="-O2 -g"
CGO_LDFLAGS="-O2 -g"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch arm64 -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/13/yc48l58j4j9_rrnff17jqdk00000gn/T/go-build1128814608=/tmp/go-build -gno-record-gcc-switches -fno-common"

linux-amd64(Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz)

GO111MODULE="on"
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/xxx/.cache/go-build"
GOENV="/home/xxx/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/xxx/go/pkg/mod"
GOOS="linux"
GOPATH="/home/xxx/go"
GOROOT="/home/xxx/software/go/go1.20"
GOTMPDIR=""
GOTOOLDIR="/home/xxx/software/go/go1.20/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.20"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/xxx/go/src/github.com/anthony-dong/golang/example/go.mod"
GOWORK=""
CGO_CFLAGS="-O2 -g"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-O2 -g"
CGO_FFLAGS="-O2 -g"
CGO_LDFLAGS="-O2 -g"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build461736185=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I have a piece of code that behaves inconsistently on Mac-Apple M3 Pro and Linux amd64 when the -race detection is not enabled. On Mac-Apple M3 Pro, it directly reports a map rw error, but on Linux amd64, it doesn't.

package test_map

import (
    "sync"
    "testing"
)

type Map struct {
    kv map[string]interface{} // 8byte
}

func (m *Map) Get(key string) interface{} {
    return m.kv[key]
}

func (m *Map) clone() map[string]interface{} {
    res := make(map[string]interface{}, len(m.kv))
    for k, v := range m.kv {
        res[k] = v
    }
    return res
}

func (m *Map) Set(mm map[string]interface{}) {
    clone := m.clone()
    for k, v := range mm {
        clone[k] = v
    }
    m.kv = clone // replace
}

func TestMapRW(t *testing.T) {
    wg := sync.WaitGroup{}
    wg.Add(11)
    kv := Map{}
    kv.Set(map[string]interface{}{"1": 1})
    for i := 0; i < 10; i++ {
        go func() {
            defer wg.Done()
            for j := 0; j < 100000; j++ {
                kv.Get("1")
            }
        }()
    }

    for i := 0; i < 1; i++ {
        go func() {
            defer wg.Done()
            for j := 0; j < 100000; j++ {
                kv.Set(map[string]interface{}{"2": 2})
            }
        }()
    }

    wg.Wait()
}

What did you see happen?

Linux

~ go test -run=TestMapRW -count=20 ./...
ok      github.com/anthony-dong/golang/example/benchmark/test_map   0.802s

Mac

~ go test -run=TestMapRW -count=20 ./...
fatal error: concurrent map read and map write

goroutine 30 [running]:
github.com/anthony-dong/golang/example/benchmark/test_map.(*Map).Get(...)
        /Users/bytedance/go/src/github.com/anthony-dong/golang/example/benchmark/test_map/map_test.go:13
github.com/anthony-dong/golang/example/benchmark/test_map.TestMapRW.func1()
        /Users/bytedance/go/src/github.com/anthony-dong/golang/example/benchmark/test_map/map_test.go:41 +0x7c
created by github.com/anthony-dong/golang/example/benchmark/test_map.TestMapRW
        /Users/bytedance/go/src/github.com/anthony-dong/golang/example/benchmark/test_map/map_test.go:38 +0x274

goroutine 1 [chan receive]:
testing.(*T).Run(0x140000829c0, {0x104994913?, 0x2b9544170e8d8?}, 0x1049f4060)
        /Users/bytedance/software/go1.20/src/testing/testing.go:1630 +0x384
testing.runTests.func1(0x140000829c0?)
        /Users/bytedance/software/go1.20/src/testing/testing.go:2036 +0x48
testing.tRunner(0x140000829c0, 0x1400014dc68)
        /Users/bytedance/software/go1.20/src/testing/testing.go:1576 +0x104
testing.runTests(0x1400012e0a0?, {0x104aa61c0, 0x1, 0x1}, {0x14000118800?, 0x40?, 0x104aae660?})
        /Users/bytedance/software/go1.20/src/testing/testing.go:2034 +0x3c4
testing.(*M).Run(0x1400012e0a0)
        /Users/bytedance/software/go1.20/src/testing/testing.go:1906 +0x530
main.main()
        _testmain.go:47 +0x1a8

goroutine 21 [semacquire]:
sync.runtime_Semacquire(0x140001169c0?)
        /Users/bytedance/software/go1.20/src/runtime/sema.go:62 +0x2c
sync.(*WaitGroup).Wait(0x1400011a120)
        /Users/bytedance/software/go1.20/src/sync/waitgroup.go:116 +0x74
github.com/anthony-dong/golang/example/benchmark/test_map.TestMapRW(0x0?)
        /Users/bytedance/go/src/github.com/anthony-dong/golang/example/benchmark/test_map/map_test.go:55 +0x358
testing.tRunner(0x14000082b60, 0x1049f4060)
        /Users/bytedance/software/go1.20/src/testing/testing.go:1576 +0x104
created by testing.(*T).Run
        /Users/bytedance/software/go1.20/src/testing/testing.go:1629 +0x370

goroutine 22 [runnable]:
github.com/anthony-dong/golang/example/benchmark/test_map.TestMapRW.func1()
        /Users/bytedance/go/src/github.com/anthony-dong/golang/example/benchmark/test_map/map_test.go:38
created by github.com/anthony-dong/golang/example/benchmark/test_map.TestMapRW
        /Users/bytedance/go/src/github.com/anthony-dong/golang/example/benchmark/test_map/map_test.go:38 +0x274

goroutine 23 [runnable]:
github.com/anthony-dong/golang/example/benchmark/test_map.TestMapRW.func1()
        /Users/bytedance/go/src/github.com/anthony-dong/golang/example/benchmark/test_map/map_test.go:38
created by github.com/anthony-dong/golang/example/benchmark/test_map.TestMapRW
        /Users/bytedance/go/src/github.com/anthony-dong/golang/example/benchmark/test_map/map_test.go:38 +0x274

goroutine 24 [runnable]:
github.com/anthony-dong/golang/example/benchmark/test_map.(*Map).Get(...)
        /Users/bytedance/go/src/github.com/anthony-dong/golang/example/benchmark/test_map/map_test.go:13
github.com/anthony-dong/golang/example/benchmark/test_map.TestMapRW.func1()
        /Users/bytedance/go/src/github.com/anthony-dong/golang/example/benchmark/test_map/map_test.go:41 +0x7c
created by github.com/anthony-dong/golang/example/benchmark/test_map.TestMapRW
        /Users/bytedance/go/src/github.com/anthony-dong/golang/example/benchmark/test_map/map_test.go:38 +0x274

goroutine 25 [runnable]:
github.com/anthony-dong/golang/example/benchmark/test_map.TestMapRW.func1()
        /Users/bytedance/go/src/github.com/anthony-dong/golang/example/benchmark/test_map/map_test.go:38
created by github.com/anthony-dong/golang/example/benchmark/test_map.TestMapRW
        /Users/bytedance/go/src/github.com/anthony-dong/golang/example/benchmark/test_map/map_test.go:38 +0x274

goroutine 26 [runnable]:
github.com/anthony-dong/golang/example/benchmark/test_map.TestMapRW.func1()
        /Users/bytedance/go/src/github.com/anthony-dong/golang/example/benchmark/test_map/map_test.go:38
created by github.com/anthony-dong/golang/example/benchmark/test_map.TestMapRW
        /Users/bytedance/go/src/github.com/anthony-dong/golang/example/benchmark/test_map/map_test.go:38 +0x274

goroutine 27 [runnable]:
github.com/anthony-dong/golang/example/benchmark/test_map.TestMapRW.func1()
        /Users/bytedance/go/src/github.com/anthony-dong/golang/example/benchmark/test_map/map_test.go:38
created by github.com/anthony-dong/golang/example/benchmark/test_map.TestMapRW
        /Users/bytedance/go/src/github.com/anthony-dong/golang/example/benchmark/test_map/map_test.go:38 +0x274

goroutine 28 [runnable]:
github.com/anthony-dong/golang/example/benchmark/test_map.(*Map).Get(...)
        /Users/bytedance/go/src/github.com/anthony-dong/golang/example/benchmark/test_map/map_test.go:13
github.com/anthony-dong/golang/example/benchmark/test_map.TestMapRW.func1()
        /Users/bytedance/go/src/github.com/anthony-dong/golang/example/benchmark/test_map/map_test.go:41 +0x7c
created by github.com/anthony-dong/golang/example/benchmark/test_map.TestMapRW
        /Users/bytedance/go/src/github.com/anthony-dong/golang/example/benchmark/test_map/map_test.go:38 +0x274

goroutine 29 [runnable]:
github.com/anthony-dong/golang/example/benchmark/test_map.TestMapRW.func1()
        /Users/bytedance/go/src/github.com/anthony-dong/golang/example/benchmark/test_map/map_test.go:38
created by github.com/anthony-dong/golang/example/benchmark/test_map.TestMapRW
        /Users/bytedance/go/src/github.com/anthony-dong/golang/example/benchmark/test_map/map_test.go:38 +0x274

goroutine 31 [runnable]:
github.com/anthony-dong/golang/example/benchmark/test_map.(*Map).Get(...)
        /Users/bytedance/go/src/github.com/anthony-dong/golang/example/benchmark/test_map/map_test.go:13
github.com/anthony-dong/golang/example/benchmark/test_map.TestMapRW.func1()
        /Users/bytedance/go/src/github.com/anthony-dong/golang/example/benchmark/test_map/map_test.go:41 +0x7c
created by github.com/anthony-dong/golang/example/benchmark/test_map.TestMapRW
        /Users/bytedance/go/src/github.com/anthony-dong/golang/example/benchmark/test_map/map_test.go:38 +0x274

goroutine 32 [runnable]:
github.com/anthony-dong/golang/example/benchmark/test_map.(*Map).clone(...)
        /Users/bytedance/go/src/github.com/anthony-dong/golang/example/benchmark/test_map/map_test.go:18
github.com/anthony-dong/golang/example/benchmark/test_map.(*Map).Set(...)
        /Users/bytedance/go/src/github.com/anthony-dong/golang/example/benchmark/test_map/map_test.go:25
github.com/anthony-dong/golang/example/benchmark/test_map.TestMapRW.func2()
        /Users/bytedance/go/src/github.com/anthony-dong/golang/example/benchmark/test_map/map_test.go:50 +0x170
created by github.com/anthony-dong/golang/example/benchmark/test_map.TestMapRW
        /Users/bytedance/go/src/github.com/anthony-dong/golang/example/benchmark/test_map/map_test.go:47 +0x2ec
FAIL    github.com/anthony-dong/golang/example/benchmark/test_map       0.723s
FAIL

What did you expect to see?

Why do different chips behave inconsistently?
If it is a memory race condition, it should report a memory access violation rather than a map rw error.

gabyhelp commented 1 month ago

Related Issues and Documentation

_{(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)}

randall77 commented 1 month ago

Why do different chips behave inconsistently?

That is strange. I can reproduce.

If it is a memory race condition, it should report a memory access violation rather than a map rw error.

I'm not sure what you mean here by a "memory access violation". Do you mean, similar to what the race detector reports?

Without -race, the detection of racy map accesses is best effort. Such racy accesses may be detected, they may not. This may be part of the explanation of the differences between architectures. Regardless, if the map implementation does detect a race it is intentional that it reports a map rw error, not any report that would come from the race detector when -race is provided. Mostly this is because we don't keep track of the stack trace of the first of the two racy accesses, so a report like the race detector provides is impossible.

Anthony-Dong commented 1 month ago

It's not easy to simulate memory access violations in Go, but it can be done using C++. Below are two pieces of code written in C++, similar to Go's unsafe replace. There will be a segmentation fault, but adding a read/write lock can solve the problem! @randall77

segmentation fault code


#include <map>
#include <thread>
#include <vector>

int main() { std::vector threads{11}; std::map<std::string, int> kvs{{"1", 1}}; for (int x = 0; x < 10; x++) { threads[x] = std::thread([&]() { for (int x = 0; x < 100000; x++) { auto _ = kvs["1"]; } }); }

threads[10] = std::thread([&]() {
    for (int x = 0; x < 100000; x++) {
        std::map<std::string, int> clone_kvs{};
        // copy
        clone_kvs.insert(kvs.begin(), kvs.end());
        // set
        clone_kvs["2"] = 2;
        // replace
        kvs = std::move(clone_kvs);
    }
});

for (auto& thread : threads) {
    thread.join();
}

}


2. right code

```cpp
#include <map>
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <vector>

int main() {
    std::shared_mutex rw_lock{};

    std::vector<std::thread> threads{11};
    std::map<std::string, int> kvs{{"1", 1}};

    for (int x = 0; x < 10; x++) {
        threads[x] = std::thread([&]() {
            for (int x = 0; x < 100000; x++) {
                {
                    std::shared_lock<std::shared_mutex> rlock{rw_lock};
                    auto _ = kvs["1"];
                }
            }
        });
    }

    threads[10] = std::thread([&]() {
        for (int x = 0; x < 100000; x++) {
            std::map<std::string, int> clone_kvs{kvs};
            // safe read
            {
                std::shared_lock<std::shared_mutex> rlock{rw_lock};
                clone_kvs.insert(kvs.begin(), kvs.end());
            }

            // set clone kvs
            clone_kvs["2"] = 2;

            // safe replace
            {
                std::unique_lock<std::shared_mutex> wlock{rw_lock};
                kvs = std::move(clone_kvs);
            }
        }
    });

    for (auto& thread : threads) {
        thread.join();
    }
}

randall77 commented 1 month ago

I'm not sure what you are proposing then.

We could of course cause a segfault instead of reporting a "rw error". But I don't think that's very helpful. Why is the former better when the latter provides a more accurate description of the user's error?

Anthony-Dong commented 1 month ago

I'm not sure what you are proposing then.

We could of course cause a segfault instead of reporting a "rw error". But I don't think that's very helpful. Why is the former better when the latter provides a more accurate description of the user's error?

Because from the perspective of the code, it is very difficult to understand that this problem is caused by concurrent access to the map resulting in an RW exception. This explanation doesn't make sense.

Hope to use more explicit exception information

prattmic commented 1 month ago

As written, this code doesn't directly race on the internals of the map. It does explicitly race on m.kv (the pointer to the map).

That is, in program order, the replacement map is fully written before being assigned to m.kv. After assignment the map is not modified.

On a system with a total store order (TSO, all stores become visible in program order, i.e., amd64), the concurrent map read and map write error will never occur because the reader will never see the hashWriting bit.

On a system without TSO (arm64), I suspect something like this occurs:

G1 in Set(), finishing up writing the last element to the new map, clone
G1 clears the writing flag: clone.flags &^= hashWriting (clone.flags == 0 now)
G1 publishes the new map m.kv = clone.
G2 reads m.kv, receiving the new map clone
G2 starts accessing the map and reads clone.flag == hashWriting.
G2 triggers concurrent read/write map error.

What happens here is that without TSO, one CPU clears hashWriting and then writes to m.kv, but the other CPU observes the write to m.kv but not the write to clone.flags. So this explains why only arm64 sees the throw.

As for why not segfault, I agree with @randall77 in not really understanding why we should do so. The fact that your C++ code segfaults is a coincidence of implementation. C++ certainly does not guarantee that racy accesses will segfault. Our code could theoretically segfault on arm64 if the reader observed clone.flags as 0, but saw some other internal state as stale in a way that confused the reading side.

In practice, on x86 I'm almost certain that 8-byte stores never tear, so readers will always see a valid pointer, though which is not guaranteed. I believe arm64 is the same, though maybe not for unaligned stores?

Given these semantics, I don't even think it would be possible to guarantee a fault on racy access without something like the full race detector in every program.

Anthony-Dong commented 1 month ago

Thanks, I was able to solve this issue using atomic.Value or sync.RWMutex. However, with the proliferation of Apple's M-series chips, such cases are indeed hard to understand because the code that previously had no issues is now problematic after switching to the M-series chips. @prattmic

Anthony-Dong commented 1 month ago

On a system with a total store order (TSO, all stores become visible in program order, i.e., amd64), the concurrent map read and map write error will never occur because the reader will never see the hashWriting bit.

On a system without TSO (arm64), I suspect something like this occurs:

G1 in Set(), finishing up writing the last element to the new map, clone

G1 clears the writing flag: clone.flags &^= hashWriting (clone.flags == 0 now)

G1 publishes the new map m.kv = clone.

G2 reads m.kv, receiving the new map clone

G2 starts accessing the map and reads clone.flag == hashWriting.

G2 triggers concurrent read/write map error.

Indeed, this is the issue. can use the following program to reproduce it.

package main

import "time"

type Data struct {
    flag int
}

func main() {
    var data = &Data{flag: 1}

    go func() {
        for {
            if data.flag != 1 {
                panic(`error`)
            }
        }
    }()

    go func() {
        for {
            clone := &Data{flag: 2}
            clone.flag = 1
            data = clone
        }
    }()

    time.Sleep(time.Second * 5)
}

prattmic commented 1 month ago

However, with the proliferation of Apple's M-series chips, such cases are indeed hard to understand because the code that previously had no issues is now problematic after switching to the M-series chips.

Indeed, this is the classic pain point when porting applications from amd64 to arm64 (or anything else, really: amd64 has the most strict memory ordering semantics. See table at https://en.wikipedia.org/wiki/Memory_ordering#In_symmetric_multiprocessing_(SMP)_microprocessor_systems).

I'm going to close this, as I don't think there is anything we can do here. Note that the race detector should catch these cases (if it does not, please reopen). If you have suggestions on how we can improve the situation, we're open to hearing them.

golang / go