bboreham opened this issue 3 years ago
I'm new, but let me take a look. Where can I see the test results, or did you see this on your machine? How frequently does it fail? Can I run "make test" multiple times in parallel to speed up reproduction? @bboreham
Can I run "make test" multiple times in parallel to speed up reproduction?
Run the specific go test with -count=100 -race
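For example, an invocation along these lines hammers a single test under the race detector (the test name here is only a placeholder for whichever test is flaking):

	go test -race -count=100 -run TestName ./...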
Investigation so far into the race condition:
func (m *Memberlist) Members() []*Node
The function returns a list of pointers to member Nodes; prometheus alertmanager's cluster.go uses data from the returned list. A comment above the function says that callers must not write through the pointers, since the data is protected by the Memberlist mutex. The issue is that there is at least one situation in which even reads become inconsistent: if a node rejoins the memberlist with the same Name but a different Address or Port, after being dead or after having left, its entry may be reclaimed and the new data is written into the same pointer.
I don't know how serious this issue is. To trigger it, the memberlist has to be configured with DeadNodeReclaimTime > 0, and a node with a different address has to leave (or be marked dead) and then rejoin.
Strictly speaking, the Members function is broken: it should return a copy of the data, not pointers to data protected by a mutex.
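For illustration only, here is a minimal sketch of what such a copying variant could look like, built on the same internals (m.nodeLock, m.nodes, n.DeadOrLeft()) that the MemberAddresses helper at the end of this issue uses; membersCopy is a hypothetical name, not an existing API:

	// membersCopy is a hypothetical copy-returning variant of Members().
	func (m *Memberlist) membersCopy() []Node {
		m.nodeLock.RLock()
		defer m.nodeLock.RUnlock()

		nodes := make([]Node, 0, len(m.nodes))
		for _, n := range m.nodes {
			if !n.DeadOrLeft() {
				// Copy the Node value while holding the lock, so a later
				// address reclaim cannot rewrite what the caller reads.
				nodes = append(nodes, n.Node)
			}
		}
		return nodes
	}

Note this is only a shallow copy: slice fields such as Addr and Meta would still share backing arrays with the internal state, so a fully defensive version would need to copy those as well.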
Proof: the test below yields this output (even on the latest version):
Node address = 127.0.0.11 *Node = 0xc00007a180
Node address = 127.0.0.11 *Node = 0xc00007a180
127.0.0.112021/11/11 14:29:31 [ERR] memberlist: Failed to send ping: write udp 127.0.0.11:37259->127.0.0.10:37259: use of closed network connection
127.0.0.112021/11/11 14:29:31 [ERR] memberlist: Failed to send gossip to 127.0.0.10:37259: write udp 127.0.0.11:37259->127.0.0.10:37259: use of closed network connection
127.0.0.102021/11/11 14:29:31 [DEBUG] memberlist: Stream connection from=127.0.0.1:49450
127.0.0.122021/11/11 14:29:31 [DEBUG] memberlist: Initiating push/pull sync with: 127.0.0.10 127.0.0.10:37259
127.0.0.102021/11/11 14:29:31 [DEBUG] memberlist: Stream connection from=127.0.0.1:49452
127.0.0.102021/11/11 14:29:31 [INFO] memberlist: Updating address for left or failed node 127.0.0.11 from 127.0.0.11:37259 to 127.0.0.12:37259
Node address = 127.0.0.12 *Node = 0xc00007a180
Node address = 127.0.0.12 *Node = 0xc00007a180
func TestMemberlist_JoinDataRace(t *testing.T) {
	c1 := testConfig(t)
	c1.DeadNodeReclaimTime = 1 * time.Nanosecond
	m1, err := Create(c1)
	require.NoError(t, err)
	defer m1.Shutdown()
	bindPort := m1.config.BindPort

	// Create a second node
	c2 := testConfig(t)
	c2.BindPort = bindPort
	m2, err := Create(c2)
	require.NoError(t, err)
	//defer m2.Shutdown()
	fmt.Println("KRAJO1: ", c1.Name)
	fmt.Println("KRAJO2: ", c2.Name)

	// Continuously read m1's member list in the background and print the
	// pointer of every node other than m1 itself.
	shutdownChannel := make(chan int)
	lister := func() int {
		for {
			select {
			case x := <-shutdownChannel:
				return x * 0
			default:
				members := m1.Members()
				for _, v := range members {
					if v.Name != "127.0.0.10" {
						fmt.Printf("Node address = %v *Node = %p\n", v.Addr, v)
					}
				}
			}
		}
	}
	shutdownLister := func() {
		shutdownChannel <- 0
	}
	defer shutdownLister()
	go lister()

	num, err := m2.Join([]string{m1.config.Name + "/" + m1.config.BindAddr})
	if num != 1 {
		t.Fatalf("unexpected 1: %d", num)
	}
	if err != nil {
		t.Fatalf("unexpected err: %s", err)
	}

	// Check the hosts
	if len(m2.Members()) != 2 {
		t.Fatalf("should have 2 nodes! %v", m2.Members())
	}
	if m2.estNumNodes() != 2 {
		t.Fatalf("should have 2 nodes! %v", m2.Members())
	}
	yield()
	<-time.After(1 * time.Second) // receive so the wait actually happens

	m2.Leave(1 * time.Second)
	m2.Shutdown()

	// Re-create the second node with a different address
	c3 := testConfig(t)
	c3.Name = c2.Name // force reclaim
	c3.BindPort = bindPort
	m3, err := Create(c3)
	require.NoError(t, err)
	defer m3.Shutdown()
	numB, errB := m3.Join([]string{m1.config.Name + "/" + m1.config.BindAddr})
	if numB != 1 {
		t.Fatalf("unexpected 1: %d", numB)
	}
	if errB != nil {
		t.Fatalf("unexpected err: %s", errB)
	}
	<-time.After(5 * time.Second) // give the lister time to observe the reclaimed node
}
As far as I can tell, the worst that can happen is that the information prometheus's cluster.go uses will be newer than expected. Even if the data is garbled, it is not used directly. https://github.com/prometheus/alertmanager/blob/fad796931b792fc30f35bb18a580ae7323ef0241/cluster/cluster.go#L448
I did reproduce the race, but with a slightly more direct method:
WARNING: DATA RACE
Write at 0x00c0000a07a8 by goroutine 48:
github.com/hashicorp/memberlist.(*Memberlist).aliveNode()
/home/krajo/go/github.com/hashicorp/memberlist/state.go:1114 +0x2666
github.com/hashicorp/memberlist.(*Memberlist).mergeState()
/home/krajo/go/github.com/hashicorp/memberlist/state.go:1305 +0x284
github.com/hashicorp/memberlist.(*Memberlist).mergeRemoteState()
/home/krajo/go/github.com/hashicorp/memberlist/net.go:1174 +0x194
github.com/hashicorp/memberlist.(*Memberlist).handleConn()
/home/krajo/go/github.com/hashicorp/memberlist/net.go:277 +0x94a
github.com/hashicorp/memberlist.(*Memberlist).streamListen·dwrap·19()
/home/krajo/go/github.com/hashicorp/memberlist/net.go:213 +0x58
Previous read at 0x00c0000a07a8 by goroutine 30:
github.com/hashicorp/memberlist.(*Node).Address()
/home/krajo/go/github.com/hashicorp/memberlist/state.go:43 +0x6a
github.com/hashicorp/memberlist.TestMemberlist_JoinDataRace.func1()
/home/krajo/go/github.com/hashicorp/memberlist/memberlist_test.go:2122 +0xb5
github.com/hashicorp/memberlist.TestMemberlist_JoinDataRace·dwrap·123()
/home/krajo/go/github.com/hashicorp/memberlist/memberlist_test.go:2133 +0x39
Goroutine 48 (running) created at:
github.com/hashicorp/memberlist.(*Memberlist).streamListen()
/home/krajo/go/github.com/hashicorp/memberlist/net.go:213 +0x44
github.com/hashicorp/memberlist.newMemberlist·dwrap·4()
/home/krajo/go/github.com/hashicorp/memberlist/memberlist.go:216 +0x39
Goroutine 30 (running) created at:
github.com/hashicorp/memberlist.TestMemberlist_JoinDataRace()
/home/krajo/go/github.com/hashicorp/memberlist/memberlist_test.go:2133 +0x68b
testing.tRunner()
/usr/local/go/src/testing/testing.go:1259 +0x22f
testing.(*T).Run·dwrap·21()
/usr/local/go/src/testing/testing.go:1306 +0x47
==================
--- FAIL: TestMemberlist_JoinDataRace (1.26s)
testing.go:1152: race detected during execution of test
FAIL
Test code:
func TestMemberlist_JoinDataRace(t *testing.T) {
	c1 := testConfig(t)
	c1.DeadNodeReclaimTime = 1 * time.Nanosecond
	m1, err := Create(c1)
	require.NoError(t, err)
	defer m1.Shutdown()
	bindPort := m1.config.BindPort

	// Create a second node
	c2 := testConfig(t)
	c2.BindPort = bindPort
	m2, err := Create(c2)
	require.NoError(t, err)
	//defer m2.Shutdown()

	num, err := m2.Join([]string{m1.config.Name + "/" + m1.config.BindAddr})
	if num != 1 {
		t.Fatalf("unexpected 1: %d", num)
	}
	if err != nil {
		t.Fatalf("unexpected err: %s", err)
	}

	// Check the hosts
	if len(m2.Members()) != 2 {
		t.Fatalf("should have 2 nodes! %v", m2.Members())
	}
	if m2.estNumNodes() != 2 {
		t.Fatalf("should have 2 nodes! %v", m2.Members())
	}
	yield()
	<-time.After(1 * time.Second) // receive so the wait actually happens

	// Capture the member pointers once, then keep reading through them in the
	// background while the cluster changes underneath.
	shutdownChannel := make(chan int)
	members := m1.Members()
	lister := func() int {
		for {
			select {
			case x := <-shutdownChannel:
				return x * 0
			default:
				// members := m1.Members()
				for _, v := range members {
					yield()
					if v.Address() == "" {
						panic("cannot be")
					}
				}
			}
		}
	}
	shutdownLister := func() {
		shutdownChannel <- 0
	}
	defer shutdownLister()
	go lister()

	m2.Leave(1 * time.Second)
	m2.Shutdown()

	// Re-create the second node with a different address
	c3 := testConfig(t)
	c3.Name = c2.Name // force reclaim
	c3.BindPort = bindPort
	m3, err := Create(c3)
	require.NoError(t, err)
	defer m3.Shutdown()
	numB, errB := m3.Join([]string{m1.config.Name + "/" + m1.config.BindAddr})
	if numB != 1 {
		t.Fatalf("unexpected 1: %d", numB)
	}
	if errB != nil {
		t.Fatalf("unexpected err: %s", errB)
	}
	<-time.After(5 * time.Second) // keep the lister running while the reclaim happens
}
The above race condition can be circumvented in our case with something like:
func (m *Memberlist) MemberAddresses() []string {
	m.nodeLock.RLock()
	defer m.nodeLock.RUnlock()

	addresses := make([]string, 0, len(m.nodes))
	for _, n := range m.nodes {
		if !n.DeadOrLeft() {
			addresses = append(addresses, n.Node.Address())
		}
	}
	return addresses
}
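With such a helper, the background reader from the tests above could hold plain strings instead of shared *Node pointers; a minimal sketch for illustration, reusing the lister pattern and the shutdownChannel from the test code:

	lister := func() int {
		for {
			select {
			case x := <-shutdownChannel:
				return x * 0
			default:
				// The strings were copied out while the lock was held, so a
				// concurrent address reclaim cannot change them afterwards.
				for _, addr := range m1.MemberAddresses() {
					if addr == "" {
						panic("cannot be")
					}
				}
			}
		}
	}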
Opened https://github.com/hashicorp/memberlist/issues/250 to get some feedback on this issue.
The test reports a data race; it also runs for four minutes and consumes more than 500 MB of RAM.