soundcloud / roshi

Roshi is a large-scale CRDT set implementation for timestamped events.
BSD 2-Clause "Simplified" License
3.15k stars 154 forks source link

Failed in testing repairs #45

Open yujunz opened 4 years ago

yujunz commented 4 years ago

I tried to port roshi to go 1.14 but failed in some tests related to repair

https://travis-ci.com/github/yujunz/roshi/builds/176022842#L377

=== RUN   TestAllRepairs
    repair_strategies_test.go:34: pre-repair: cluster 2: only got 0 responses
    repair_strategies_test.go:34: pre-repair: cluster 3: only got 0 responses
--- FAIL: TestAllRepairs (0.00s)
=== RUN   TestRateLimitedRepairs
    repair_strategies_test.go:87: post-repair: cluster 0: has [{Key:foo Score:2.3 Member:delta} {Key:foo Score:2.2 Member:beta} {Key:foo Score:2.1 Member:alpha}]
    repair_strategies_test.go:87: post-repair: cluster 1: has [{Key:foo Score:2.2 Member:beta} {Key:foo Score:2.1 Member:alpha} {Key:foo Score:1.3 Member:delta}]
    repair_strategies_test.go:87: post-repair: cluster 2: has [{Key:foo Score:2.2 Member:beta} {Key:foo Score:2.1 Member:alpha}]
    repair_strategies_test.go:92: post-repair: cluster 2: expected [{Key:foo Score:2.2 Member:beta} {Key:foo Score:2.1 Member:alpha} {Key:foo Score:1.3 Member:delta}], got [{Key:foo Score:2.2 Member:beta} {Key:foo Score:2.1 Member:alpha}]
    repair_strategies_test.go:87: post-repair: cluster 3: has [{Key:foo Score:2.2 Member:beta} {Key:foo Score:2.1 Member:alpha}]
    repair_strategies_test.go:92: post-repair: cluster 3: expected [{Key:foo Score:2.2 Member:beta} {Key:foo Score:2.1 Member:alpha} {Key:foo Score:1.3 Member:delta}], got [{Key:foo Score:2.2 Member:beta} {Key:foo Score:2.1 Member:alpha}]
    repair_strategies_test.go:87: post-repair: cluster 4: has [{Key:foo Score:2.2 Member:beta} {Key:foo Score:2.1 Member:alpha} {Key:foo Score:1.3 Member:delta}]
--- FAIL: TestRateLimitedRepairs (0.00s)
=== RUN   TestExplodingGoroutines
--- PASS: TestExplodingGoroutines (0.03s)
=== RUN   TestMakeSet
--- PASS: TestMakeSet (0.00s)
=== RUN   TestAddHas
--- PASS: TestAddHas (0.00s)
=== RUN   TestAddMany
--- PASS: TestAddMany (0.00s)
=== RUN   TestOrderedLimitedSlice
--- PASS: TestOrderedLimitedSlice (0.00s)
FAIL
FAIL    github.com/yujunz/roshi/farm    0.052s

However, running test locally did PASS. What could be the problem?

yujunz commented 4 years ago

I repeated the test for several times and can reproduce the errors as above. There is also one crash observed

=== RUN   TestAllRepairs
fatal error: concurrent map writes

goroutine 190 [running]:
runtime.throw(0x12eae35, 0x15)
    /usr/local/Cellar/go/1.14.4/libexec/src/runtime/panic.go:1116 +0x72 fp=0xc0001a5df8 sp=0xc0001a5dc8 pc=0x1035092
runtime.mapassign_faststr(0x12984c0, 0xc000109680, 0x12e6813, 0x3, 0xc0001eca48)
    /usr/local/Cellar/go/1.14.4/libexec/src/runtime/map_faststr.go:211 +0x3f7 fp=0xc0001a5e60 sp=0xc0001a5df8 pc=0x1014677
github.com/yujunz/roshi/farm.(*mockCluster).Insert(0xc000109650, 0xc000109830, 0x1, 0x1, 0xc00009c2a0, 0x1506b00)
    /Users/yujunz/Code/github.com/soundcloud/roshi/farm/mock_cluster_test.go:109 +0x290 fp=0xc0001a5f10 sp=0xc0001a5e60 pc=0x1250580
github.com/yujunz/roshi/farm.(*Farm).Insert.func1(0x1347d40, 0xc000109650, 0xc000109830, 0x1, 0x1, 0x152b8d0, 0x0)
    /Users/yujunz/Code/github.com/soundcloud/roshi/farm/farm.go:65 +0x4f fp=0xc0001a5f50 sp=0xc0001a5f10 pc=0x1259abf
github.com/yujunz/roshi/farm.(*Farm).write.func2(0xc000026600, 0x12f79f8, 0xc000109830, 0x1, 0x1, 0x1347d40, 0xc000109650)
    /Users/yujunz/Code/github.com/soundcloud/roshi/farm/farm.go:125 +0x62 fp=0xc0001a5fa8 sp=0xc0001a5f50 pc=0x1259c92
runtime.goexit()
    /usr/local/Cellar/go/1.14.4/libexec/src/runtime/asm_amd64.s:1373 +0x1 fp=0xc0001a5fb0 sp=0xc0001a5fa8 pc=0x1066d71
created by github.com/yujunz/roshi/farm.(*Farm).write
    /Users/yujunz/Code/github.com/soundcloud/roshi/farm/farm.go:124 +0x214

goroutine 1 [chan receive]:
testing.(*T).Run(0xc00012c900, 0x12e8bd6, 0xe, 0x12f7a78, 0x1083301)
    /usr/local/Cellar/go/1.14.4/libexec/src/testing/testing.go:1043 +0x37e
testing.runTests.func1(0xc00012c480)
    /usr/local/Cellar/go/1.14.4/libexec/src/testing/testing.go:1284 +0x78
testing.tRunner(0xc00012c480, 0xc00013fe10)
    /usr/local/Cellar/go/1.14.4/libexec/src/testing/testing.go:991 +0xdc
testing.runTests(0xc00000e380, 0x14fda80, 0x14, 0x14, 0x0)
    /usr/local/Cellar/go/1.14.4/libexec/src/testing/testing.go:1282 +0x2a7
testing.(*M).Run(0xc000148000, 0x0)
    /usr/local/Cellar/go/1.14.4/libexec/src/testing/testing.go:1199 +0x15f
main.main()
    _testmain.go:82 +0x135

goroutine 189 [runnable]:
github.com/yujunz/roshi/farm.TestAllRepairs(0xc00012c900)
    /Users/yujunz/Code/github.com/soundcloud/roshi/farm/repair_strategies_test.go:37 +0x537
testing.tRunner(0xc00012c900, 0x12f7a78)
    /usr/local/Cellar/go/1.14.4/libexec/src/testing/testing.go:991 +0xdc
created by testing.(*T).Run
    /usr/local/Cellar/go/1.14.4/libexec/src/testing/testing.go:1042 +0x357

Could it be a problem in the mocked cluster?

yujunz commented 4 years ago

It seems caused by race condition in map. Fixed in 96986754212a9d15784a52a04fd7a00887b30b6a by adding mutex.

matthiasr commented 3 years ago

There is something more going on here. I enabled running the tests in GitHub actions and these tests kept failing (#49). I disabled them, merged, and then realized that I hadn't turned on the race detector (which was on before). Weirdly, with the race detector on, they pass (#50)! Locally for me, they also pass without race detection. Maybe the tests are still timing sensitive, and fail if they run too fast?