Closed: avorima closed this pull request 1 month ago
Some thoughts:

> // If there is a concurrency issue, it will very likely become visible here.

The map is accessed by `GetCurrentRequestByDatacenter` and `DeleteCurrentRequestByDatacenter`, and also by `SetCurrentRequestByDatacenter`, though that one was not part of the data race you linked. All 3 have locking in place. If there's nothing wrong with the `Lock` and `Unlock` calls, then the locking package itself might be the problem.

`sync.Mutex` might not have that issue. But the locker package is still used in other places, and you didn't point out a fundamental issue with it in your PR.

`sync.Mutex` doesn't work in this scenario. While the unit tests are fine with it (that's a fixable weakness), the problem is that the map can be accessed by concurrent `IonosCloudMachine` reconciliations. Each of them has its own `scope.Cluster`, and with this PR each also has its own mutex. But the very same mutex (or whatever synchronization primitive is used) must be shared by all `scope.Cluster` instances. That's why the locker used to live in the machine reconciler itself and was just passed into the `scope.Cluster` instances.

You're right, it's an issue with the locker. I get `no such lock` panics in both the scope and locker tests. One time the `TestLockerContextDeadlineExceeded` test also failed when asserting that the locks map is empty (it wasn't), but that test actually looks wrong in general. It doesn't seem to happen reliably below `-count=100` on my machine.
Issues
- 0 New issues
- 0 Accepted issues

Measures
- 0 Security Hotspots
- 0.0% Coverage on New Code
- 0.0% Duplication on New Code
@piepmatz I adjusted the locker to fix the `Lock` behavior on failed context and the timing of releasing the lock.

I ran these tests for some time on my machine and never got any test failures (`go test -race -count=1000`). As a control, I moved the channel read in the unlock back to its original position and immediately got a `no such lock` panic.
What is the purpose of this pull request/Why do we need it?
Prevents two locking-related race conditions:
- `Unlock()` allowed `Lock()` to proceed too early, leading to a panic on the next `Unlock()` call
- cancelling `Lock()` via context could lead to a panic on the next `Unlock()` call

Issue #, if available:

Description of changes:

Special notes for your reviewer:
I noticed a failed run in a different PR: https://github.com/ionos-cloud/cluster-api-provider-ionoscloud/actions/runs/10078133906/job/27862335344?pr=190
I noticed a failed run in a different PR: https://github.com/ionos-cloud/cluster-api-provider-ionoscloud/actions/runs/10078133906/job/27862335344?pr=190
Output
```
WARNING: DATA RACE
Write at 0x00c000372060 by goroutine 118:
  runtime.mapdelete_faststr()
      /opt/hostedtoolcache/go/1.22.5/x64/src/runtime/map_faststr.go:301 +0x0
  github.com/ionos-cloud/cluster-api-provider-ionoscloud/scope.(*Cluster).DeleteCurrentRequestByDatacenter()
      /home/runner/work/cluster-api-provider-ionoscloud/cluster-api-provider-ionoscloud/scope/cluster.go:199 +0x291
  github.com/ionos-cloud/cluster-api-provider-ionoscloud/scope.TestCurrentRequestByDatacenterAccessors.func1()
      /home/runner/work/cluster-api-provider-ionoscloud/cluster-api-provider-ionoscloud/scope/cluster_test.go:384 +0x352
  github.com/ionos-cloud/cluster-api-provider-ionoscloud/scope.TestCurrentRequestByDatacenterAccessors.gowrap1()
      /home/runner/work/cluster-api-provider-ionoscloud/cluster-api-provider-ionoscloud/scope/cluster_test.go:389 +0x61

Previous read at 0x00c000372060 by goroutine 122:
  runtime.mapaccess2_faststr()
      /opt/hostedtoolcache/go/1.22.5/x64/src/runtime/map_faststr.go:108 +0x0
  github.com/ionos-cloud/cluster-api-provider-ionoscloud/scope.(*Cluster).GetCurrentRequestByDatacenter()
      /home/runner/work/cluster-api-provider-ionoscloud/cluster-api-provider-ionoscloud/scope/cluster.go:174 +0x2ef
  github.com/ionos-cloud/cluster-api-provider-ionoscloud/scope.TestCurrentRequestByDatacenterAccessors.func1()
      /home/runner/work/cluster-api-provider-ionoscloud/cluster-api-provider-ionoscloud/scope/cluster_test.go:372 +0xd7
  github.com/ionos-cloud/cluster-api-provider-ionoscloud/scope.TestCurrentRequestByDatacenterAccessors.gowrap1()
      /home/runner/work/cluster-api-provider-ionoscloud/cluster-api-provider-ionoscloud/scope/cluster_test.go:389 +0x61

Goroutine 118 (running) created at:
  github.com/ionos-cloud/cluster-api-provider-ionoscloud/scope.TestCurrentRequestByDatacenterAccessors()
      /home/runner/work/cluster-api-provider-ionoscloud/cluster-api-provider-ionoscloud/scope/cluster_test.go:369 +0x36a
  testing.tRunner()
      /opt/hostedtoolcache/go/1.22.5/x64/src/testing/testing.go:1689 +0x21e
  testing.(*T).Run.gowrap1()
      /opt/hostedtoolcache/go/1.22.5/x64/src/testing/testing.go:1742 +0x44

Goroutine 122 (running) created at:
  github.com/ionos-cloud/cluster-api-provider-ionoscloud/scope.TestCurrentRequestByDatacenterAccessors()
      /home/runner/work/cluster-api-provider-ionoscloud/cluster-api-provider-ionoscloud/scope/cluster_test.go:369 +0x36a
  testing.tRunner()
      /opt/hostedtoolcache/go/1.22.5/x64/src/testing/testing.go:1689 +0x21e
  testing.(*T).Run.gowrap1()
      /opt/hostedtoolcache/go/1.22.5/x64/src/testing/testing.go:1742 +0x44
==================
panic: no such lock: uid/currentRequestByDatacenter

goroutine 146 [running]:
github.com/ionos-cloud/cluster-api-provider-ionoscloud/internal/util/locker.(*Locker).Unlock(0xc00046f7a0, {0xc0003ae5a0, 0x1e})
	/home/runner/work/cluster-api-provider-ionoscloud/cluster-api-provider-ionoscloud/internal/util/locker/locker.go:130 +0x1cc
github.com/ionos-cloud/cluster-api-provider-ionoscloud/scope.(*Cluster).GetCurrentRequestByDatacenter(0xc00049b000, {0x244ccc2, 0x2})
	/home/runner/work/cluster-api-provider-ionoscloud/cluster-api-provider-ionoscloud/scope/cluster.go:175 +0x36f
github.com/ionos-cloud/cluster-api-provider-ionoscloud/scope.TestCurrentRequestByDatacenterAccessors.func1(0xc0003584e0, {0x244ccc2, 0x2})
	/home/runner/work/cluster-api-provider-ionoscloud/cluster-api-provider-ionoscloud/scope/cluster_test.go:372 +0xd8
created by github.com/ionos-cloud/cluster-api-provider-ionoscloud/scope.TestCurrentRequestByDatacenterAccessors in goroutine 76
	/home/runner/work/cluster-api-provider-ionoscloud/cluster-api-provider-ionoscloud/scope/cluster_test.go:369 +0x36b
```

I also did some fly-by changes that were highlighted by my IDE.
Checklist: