Closed: avorima closed this pull request 1 month ago
Some thoughts:

> // If there is a concurrency issue, it will very likely become visible here.

The map is accessed by `GetCurrentRequestByDatacenter` and `DeleteCurrentRequestByDatacenter`, and also by `SetCurrentRequestByDatacenter`, though that one was not part of the data race you linked. All 3 have locking in place. If there's nothing wrong with the `Lock` and `Unlock` calls, then the locking package itself might be the problem.

`sync.Mutex` might not have that issue. But the locker package is still used in other places, and you didn't point out a fundamental issue with it in your PR.

`sync.Mutex` doesn't work in this scenario. While the unit tests are fine with it (that's a fixable weakness), the problem is that the map can be accessed by concurrent `IonosCloudMachine` reconciliations. Each of them has its own `scope.Cluster`, and with this PR each also has its own mutex. But the very same mutex (or whatever synchronization primitive is used) must be shared by all `scope.Cluster` instances. That's why the locker used to live in the machine reconciler itself and was just passed into the `scope.Cluster` instances.

You're right, it's an issue with the locker. I get `no such lock` panics in both the scope and locker tests. One time the `TestLockerContextDeadlineExceeded` test also failed when asserting that the locks map is empty (it wasn't), but that test actually looks wrong in general. It doesn't seem to happen reliably below `-count=100` on my machine.
Issues
- 0 New issues
- 0 Accepted issues

Measures
- 0 Security Hotspots
- 0.0% Coverage on New Code
- 0.0% Duplication on New Code
@piepmatz I adjusted the locker to fix the `Lock` behavior on failed context and the timing of releasing the lock.

I ran these tests for some time on my machine and never got any test failures (`go test -race -count=1000`). As a control, I moved the channel read in the unlock back to its original position and immediately got a `no such lock` panic.
What is the purpose of this pull request/Why do we need it?
Prevents two locking-related race conditions:
- `Unlock()` allowed `Lock()` to proceed too early, leading to a panic on the next `Unlock()` call
- cancelling `Lock()` via context could lead to a panic on the next `Unlock()` call

Issue #, if available:

Description of changes:

Special notes for your reviewer:
I noticed a failed run in a different PR: https://github.com/ionos-cloud/cluster-api-provider-ionoscloud/actions/runs/10078133906/job/27862335344?pr=190
I noticed a failed run in a different PR: https://github.com/ionos-cloud/cluster-api-provider-ionoscloud/actions/runs/10078133906/job/27862335344?pr=190
Output
```
WARNING: DATA RACE
Write at 0x00c000372060 by goroutine 118:
  runtime.mapdelete_faststr()
      /opt/hostedtoolcache/go/1.22.5/x64/src/runtime/map_faststr.go:301 +0x0
  github.com/ionos-cloud/cluster-api-provider-ionoscloud/scope.(*Cluster).DeleteCurrentRequestByDatacenter()
      /home/runner/work/cluster-api-provider-ionoscloud/cluster-api-provider-ionoscloud/scope/cluster.go:199 +0x291
  github.com/ionos-cloud/cluster-api-provider-ionoscloud/scope.TestCurrentRequestByDatacenterAccessors.func1()
      /home/runner/work/cluster-api-provider-ionoscloud/cluster-api-provider-ionoscloud/scope/cluster_test.go:384 +0x352
  github.com/ionos-cloud/cluster-api-provider-ionoscloud/scope.TestCurrentRequestByDatacenterAccessors.gowrap1()
      /home/runner/work/cluster-api-provider-ionoscloud/cluster-api-provider-ionoscloud/scope/cluster_test.go:389 +0x61

Previous read at 0x00c000372060 by goroutine 122:
  runtime.mapaccess2_faststr()
      /opt/hostedtoolcache/go/1.22.5/x64/src/runtime/map_faststr.go:108 +0x0
  github.com/ionos-cloud/cluster-api-provider-ionoscloud/scope.(*Cluster).GetCurrentRequestByDatacenter()
      /home/runner/work/cluster-api-provider-ionoscloud/cluster-api-provider-ionoscloud/scope/cluster.go:174 +0x2ef
  github.com/ionos-cloud/cluster-api-provider-ionoscloud/scope.TestCurrentRequestByDatacenterAccessors.func1()
      /home/runner/work/cluster-api-provider-ionoscloud/cluster-api-provider-ionoscloud/scope/cluster_test.go:372 +0xd7
  github.com/ionos-cloud/cluster-api-provider-ionoscloud/scope.TestCurrentRequestByDatacenterAccessors.gowrap1()
      /home/runner/work/cluster-api-provider-ionoscloud/cluster-api-provider-ionoscloud/scope/cluster_test.go:389 +0x61

Goroutine 118 (running) created at:
  github.com/ionos-cloud/cluster-api-provider-ionoscloud/scope.TestCurrentRequestByDatacenterAccessors()
      /home/runner/work/cluster-api-provider-ionoscloud/cluster-api-provider-ionoscloud/scope/cluster_test.go:369 +0x36a
  testing.tRunner()
      /opt/hostedtoolcache/go/1.22.5/x64/src/testing/testing.go:1689 +0x21e
  testing.(*T).Run.gowrap1()
      /opt/hostedtoolcache/go/1.22.5/x64/src/testing/testing.go:1742 +0x44

Goroutine 122 (running) created at:
  github.com/ionos-cloud/cluster-api-provider-ionoscloud/scope.TestCurrentRequestByDatacenterAccessors()
      /home/runner/work/cluster-api-provider-ionoscloud/cluster-api-provider-ionoscloud/scope/cluster_test.go:369 +0x36a
  testing.tRunner()
      /opt/hostedtoolcache/go/1.22.5/x64/src/testing/testing.go:1689 +0x21e
  testing.(*T).Run.gowrap1()
      /opt/hostedtoolcache/go/1.22.5/x64/src/testing/testing.go:1742 +0x44
==================
panic: no such lock: uid/currentRequestByDatacenter

goroutine 146 [running]:
github.com/ionos-cloud/cluster-api-provider-ionoscloud/internal/util/locker.(*Locker).Unlock(0xc00046f7a0, {0xc0003ae5a0, 0x1e})
	/home/runner/work/cluster-api-provider-ionoscloud/cluster-api-provider-ionoscloud/internal/util/locker/locker.go:130 +0x1cc
github.com/ionos-cloud/cluster-api-provider-ionoscloud/scope.(*Cluster).GetCurrentRequestByDatacenter(0xc00049b000, {0x244ccc2, 0x2})
	/home/runner/work/cluster-api-provider-ionoscloud/cluster-api-provider-ionoscloud/scope/cluster.go:175 +0x36f
github.com/ionos-cloud/cluster-api-provider-ionoscloud/scope.TestCurrentRequestByDatacenterAccessors.func1(0xc0003584e0, {0x244ccc2, 0x2})
	/home/runner/work/cluster-api-provider-ionoscloud/cluster-api-provider-ionoscloud/scope/cluster_test.go:372 +0xd8
created by github.com/ionos-cloud/cluster-api-provider-ionoscloud/scope.TestCurrentRequestByDatacenterAccessors in goroutine 76
	/home/runner/work/cluster-api-provider-ionoscloud/cluster-api-provider-ionoscloud/scope/cluster_test.go:369 +0x36b
```

I also did some fly-by changes that were highlighted by my IDE.
Checklist: