kubernetes-sigs / prow

Prow is a Kubernetes-based CI/CD system developed to serve the Kubernetes community. This repository contains the Prow source code and the Hugo sources for the Prow documentation site.
https://docs.prow.k8s.io

`prow-controller-manager` (plank) concurrent map write error #288

Open smg247 opened 3 hours ago

smg247 commented 3 hours ago

Recently, we have noticed quite a few restarts of our prow-controller-manager instance due to a concurrent map write error. An excerpt from the logs:

{"component":"prow-controller-manager","file":"sigs.k8s.io/prow/pkg/kube/config.go:47","func":"sigs.k8s.io/prow/pkg/kube.kubeConfigs","level":"info","msg":"Parsed kubeconfig context: build11","severity":"info","time":"2024-10-03T19:02:00Z"}
{"component":"prow-controller-manager","file":"sigs.k8s.io/prow/pkg/kube/config.go:127","func":"sigs.k8s.io/prow/pkg/kube.LoadClusterConfigs","level":"info","msg":"Loading kubeconfig from: \"/etc/build-farm-credentials/sa.prow-controller-manager.vsphere02.config\"","severity":"info","time":"2024-10-03T19:02:00Z"}
{"component":"prow-controller-manager","file":"sigs.k8s.io/prow/pkg/kube/config.go:47","func":"sigs.k8s.io/prow/pkg/kube.kubeConfigs","level":"info","msg":"Parsed kubeconfig context: vsphere02","severity":"info","time":"2024-10-03T19:02:00Z"}
fatal error: concurrent map writes

goroutine 1379 [running]:
sigs.k8s.io/controller-runtime/pkg/cache.defaultOpts(_, {0xc08dc48e70, 0xc000205c70, {0x2a81690, 0xc013c1aaf0}, 0x0, 0x0, 0xc08de9dce0, {0x0, 0x0}, ...})
    sigs.k8s.io/controller-runtime@v0.17.6/pkg/cache/cache.go:488 +0xa65
sigs.k8s.io/controller-runtime/pkg/cache.New(0xc03d7dab48, {0xc08dc48e70, 0xc000205c70, {0x2a81690, 0xc013c1aaf0}, 0x0, 0x0, 0xc08de9dce0, {0x0, 0x0}, ...})
    sigs.k8s.io/controller-runtime@v0.17.6/pkg/cache/cache.go:307 +0x78
sigs.k8s.io/controller-runtime/pkg/cluster.New(0xc03d7da908, {0xc08dc56d58, 0x1, 0x0?})
    sigs.k8s.io/controller-runtime@v0.17.6/pkg/cluster/cluster.go:201 +0x322
sigs.k8s.io/prow/pkg/flagutil.(*KubernetesOptions).BuildClusters.func1({_, _}, {{0xc07158e7c0, 0x34}, {0x0, 0x0}, {{0x0, 0x0}, {0x0, 0x0}, ...}, ...})
    sigs.k8s.io/prow/pkg/flagutil/kubernetes_cluster_clients.go:389 +0x1a9
created by sigs.k8s.io/prow/pkg/flagutil.(*KubernetesOptions).BuildClusters in goroutine 1
    sigs.k8s.io/prow/pkg/flagutil/kubernetes_cluster_clients.go:383 +0x2ef

goroutine 1 [semacquire]:
sync.runtime_Semacquire(0xc08de9df50?)
    runtime/sema.go:62 +0x25
sync.(*WaitGroup).Wait(0xc07bd07558?)
    sync/waitgroup.go:116 +0x48
sigs.k8s.io/prow/pkg/flagutil.(*KubernetesOptions).BuildClusters(0xc0008b60e8, 0x0?, {0xc073ad7080, 0x6, 0x6}, 0x26b3410, {0xc00086bb9e, 0x2}, {0x0, 0x0, ...})
    sigs.k8s.io/prow/pkg/flagutil/kubernetes_cluster_clients.go:421 +0x52a
main.main()
    sigs.k8s.io/prow/cmd/prow-controller-manager/main.go:170 +0x926

goroutine 5 [select]:
go.opencensus.io/stats/view.(*worker).start(0xc000072e00)
    go.opencensus.io@v0.24.0/stats/view/worker.go:292 +0x9f
created by go.opencensus.io/stats/view.init.0 in goroutine 1
    go.opencensus.io@v0.24.0/stats/view/worker.go:34 +0x8d

goroutine 7 [chan receive]:
knative.dev/pkg/metrics.(*metricsWorker).start(...)
    knative.dev/pkg@v0.0.0-20240416145024-0f34a8815650/metrics/metrics_worker.go:99
created by knative.dev/pkg/metrics.init.0 in goroutine 1
    knative.dev/pkg@v0.0.0-20240416145024-0f34a8815650/metrics/exporter.go:39 +0xad

goroutine 90 [chan receive]:
sigs.k8s.io/prow/pkg/interrupts.handleInterrupt()
    sigs.k8s.io/prow/pkg/interrupts/interrupts.go:62 +0x85
created by sigs.k8s.io/prow/pkg/interrupts.init.0 in goroutine 1
    sigs.k8s.io/prow/pkg/interrupts/interrupts.go:41 +0xb2
```

This is clearly a race condition; it does not happen on every pod restart.
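
For illustration, here is a minimal sketch (not Prow's actual code) of the pattern the trace points at: `BuildClusters` fans out one goroutine per kubeconfig context, each goroutine ends up in controller-runtime's `cache.defaultOpts`, and that defaulting step writes into a map shared across the goroutines. The `options`/`byObject` types below are hypothetical stand-ins for controller-runtime's `cache.Options`; running this with `go run -race` reports the concurrent write.

```go
// Minimal reproduction of the suspected pattern, not Prow's actual code:
// several goroutines share one options value containing a map, and a
// "defaulting" step writes into that map concurrently.
package main

import "sync"

// byObject and options are hypothetical stand-ins for controller-runtime's
// cache.Options / cache.ByObject.
type byObject struct {
	Namespaces map[string]struct{}
}

type options struct {
	// ByObject is shared by reference when the same options value is
	// handed to every per-cluster goroutine.
	ByObject map[string]byObject
}

// defaultOpts mimics a defaulting step that writes defaults back into the
// shared map, which is unsafe without a lock.
func defaultOpts(o options) {
	for k, v := range o.ByObject {
		v.Namespaces = map[string]struct{}{"default": {}}
		o.ByObject[k] = v // concurrent map write when called from many goroutines
	}
}

func main() {
	shared := options{ByObject: map[string]byObject{"pods": {}}}
	var wg sync.WaitGroup
	for i := 0; i < 6; i++ { // one goroutine per build cluster context
		wg.Add(1)
		go func() {
			defer wg.Done()
			defaultOpts(shared) // can trigger: fatal error: concurrent map writes
		}()
	}
	wg.Wait()
}
```

If that reading is right, a plausible mitigation would be to give each goroutine its own deep copy of the options, or to serialize cluster construction behind a `sync.Mutex`, rather than sharing one value by reference.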

smg247 commented 3 hours ago

/kind bug