opencost / opencost

Cost monitoring for Kubernetes workloads and cloud costs
http://opencost.io
Apache License 2.0

concurrent map read and map write #2910

Open umats opened 2 months ago

umats commented 2 months ago

Describe the bug: OpenCost often crashes on startup on large clusters.

To Reproduce: Start an opencost pod.

Expected behavior: opencost runs.

Which version of OpenCost are you using? 1.111.0

Additional context:

fatal error: concurrent map read and map write

goroutine 461 [running]:
k8s.io/apimachinery/pkg/labels.Set.Has(0x8b4e4f?, {0xc004d0cd1c?, 0x1?})
    /home/runner/go/pkg/mod/k8s.io/apimachinery@v0.30.2/pkg/labels/labels.go:53 +0x28
k8s.io/apimachinery/pkg/labels.(Requirement).Matches(0xc023d81e40, {0x3e51720, 0xc01e0fac90})
    /home/runner/go/pkg/mod/k8s.io/apimachinery@v0.30.2/pkg/labels/selector.go:226 +0xcf
k8s.io/apimachinery/pkg/labels.internalSelector.Matches(...)
    /home/runner/go/pkg/mod/k8s.io/apimachinery@v0.30.2/pkg/labels/selector.go:387
github.com/opencost/opencost/pkg/costmodel.getPodServices({0x3e8bfc0?, 0xc0021d30e0?}, {0xc026234000, 0x158b, 0x2?}, {0xc000058023, 0x8})
    /home/runner/work/opencost/opencost/opencost/pkg/costmodel/costmodel.go:1428 +0x24c
github.com/opencost/opencost/pkg/costmodel.(CostModel).costDataRange(0xc005717020, {0x3e52020, 0xc000aa4620}, {0x3e95b30, 0xc025dac0a0}, {0xc026004078, 0xc026004048}, 0x34630b8a000, {0x0, 0x0}, ...)
    /home/runner/work/opencost/opencost/opencost/pkg/costmodel/costmodel.go:1794 +0x296d
github.com/opencost/opencost/pkg/costmodel.(CostModel).ComputeCostDataRange.func1()
    /home/runner/work/opencost/opencost/opencost/pkg/costmodel/costmodel.go:1689 +0x5e
golang.org/x/sync/singleflight.(Group).doCall.func2(0xc02604d136, 0xc0260140f0, 0xc000600001?)
    /home/runner/go/pkg/mod/golang.org/x/sync@v0.7.0/singleflight/singleflight.go:198 +0x64
golang.org/x/sync/singleflight.(Group).doCall(0x2a65080?, 0xc026006360?, {0xc026010080?, 0x32?}, 0xc025caf2a0?)
    /home/runner/go/pkg/mod/golang.org/x/sync@v0.7.0/singleflight/singleflight.go:200 +0x96
golang.org/x/sync/singleflight.(Group).Do(0xc019910aa0, {0xc026010080, 0x32}, 0xc025caf230)
    /home/runner/go/pkg/mod/golang.org/x/sync@v0.7.0/singleflight/singleflight.go:113 +0x15a
github.com/opencost/opencost/pkg/costmodel.(CostModel).ComputeCostDataRange(0xc005717020, {0x3e52020, 0xc000aa4620}, {0x3e95b30, 0xc025dac0a0}, {0xc026004078?, 0xc026004048?}, 0x34630b8a000, {0x0, 0x0}, ...)
    /home/runner/work/opencost/opencost/opencost/pkg/costmodel/costmodel.go:1688 +0x1f9
github.com/opencost/opencost/pkg/costmodel.(Accesses).ComputeAggregateCostModel(0xc0261a60f0, {0x3e52020, 0xc000aa4620}, {0xc026004078, 0xc026004048}, {0x2f66749, 0x9}, {0x56f0b40, 0x0, 0x0}, ...)
    /home/runner/work/opencost/opencost/opencost/pkg/costmodel/aggregation.go:1415 +0x1fb7
github.com/opencost/opencost/pkg/costmodel.(Accesses).warmAggregateCostModelCache.func1(0x0?, 0xdf8475800, 0x1)
    /home/runner/work/opencost/opencost/opencost/pkg/costmodel/aggregation.go:1785 +0x4fd
github.com/opencost/opencost/pkg/costmodel.(Accesses).warmAggregateCostModelCache.func2(0xc001406170)
    /home/runner/work/opencost/opencost/opencost/pkg/costmodel/aggregation.go:1818 +0x8e
created by github.com/opencost/opencost/pkg/costmodel.(*Accesses).warmAggregateCostModelCache in goroutine 1
    /home/runner/work/opencost/opencost/opencost/pkg/costmodel/aggregation.go:1810 +0x137

goroutine 1 [IO wait]: internal/poll.runtime_pollWait(0x7f7ac3328a08, 0x72) /opt/hostedtoolcache/go/1.22.4/x64/src/runtime/netpoll.go:345 +0x85 internal/poll.(pollDesc).wait(0x7?, 0x10?, 0x0) /opt/hostedtoolcache/go/1.22.4/x64/src/internal/poll/fd_poll_runtime.go:84 +0x27 internal/poll.(pollDesc).waitRead(...) /opt/hostedtoolcache/go/1.22.4/x64/src/internal/poll/fd_poll_runtime.go:89 internal/poll.(FD).Accept(0xc009834a80) /opt/hostedtoolcache/go/1.22.4/x64/src/internal/poll/fd_unix.go:611 +0x2ac net.(netFD).accept(0xc009834a80) /opt/hostedtoolcache/go/1.22.4/x64/src/net/fd_unix.go:172 +0x29 net.(TCPListener).accept(0xc009847b80) /opt/hostedtoolcache/go/1.22.4/x64/src/net/tcpsock_posix.go:159 +0x1e net.(TCPListener).Accept(0xc009847b80) /opt/hostedtoolcache/go/1.22.4/x64/src/net/tcpsock.go:327 +0x30 net/http.(Server).Serve(0xc0261a62d0, {0x3e645b0, 0xc009847b80}) /opt/hostedtoolcache/go/1.22.4/x64/src/net/http/server.go:3255 +0x33e net/http.(Server).ListenAndServe(0xc0261a62d0) /opt/hostedtoolcache/go/1.22.4/x64/src/net/http/server.go:3184 +0x71 net/http.ListenAndServe(...) /opt/hostedtoolcache/go/1.22.4/x64/src/net/http/server.go:3438 github.com/opencost/opencost/pkg/cmd/costmodel.Execute(0x2710901?) /home/runner/work/opencost/opencost/opencost/pkg/cmd/costmodel/costmodel.go:104 +0xcdf github.com/opencost/opencost/pkg/cmd.Execute.newCostModelCommand.func1(0xc000117900?, {0x2f5d9c9?, 0x4?, 0x2f5d9cd?}) /home/runner/work/opencost/opencost/opencost/pkg/cmd/commands.go:108 +0x2f github.com/spf13/cobra.(Command).execute(0xc001382dc8, {0x56f0b40, 0x0, 0x0}) /home/runner/go/pkg/mod/github.com/spf13/cobra@v1.2.1/command.go:856 +0x69d github.com/spf13/cobra.(Command).ExecuteC(0xc001383088) /home/runner/go/pkg/mod/github.com/spf13/cobra@v1.2.1/command.go:974 +0x38d github.com/spf13/cobra.(*Command).Execute(...) 
/home/runner/go/pkg/mod/github.com/spf13/cobra@v1.2.1/command.go:902 github.com/opencost/opencost/pkg/cmd.Execute(0x60?, {0x0, 0x0, 0x0}) /home/runner/work/opencost/opencost/opencost/pkg/cmd/commands.go:61 +0x3a5 main.main() /home/runner/work/opencost/opencost/opencost/cmd/costmodel/main.go:11 +0x1c

goroutine 30 [select]: go.opencensus.io/stats/view.(*worker).start(0xc000bcc000) /home/runner/go/pkg/mod/go.opencensus.io@v0.24.0/stats/view/worker.go:292 +0x9f created by go.opencensus.io/stats/view.init.0 in goroutine 1 /home/runner/go/pkg/mod/go.opencensus.io@v0.24.0/stats/view/worker.go:34 +0x8d

goroutine 32 [sync.Cond.Wait]: sync.runtime_notifyListWait(0xc00166e238, 0x3) /opt/hostedtoolcache/go/1.22.4/x64/src/runtime/sema.go:569 +0x159 sync.(Cond).Wait(0xc0175c5c00?) /opt/hostedtoolcache/go/1.22.4/x64/src/sync/cond.go:70 +0x85 k8s.io/client-go/tools/cache.(DeltaFIFO).Pop(0xc00166e210, 0xc003b1e420) /home/runner/go/pkg/mod/k8s.io/client-go@v0.30.2/tools/cache/delta_fifo.go:575 +0x236 k8s.io/client-go/tools/cache.(controller).processLoop(0xc003b14b40) /home/runner/go/pkg/mod/k8s.io/client-go@v0.30.2/tools/cache/controller.go:188 +0x30 k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?) /home/runner/go/pkg/mod/k8s.io/apimachinery@v0.30.2/pkg/util/wait/backoff.go:226 +0x33 k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc003bb8f58, {0x3e44ce0, 0xc003ba4090}, 0x1, 0xc00149f1a0) /home/runner/go/pkg/mod/k8s.io/apimachinery@v0.30.2/pkg/util/wait/backoff.go:227 +0xaf k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc003bb8f58, 0x3b9aca00, 0x0, 0x1, 0xc00149f1a0) /home/runner/go/pkg/mod/k8s.io/apimachinery@v0.30.2/pkg/util/wait/backoff.go:204 +0x7f k8s.io/apimachinery/pkg/util/wait.Until(...) /home/runner/go/pkg/mod/k8s.io/apimachinery@v0.30.2/pkg/util/wait/backoff.go:161 k8s.io/client-go/tools/cache.(controller).Run(0xc003b14b40, 0xc00149f1a0) /home/runner/go/pkg/mod/k8s.io/client-go@v0.30.2/tools/cache/controller.go:159 +0x35e created by github.com/opencost/opencost/pkg/clustercache.(*CachingWatchController).WarmUp in goroutine 85 /home/runner/work/opencost/opencost/opencost/pkg/clustercache/watchcontroller.go:169 +0x98

goroutine 12 [chan receive]: github.com/opencost/opencost/pkg/errors.SetPanicHandler.func1() /home/runner/work/opencost/opencost/opencost/pkg/errors/panic.go:59 +0x45 created by github.com/opencost/opencost/pkg/errors.SetPanicHandler in goroutine 1 /home/runner/work/opencost/opencost/opencost/pkg/errors/panic.go:57 +0x65

goroutine 13 [sync.Cond.Wait]: sync.runtime_notifyListWait(0xc0010ab990, 0x17) /opt/hostedtoolcache/go/1.22.4/x64/src/runtime/sema.go:569 +0x159 sync.(Cond).Wait(0x30?) /opt/hostedtoolcache/go/1.22.4/x64/src/sync/cond.go:70 +0x85 github.com/opencost/opencost/core/pkg/collections.(blockingSliceQueue[...]).Dequeue(0x3e87720) /home/runner/work/opencost/opencost/opencost/core/pkg/collections/blockingqueue.go:73 +0x93 github.com/opencost/opencost/pkg/prom.(*RateLimitedPrometheusClient).worker(0xc000aa4620) /home/runner/work/opencost/opencost/opencost/pkg/prom/prom.go:250 +0x9e created by github.com/opencost/opencost/pkg/prom.NewRateLimitedClient in goroutine 1 /home/runner/work/opencost/opencost/opencost/pkg/prom/prom.go:195 +0x4ba

goroutine 14 [select]: net/http.(persistConn).roundTrip(0xc025c97e60, 0xc0270ae540) /opt/hostedtoolcache/go/1.22.4/x64/src/net/http/transport.go:2675 +0x979 net/http.(Transport).roundTrip(0xc000cacb40, 0xc025e7afc0) /opt/hostedtoolcache/go/1.22.4/x64/src/net/http/transport.go:608 +0x79a net/http.(Transport).RoundTrip(0x2cb2240?, 0xc0270a92f0?) /opt/hostedtoolcache/go/1.22.4/x64/src/net/http/roundtrip.go:17 +0x13 github.com/opencost/opencost/core/pkg/util/httputil.userAgentTransport.RoundTrip({{0xc000c9f3b0?, 0xc02544f998?}, {0x3e434c0?, 0xc000cacb40?}}, 0xc025e7aea0) /home/runner/work/opencost/opencost/opencost/core/pkg/util/httputil/roundtrip.go:30 +0x291 net/http.send(0xc025e7aea0, {0x3e438c0, 0xc00139e480}, {0x1?, 0x0?, 0x0?}) /opt/hostedtoolcache/go/1.22.4/x64/src/net/http/client.go:259 +0x5e4 net/http.(Client).send(0xc0010ab948, 0xc025e7aea0, {0xc003a09f80?, 0x10?, 0x0?}) /opt/hostedtoolcache/go/1.22.4/x64/src/net/http/client.go:180 +0x98 net/http.(Client).do(0xc0010ab948, 0xc025e7aea0) /opt/hostedtoolcache/go/1.22.4/x64/src/net/http/client.go:724 +0x8dc net/http.(Client).Do(...) /opt/hostedtoolcache/go/1.22.4/x64/src/net/http/client.go:590 github.com/prometheus/client_golang/api.(httpClient).Do(0xc0010ab940, {0x3e733f0, 0x56f0b40}, 0xc025d1a480?) /home/runner/go/pkg/mod/github.com/prometheus/client_golang@v1.17.0/api/client.go:125 +0x148 github.com/opencost/opencost/pkg/prom.(RateLimitedPrometheusClient).worker(0xc000aa4620) /home/runner/work/opencost/opencost/opencost/pkg/prom/prom.go:274 +0x1d3 created by github.com/opencost/opencost/pkg/prom.NewRateLimitedClient in goroutine 1 /home/runner/work/opencost/opencost/opencost/pkg/prom/prom.go:195 +0x4ba

goroutine 15 [sync.Cond.Wait]: sync.runtime_notifyListWait(0xc0010ab990, 0x15) /opt/hostedtoolcache/go/1.22.4/x64/src/runtime/sema.go:569 +0x159 sync.(Cond).Wait(0x30?) /opt/hostedtoolcache/go/1.22.4/x64/src/sync/cond.go:70 +0x85 github.com/opencost/opencost/core/pkg/collections.(blockingSliceQueue[...]).Dequeue(0x3e87720) /home/runner/work/opencost/opencost/opencost/core/pkg/collections/blockingqueue.go:73 +0x93 github.com/opencost/opencost/pkg/prom.(*RateLimitedPrometheusClient).worker(0xc000aa4620) /home/runner/work/opencost/opencost/opencost/pkg/prom/prom.go:250 +0x9e created by github.com/opencost/opencost/pkg/prom.NewRateLimitedClient in goroutine 1 /home/runner/work/opencost/opencost/opencost/pkg/prom/prom.go:195 +0x4ba

goroutine 16 [sync.Cond.Wait]: sync.runtime_notifyListWait(0xc0010ab990, 0x16) /opt/hostedtoolcache/go/1.22.4/x64/src/runtime/sema.go:569 +0x159 sync.(Cond).Wait(0x30?) /opt/hostedtoolcache/go/1.22.4/x64/src/sync/cond.go:70 +0x85 github.com/opencost/opencost/core/pkg/collections.(blockingSliceQueue[...]).Dequeue(0x3e87720) /home/runner/work/opencost/opencost/opencost/core/pkg/collections/blockingqueue.go:73 +0x93 github.com/opencost/opencost/pkg/prom.(*RateLimitedPrometheusClient).worker(0xc000aa4620) /home/runner/work/opencost/opencost/opencost/pkg/prom/prom.go:250 +0x9e created by github.com/opencost/opencost/pkg/prom.NewRateLimitedClient in goroutine 1 /home/runner/work/opencost/opencost/opencost/pkg/prom/prom.go:195 +0x4ba

goroutine 49 [select]: net/http.(persistConn).roundTrip(0xc025c96d80, 0xc0260902c0) /opt/hostedtoolcache/go/1.22.4/x64/src/net/http/transport.go:2675 +0x979 net/http.(Transport).roundTrip(0xc000cacb40, 0xc0260a9b00) /opt/hostedtoolcache/go/1.22.4/x64/src/net/http/transport.go:608 +0x79a net/http.(Transport).RoundTrip(0x2cb2240?, 0xc026084810?) /opt/hostedtoolcache/go/1.22.4/x64/src/net/http/roundtrip.go:17 +0x13 github.com/opencost/opencost/core/pkg/util/httputil.userAgentTransport.RoundTrip({{0xc000c9f3b0?, 0xc025cb3998?}, {0x3e434c0?, 0xc000cacb40?}}, 0xc0260a99e0) /home/runner/work/opencost/opencost/opencost/core/pkg/util/httputil/roundtrip.go:30 +0x291 net/http.send(0xc0260a99e0, {0x3e438c0, 0xc00139e480}, {0x1?, 0x18?, 0x0?}) /opt/hostedtoolcache/go/1.22.4/x64/src/net/http/client.go:259 +0x5e4 net/http.(Client).send(0xc0010ab948, 0xc0260a99e0, {0xc001aa4000?, 0xc001aa4000?, 0x0?}) /opt/hostedtoolcache/go/1.22.4/x64/src/net/http/client.go:180 +0x98 net/http.(Client).do(0xc0010ab948, 0xc0260a99e0) /opt/hostedtoolcache/go/1.22.4/x64/src/net/http/client.go:724 +0x8dc net/http.(Client).Do(...) /opt/hostedtoolcache/go/1.22.4/x64/src/net/http/client.go:590 github.com/prometheus/client_golang/api.(httpClient).Do(0xc0010ab940, {0x3e733f0, 0x56f0b40}, 0xc02619f440?) /home/runner/go/pkg/mod/github.com/prometheus/client_golang@v1.17.0/api/client.go:125 +0x148 github.com/opencost/opencost/pkg/prom.(RateLimitedPrometheusClient).worker(0xc000aa4620) /home/runner/work/opencost/opencost/opencost/pkg/prom/prom.go:274 +0x1d3 created by github.com/opencost/opencost/pkg/prom.NewRateLimitedClient in goroutine 1 /home/runner/work/opencost/opencost/opencost/pkg/prom/prom.go:195 +0x4ba

goroutine 43 [chan receive]: k8s.io/client-go/tools/cache.(controller).Run.func1() /home/runner/go/pkg/mod/k8s.io/client-go@v0.30.2/tools/cache/controller.go:132 +0x25 created by k8s.io/client-go/tools/cache.(controller).Run in goroutine 119 /home/runner/go/pkg/mod/k8s.io/client-go@v0.30.2/tools/cache/controller.go:131 +0xa9

goroutine 40 [IO wait]: internal/poll.runtime_pollWait(0x7f7ac3328de8, 0x72) /opt/hostedtoolcache/go/1.22.4/x64/src/runtime/netpoll.go:345 +0x85 internal/poll.(pollDesc).wait(0xc0014a0080?, 0xc00d240000?, 0x0) /opt/hostedtoolcache/go/1.22.4/x64/src/internal/poll/fd_poll_runtime.go:84 +0x27 internal/poll.(pollDesc).waitRead(...) /opt/hostedtoolcache/go/1.22.4/x64/src/internal/poll/fd_poll_runtime.go:89 internal/poll.(FD).Read(0xc0014a0080, {0xc00d240000, 0xa000, 0xa000}) /opt/hostedtoolcache/go/1.22.4/x64/src/internal/poll/fd_unix.go:164 +0x27a net.(netFD).Read(0xc0014a0080, {0xc00d240000?, 0x7f7ac1f7e398?, 0xc0177238f0?}) /opt/hostedtoolcache/go/1.22.4/x64/src/net/fd_posix.go:55 +0x25 net.(conn).Read(0xc001406000, {0xc00d240000?, 0xc001480930?, 0x411d7b?}) /opt/hostedtoolcache/go/1.22.4/x64/src/net/net.go:179 +0x45 crypto/tls.(atLeastReader).Read(0xc0177238f0, {0xc00d240000?, 0x0?, 0xc0177238f0?}) /opt/hostedtoolcache/go/1.22.4/x64/src/crypto/tls/conn.go:806 +0x3b bytes.(Buffer).ReadFrom(0xc00140a2b0, {0x3e44220, 0xc0177238f0}) /opt/hostedtoolcache/go/1.22.4/x64/src/bytes/buffer.go:211 +0x98 crypto/tls.(Conn).readFromUntil(0xc00140a008, {0x3e44060, 0xc001406000}, 0xc001480978?) /opt/hostedtoolcache/go/1.22.4/x64/src/crypto/tls/conn.go:828 +0xde crypto/tls.(Conn).readRecordOrCCS(0xc00140a008, 0x0) /opt/hostedtoolcache/go/1.22.4/x64/src/crypto/tls/conn.go:626 +0x3cf crypto/tls.(Conn).readRecord(...) 
/opt/hostedtoolcache/go/1.22.4/x64/src/crypto/tls/conn.go:588 crypto/tls.(Conn).Read(0xc00140a008, {0xc000bb9000, 0x1000, 0x8?}) /opt/hostedtoolcache/go/1.22.4/x64/src/crypto/tls/conn.go:1370 +0x156 net/http.(persistConn).Read(0xc0010d19e0, {0xc000bb9000?, 0xc00197b020?, 0xc001480d38?}) /opt/hostedtoolcache/go/1.22.4/x64/src/net/http/transport.go:1977 +0x4a bufio.(Reader).fill(0xc001a31860) /opt/hostedtoolcache/go/1.22.4/x64/src/bufio/bufio.go:110 +0x103 bufio.(Reader).Peek(0xc001a31860, 0x1) /opt/hostedtoolcache/go/1.22.4/x64/src/bufio/bufio.go:148 +0x53 net/http.(persistConn).readLoop(0xc0010d19e0) /opt/hostedtoolcache/go/1.22.4/x64/src/net/http/transport.go:2141 +0x1b9 created by net/http.(Transport).dialConn in goroutine 50 /opt/hostedtoolcache/go/1.22.4/x64/src/net/http/transport.go:1799 +0x152f

goroutine 67 [select]: k8s.io/client-go/util/workqueue.(*delayingType).waitingLoop(0xc00164c9c0) /home/runner/go/pkg/mod/k8s.io/client-go@v0.30.2/util/workqueue/delaying_queue.go:276 +0x2ff created by k8s.io/client-go/util/workqueue.newDelayingQueue in goroutine 1 /home/runner/go/pkg/mod/k8s.io/client-go@v0.30.2/util/workqueue/delaying_queue.go:113 +0x205

goroutine 68 [select]: k8s.io/client-go/util/workqueue.(*delayingType).waitingLoop(0xc00164cba0) /home/runner/go/pkg/mod/k8s.io/client-go@v0.30.2/util/workqueue/delaying_queue.go:276 +0x2ff created by k8s.io/client-go/util/workqueue.newDelayingQueue in goroutine 1 /home/runner/go/pkg/mod/k8s.io/client-go@v0.30.2/util/workqueue/delaying_queue.go:113 +0x205

goroutine 41 [select]: net/http.(persistConn).writeLoop(0xc0010d19e0) /opt/hostedtoolcache/go/1.22.4/x64/src/net/http/transport.go:2444 +0xf0 created by net/http.(Transport).dialConn in goroutine 50 /opt/hostedtoolcache/go/1.22.4/x64/src/net/http/transport.go:1800 +0x1585

goroutine 55 [sync.Cond.Wait]: sync.runtime_notifyListWait(0xc00166e2e8, 0x1) /opt/hostedtoolcache/go/1.22.4/x64/src/runtime/sema.go:569 +0x159 sync.(Cond).Wait(0xc003f0e140?) /opt/hostedtoolcache/go/1.22.4/x64/src/sync/cond.go:70 +0x85 k8s.io/client-go/tools/cache.(DeltaFIFO).Pop(0xc00166e2c0, 0xc003b1e5d0) /home/runner/go/pkg/mod/k8s.io/client-go@v0.30.2/tools/cache/delta_fifo.go:575 +0x236 k8s.io/client-go/tools/cache.(controller).processLoop(0xc003b14be0) /home/runner/go/pkg/mod/k8s.io/client-go@v0.30.2/tools/cache/controller.go:188 +0x30 k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?) /home/runner/go/pkg/mod/k8s.io/apimachinery@v0.30.2/pkg/util/wait/backoff.go:226 +0x33 k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc003b97f58, {0x3e44ce0, 0xc001c667b0}, 0x1, 0xc00149f1a0) /home/runner/go/pkg/mod/k8s.io/apimachinery@v0.30.2/pkg/util/wait/backoff.go:227 +0xaf k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc003b97f58, 0x3b9aca00, 0x0, 0x1, 0xc00149f1a0) /home/runner/go/pkg/mod/k8s.io/apimachinery@v0.30.2/pkg/util/wait/backoff.go:204 +0x7f k8s.io/apimachinery/pkg/util/wait.Until(...) /home/runner/go/pkg/mod/k8s.io/apimachinery@v0.30.2/pkg/util/wait/backoff.go:161 k8s.io/client-go/tools/cache.(controller).Run(0xc003b14be0, 0xc00149f1a0) /home/runner/go/pkg/mod/k8s.io/client-go@v0.30.2/tools/cache/controller.go:159 +0x35e created by github.com/opencost/opencost/pkg/clustercache.(*CachingWatchController).WarmUp in goroutine 82 /home/runner/work/opencost/opencost/opencost/pkg/clustercache/watchcontroller.go:169 +0x98

....

AjayTripathy commented 2 months ago

@umats we're looking; have a sense for the rough size of cluster where this is happening?

cc @cliffcolvin for triage.

ameijer commented 2 months ago

so, looking at this, you can see how big clusters can cause this to choke. Looking at the log, the read is here:

https://github.com/opencost/opencost/blob/69d8e473b60648dcb468964944043e1929a66ee8/pkg/costmodel/costmodel.go#L1428

Which will be hit on every allocation query AFAICT.

That is a pointer to a map coming from here: https://github.com/opencost/opencost/blob/69d8e473b60648dcb468964944043e1929a66ee8/pkg/clustercache/clustercache.go#L213

So, the map is a pointer coming right out of the cache, which explains the single flight errors. I'll bet that the cluster cache is still updating things by the time the query comes.
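To make the suspected failure mode concrete, here is a minimal sketch of the pattern, not OpenCost's actual code. `labelCache`, `Shared`, and `Snapshot` are hypothetical names. `Shared` hands out the cache's own map (the situation described above), while `Snapshot` copies it under a read lock so the caller can read it while the cache keeps writing. It assumes Go 1.21+ for the standard `maps` package.

```go
package main

import (
	"fmt"
	"maps"
	"sync"
)

// labelCache is a hypothetical stand-in for a cache that owns label maps
// and keeps updating them in the background.
type labelCache struct {
	mu     sync.RWMutex
	labels map[string]map[string]string // pod name -> labels
}

func newLabelCache() *labelCache {
	return &labelCache{labels: make(map[string]map[string]string)}
}

// Set mutates the stored label map in place under the write lock.
func (c *labelCache) Set(pod, key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.labels[pod] == nil {
		c.labels[pod] = make(map[string]string)
	}
	c.labels[pod][key] = value
}

// Shared returns the cache's own map. The lock is released on return, so a
// caller that reads the result while Set is running races with the writer.
func (c *labelCache) Shared(pod string) map[string]string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.labels[pod]
}

// Snapshot returns a private copy taken under the read lock, which the
// caller can read without further synchronization.
func (c *labelCache) Snapshot(pod string) map[string]string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return maps.Clone(c.labels[pod])
}

func main() {
	c := newLabelCache()
	c.Set("pod-a", "app", "web")

	var wg sync.WaitGroup
	wg.Add(2)
	go func() { // writer: keeps updating the cached labels
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			c.Set("pod-a", "rev", fmt.Sprint(i))
		}
	}()
	go func() { // reader: only ever touches its own snapshot
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			snap := c.Snapshot("pod-a")
			_ = snap["app"]
		}
	}()
	wg.Wait()
	fmt.Println(c.Snapshot("pod-a")["app"])
}
```

Swapping `Snapshot` for `Shared` in the reader goroutine is the kind of change that `go run -race` would be expected to flag, and it can end in the same fatal error as in the report.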

A relatively straightforward thing we could try out to test this theory is to wrap the existing caching indexer in a thread-safe store: https://pkg.go.dev/k8s.io/client-go/tools/cache#NewThreadSafeStore

ameijer commented 2 months ago

The challenge here is that these maps that are being accessed are owned by the caching k8s API client. So we need to figure out how to obtain a thread safe map out of that... but how do you do that when it is getting written to?
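One general answer to "how do you read a map that is still being written to" is copy-on-write: the writer never mutates a published map, it clones it, edits the clone, and atomically swaps a pointer. The sketch below shows that technique with the standard library only. `cowLabels` and its methods are hypothetical, and whether the k8s caching client can be made to publish maps this way is an open assumption, not something confirmed in this thread. It assumes Go 1.21+ (`maps.Clone`, `atomic.Pointer`).

```go
package main

import (
	"fmt"
	"maps"
	"sync"
	"sync/atomic"
)

// cowLabels is a hypothetical copy-on-write holder for one label map.
type cowLabels struct {
	// writeMu serializes writers so concurrent Sets do not lose updates.
	writeMu sync.Mutex
	// current points at the published map, which is never mutated.
	current atomic.Pointer[map[string]string]
}

func newCOWLabels() *cowLabels {
	c := &cowLabels{}
	empty := map[string]string{}
	c.current.Store(&empty)
	return c
}

// Load returns the currently published map without taking a lock.
// Callers must treat the result as read-only.
func (c *cowLabels) Load() map[string]string {
	return *c.current.Load()
}

// Set clones the published map, edits the clone, and publishes the clone.
// Readers holding the old map keep a consistent, unchanging view.
func (c *cowLabels) Set(key, value string) {
	c.writeMu.Lock()
	defer c.writeMu.Unlock()
	next := maps.Clone(*c.current.Load())
	next[key] = value
	c.current.Store(&next)
}

func main() {
	c := newCOWLabels()
	c.Set("app", "web")

	var wg sync.WaitGroup
	wg.Add(2)
	go func() { // writer: publishes a new map on every update
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			c.Set("rev", fmt.Sprint(i))
		}
	}()
	go func() { // reader: lock-free reads of whichever map is published
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			_ = c.Load()["app"]
		}
	}()
	wg.Wait()
	fmt.Println(c.Load()["app"], len(c.Load()))
}
```

The trade-off is that every write pays for a full copy of the map, which is usually acceptable for small label sets that are read far more often than they change.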

umats commented 2 months ago

> @umats we're looking; have a sense for the rough size of cluster where this is happening?
>
> cc @cliffcolvin for triage.

Hi. It's about 800 namespaces running about 5k pods on about 50 nodes.