upbound / provider-terraform

A @crossplane provider for Terraform

concurrent map iteration and map write panic #243

Open toastwaffle opened 4 months ago

toastwaffle commented 4 months ago

What happened?

fatal error: concurrent map iteration and map write

goroutine 169 [running]:
k8s.io/apimachinery/pkg/apis/meta/v1.(*ObjectMeta).DeepCopyInto(0xc00117dd60, 0xc000ffc4a0)
        k8s.io/apimachinery@v0.28.4/pkg/apis/meta/v1/zz_generated.deepcopy.go:689 +0x396
github.com/upbound/provider-terraform/apis/v1beta1.(*Workspace).DeepCopyInto(0xc00117dd40, 0xc000ffc480)
        github.com/upbound/provider-terraform/apis/v1beta1/zz_generated.deepcopy.go:369 +0xc5
github.com/upbound/provider-terraform/apis/v1beta1.(*Workspace).DeepCopy(...)
        github.com/upbound/provider-terraform/apis/v1beta1/zz_generated.deepcopy.go:380
github.com/upbound/provider-terraform/apis/v1beta1.(*Workspace).DeepCopyObject(0xc00117dd40?)
        github.com/upbound/provider-terraform/apis/v1beta1/zz_generated.deepcopy.go:386 +0x45
sigs.k8s.io/controller-runtime/pkg/cache/internal.(*CacheReader).Get(0xc000349900, {0xc000642b40?, 0x1f52e2e?}, {{0x0?, 0x1f4a136?}, {0xc000c05460?, 0x1ae2b79?}}, {0x23fc040, 0xc00140b680}, {0x0, ...})
        sigs.k8s.io/controller-runtime@v0.16.3/pkg/cache/internal/cache_reader.go:88 +0x145
sigs.k8s.io/controller-runtime/pkg/cache.(*informerCache).Get(0xc0000109f0, {0x23e5740, 0xc000642b40}, {{0x0?, 0x40bd45?}, {0xc000c05460?, 0x1f0ac20?}}, {0x23fc040?, 0xc00140b680?}, {0x0, ...})
        sigs.k8s.io/controller-runtime@v0.16.3/pkg/cache/informer_cache.go:88 +0x1f9
sigs.k8s.io/controller-runtime/pkg/client.(*client).Get(0xc000448360, {0x23e5740, 0xc000642b40}, {{0x0?, 0x0?}, {0xc000c05460?, 0xc0004d4d80?}}, {0x23fc040?, 0xc00140b680?}, {0x0, ...})
        sigs.k8s.io/controller-runtime@v0.16.3/pkg/client/client.go:348 +0x4c4
github.com/crossplane/crossplane-runtime/pkg/reconciler/managed.(*Reconciler).Reconcile(0xc0000cb770, {0x23e5778, 0xc000642900}, {{{0x0, 0x0}, {0xc000c05460, 0x1c}}})
        github.com/crossplane/crossplane-runtime@v1.14.0-rc.0.0.20230912122805-43c9ceeb2071/pkg/reconciler/managed/reconciler.go:741 +0x2f9
github.com/crossplane/crossplane-runtime/pkg/ratelimiter.(*Reconciler).Reconcile(0xc0002b0d20, {0x23e5778, 0xc000642900}, {{{0x0?, 0x5?}, {0xc000c05460?, 0xc000dd5d48?}}})
        github.com/crossplane/crossplane-runtime@v1.14.0-rc.0.0.20230912122805-43c9ceeb2071/pkg/ratelimiter/reconciler.go:54 +0x16b
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x23e8c50?, {0x23e5778?, 0xc000642900?}, {{{0x0?, 0xb?}, {0xc000c05460?, 0x0?}}})
        sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119 +0xc8
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0003daaa0, {0x23e56d0, 0xc00013dbd0}, {0x1d4a1c0?, 0xc0004551a0?})
        sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316 +0x3ca
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0003daaa0, {0x23e56d0, 0xc00013dbd0})
        sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266 +0x1d9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
        sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
        sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:223 +0x587

Full thread dump is here. I intend to do some debugging myself, but I'm creating the issue now.

How can we reproduce it?

Absolutely no idea

What environment did it happen in?

toastwaffle commented 3 months ago

Okay, finally getting around to debugging this. I'm not sure how accurate the thread dump you get from a concurrent map iteration and map write fatal error is, though.

k8s.io/apimachinery@v0.28.4/pkg/apis/meta/v1/zz_generated.deepcopy.go:689 is copying the Workspace annotations. I think 0xc00117dd40 is the address of the Workspace being copied, but that doesn't appear anywhere else in the thread dump.
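
For reference, the map copy that deepcopy-gen emits for ObjectMeta annotations has this shape (reconstructed from the standard generator output rather than copied from that exact file); the range over the source map is the iteration the runtime flagged:

// Reconstructed shape of the generated deep copy for ObjectMeta.Annotations.
// The `for ... range` is the "map iteration" side of the reported race.
if in.Annotations != nil {
	in, out := &in.Annotations, &out.Annotations
	*out = make(map[string]string, len(*in))
	for key, val := range *in {
		(*out)[key] = val
	}
}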

It would appear that the controller work queue is only mostly deduplicated. I found some PRs:

I think the only thing we could do to fix this would be to disable the cache for Workspace objects (by adding Workspace to the DisableFor option in the client CacheOptions). I don't think that is particularly harmful (it means more Get calls to the API server), but equally I'm not totally sure it's worthwhile, as it doesn't solve the problem of multiple concurrent reconciles.
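
For illustration, a minimal sketch of that change (not the provider's actual setup code; it assumes controller-runtime v0.16 and the usual manager wiring):

// Sketch: disable the client cache for Workspace objects so reads go
// straight to the API server instead of the shared informer cache.
package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	"github.com/upbound/provider-terraform/apis/v1beta1"
)

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Client: client.Options{
			Cache: &client.CacheOptions{
				// Workspace reads bypass the cache entirely.
				DisableFor: []client.Object{&v1beta1.Workspace{}},
			},
		},
	})
	if err != nil {
		panic(err)
	}
	_ = mgr // register controllers and call mgr.Start(...) as usual
}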

It's worth noting that while crossplane-runtime does configure the controller to recover from panics, concurrent map iteration and map write is not a panic; it's a fatal runtime error, which is unrecoverable.
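
A standalone demonstration (nothing beyond the standard library): the runtime throws a fatal error that skips deferred recover handlers entirely, so no controller-level recovery can catch it.

// This program dies with "fatal error: concurrent map iteration and map
// write"; the deferred recover() never runs because a fatal runtime error
// does not unwind like a panic.
package main

import "fmt"

func main() {
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered:", r) // never reached
		}
	}()
	m := map[int]int{0: 0}
	go func() {
		for i := 0; ; i++ {
			m[i] = i // concurrent writer
		}
	}()
	for {
		for range m { // concurrent iterator
		}
	}
}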

bobh66 commented 3 months ago

@toastwaffle Thanks for looking into this! Just to make sure I understand the scenario:

Are the read and write operating on the same shared client cache?

It's unfortunate that it's a fatal error and not a panic; even two panics, one on each goroutine, would be preferable to an unrecoverable error.

toastwaffle commented 3 months ago

I'm suddenly doubting my hypothesis: if there were two concurrent reconciles of the same Workspace, the first must have DeepCopy'd the Workspace as part of the cached Get call, so any writes to the annotations should be on a different map than the one in the cache. That makes me wonder if something is reading from the same indexer without doing the deep copy, but I don't know why such a thing would be writing to the annotations.
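
That reasoning can be checked in isolation (a toy example, assuming only apimachinery on the module path): a DeepCopy'd object gets its own annotations map, so writes to the copy cannot race with the cached original.

// DeepCopy allocates a fresh Annotations map, so mutating the copy
// leaves the cached object's map untouched.
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	cached := &metav1.ObjectMeta{Annotations: map[string]string{"a": "1"}}
	dup := cached.DeepCopy()
	dup.Annotations["b"] = "2"           // write goes to the copy's map
	fmt.Println(len(cached.Annotations)) // prints 1: the original is unchanged
}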

My only remaining hypotheses are that something is wrong with the DeepCopy implementation, or that something is wrong with Go's concurrent iteration/write detection; both are exceedingly unlikely!

I'll think about this some more tomorrow to see if I can come up with anything better...

toastwaffle commented 3 months ago

Okay, I have not been able to come up with any other hypotheses. I'm going to try to run the provider under the race detector, but apparently "memory usage may increase by 5-10x and execution time by 2-20x" :grimacing:
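
For anyone picking this up later, enabling the detector is just a flag on the standard Go toolchain (the package paths here assume the conventional layout; adjust as needed):

# Build the provider binary with the race detector compiled in.
go build -race ./...
# Or run the unit tests under the detector.
go test -race ./...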

bobh66 commented 6 days ago

@toastwaffle Any update on this?

toastwaffle commented 3 days ago

Hi @bobh66, sorry for dropping the ball on this. Unfortunately other priorities took over, and I've since moved to a different team at my company which isn't using Crossplane. I've lost most of my permissions, but from the logs I still have access to I can see that it has happened 4 times in the past 2 weeks. I'll see if I can get somebody on my old team to look into this some more.