Open toastwaffle opened 9 months ago
Okay, finally getting around to debugging this. I'm not sure how accurate the thread dump you get from a concurrent map iteration and map write is though.
k8s.io/apimachinery@v0.28.4/pkg/apis/meta/v1/zz_generated.deepcopy.go:689 is copying the Workspace annotations. I think 0xc00117dd40 is the address of the Workspace being copied, but that doesn't appear anywhere else in the thread dump.
It would appear that the controller work queue is only mostly deduplicated. I found some PRs:
I think the only thing we could do to fix this would be to disable the cache for Workspace objects (by adding Workspace to the DisableFor option in the client CacheOptions). I don't think that is particularly harmful (it means more Get calls to the API server), but equally I'm not totally sure it's worthwhile (as it doesn't solve the problem of multiple concurrent reconciles).
It's worth noting that while crossplane-runtime does configure the controller to recover from panics, concurrent map iteration and map write is not a panic; it's a runtime error which is unrecoverable.
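To illustrate the distinction, here is a minimal sketch of the kind of recover wrapper that catches ordinary panics. The helper name safeCall is hypothetical and this is not the crossplane-runtime implementation; it only shows the general mechanism, and why it cannot help with the fatal error described above.

```go
package main

import "fmt"

// safeCall is a hypothetical helper: it runs fn and converts an ordinary
// panic into an error, similar in spirit to a reconciler wrapper that
// recovers from panics.
func safeCall(fn func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	fn()
	return nil
}

func main() {
	// An ordinary panic is caught by the deferred recover.
	err := safeCall(func() { panic("boom") })
	fmt.Println(err)

	// By contrast, when the runtime detects unsynchronized map access it
	// raises a fatal error and terminates the process. Deferred functions
	// cannot intercept it, so wrapping such code in safeCall would not help.
}
```

The only real protection is to make sure the map is never shared between goroutines without synchronization in the first place.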
@toastwaffle Thanks for looking into this! Just to make sure I understand the scenario:
Workspace instance
Workspace gets dequeued and starts processing
client.Get to retrieve a copy of the same Workspace that is already being processed

Is the read and write operating on the same shared client cache?
It's unfortunate that it's a fatal error and not a panic - even two panics, one on each goroutine, would be preferable to an unrecoverable error.
I'm suddenly doubting my hypothesis - if it were 2 concurrent reconciles of the same Workspace, the first must have DeepCopy'd the Workspace as part of the cached Get call, and so any writes to the annotations should be on a different map than the one in the cache. Makes me wonder if there is something reading from the same indexer without doing the deepcopy, but I don't know why such a thing would be writing to the annotations.
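For reference, a small sketch of why the deep copy matters. The object type and its deepCopy method are hypothetical stand-ins, not the generated Kubernetes code: a plain struct copy still shares the annotations map with the cached object, while a deep copy gets its own map.

```go
package main

import "fmt"

// object is a hypothetical stand-in for a cached resource with annotations.
type object struct {
	Name        string
	Annotations map[string]string
}

// deepCopy returns a copy whose Annotations map is independent of the original.
func (o *object) deepCopy() *object {
	out := &object{Name: o.Name}
	if o.Annotations != nil {
		out.Annotations = make(map[string]string, len(o.Annotations))
		for k, v := range o.Annotations {
			out.Annotations[k] = v
		}
	}
	return out
}

func main() {
	cached := &object{Name: "ws", Annotations: map[string]string{"a": "1"}}

	// Struct copy: Annotations still refers to the same underlying map,
	// so this write is visible through the cached object.
	shallow := *cached
	shallow.Annotations["b"] = "2"
	_, shared := cached.Annotations["b"]
	fmt.Println("shallow write visible in cache:", shared)

	// Deep copy: the write lands in a separate map.
	deep := cached.deepCopy()
	deep.Annotations["c"] = "3"
	_, leaked := cached.Annotations["c"]
	fmt.Println("deep write visible in cache:", leaked)
}
```

If every reader of the cache goes through a deep copy like this, a write to the copy's annotations cannot race with an iteration over the cached map, which is why the hypothesis above looks doubtful.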
My only remaining hypotheses are that something is wrong with the DeepCopy implementation, or that something is wrong with the Golang concurrent iteration/write detection - both are exceedingly unlikely!
I'll think about this some more tomorrow to see if I can come up with anything better...
Okay, I have not been able to come up with any other hypotheses. I'm going to try to run the provider under the race detector, but apparently "memory usage may increase by 5-10x and execution time by 2-20x" :grimacing:
@toastwaffle Any update on this?
Hi @bobh66, sorry for dropping the ball on this. Unfortunately other priorities took over, and I've since moved to a different team at my company which isn't using Crossplane. I've lost most of my permissions, but I can see from the logs I still have access to that it has happened 4 times in the past 2 weeks - I'll see if I can get somebody on my old team to look into this some more.
What happened?
Full thread dump is here. I intend to do some debugging myself, but creating the issue now.
How can we reproduce it?
Absolutely no idea
What environment did it happen in?