Service Controller can call provider EnsureLoadBalancer with a dirty Service object

mdbooth commented 2 years ago

I observed this with cloud-provider-openstack, but I believe this is a framework bug not a provider bug. Specifically we had a permissions issue which caused this Patch to always fail: https://github.com/kubernetes/cloud-provider-openstack/blob/21ce6dd984e5599ba3c411c960acdbcf9b5dbf03/pkg/openstack/loadbalancer.go#L1783

However, completely deterministically this succeeds the second time it is called. There is no way for this reconcile to succeed without setting an annotation, but after the successful reconciliation we see that the annotation is not set. The only explanation I could muster for this is that we were called with the dirty object from the first reconciliation. Consequently the Patcher is based on a dirty object which already has the annotation set, even though we never succeeded to write that to the API, and therefore the Patcher believes second time round that it doesn't need to write the annotation and we erroneously succeed without having written the required annotation.

The obvious suspect was the cache used by the service controller, however for this specific call path we are always passing in a Service object which came from the informer.

...The informer code is hard to follow...

However, I believe our Service object is ultimately coming from a ThreadSafeStore. 2 things to note about this class:

Its documentation says not to modify anything returned by it
All its accessors return a shallow copy of the cached item

I believe the second thing is the immediate cause of the issue which we hit in cloud-provider-openstack. Annotations, being a slice, are not copied. Therefore the annotations in the second invocation are the same as the first invocation.

Given that:

the documentation says not to modify these items
AFAIK it's not a bug for a reconciler to add an annotation to the object

I believe we need to deep copy the object returned by the informer before passing it to the sync loop.

mdbooth commented 2 years ago

Related (but separate), I think we should remove the serviceCache. The cached was added in https://github.com/kubernetes/kubernetes/commit/fc08a0a71b7a8b00271da28c14c4d0897c502988 in 2015, and I suspect that the reasons it was originally added no longer apply. It is used to update Service object in response to changes to Nodes.

I suspect the reason the cache was added was to avoid fetching Service objects again from the API service. However, we already have these objects cached in the serviceInformer. I believe we can entirely replace this cache with calls to the serviceInformer. We can track failed reconciles by key and fetch them from the informer.

It adds unnecessary complexity to the code and could easily become another source of subtle bugs.

andrewsykim commented 2 years ago

It's possible that bug fix would address your problem though? But there's a follow-up to be had where we remove the cache as you suggested

mdbooth commented 2 years ago

Related: kubernetes/kubernetes#107631 (review)

That's the incidental problem, not the main one, though!

mdbooth commented 2 years ago

It's possible that bug fix would address your problem though? But there's a follow-up to be had where we remove the cache as you suggested

No, I've confused this issue by discussing 2 related but different things. The serviceCache is simply confusing now. It's not the source of the bug.

mdbooth commented 2 years ago

You can see it here: https://github.com/kubernetes/kubernetes/blob/349900472a38a29fd6d85f7e4880d4f3d72ad6ee/staging/src/k8s.io/cloud-provider/controllers/service/controller.go#L346-L347

Note that we updated the cache with the service object we were passed from the informer, before passing the object from the informer to syncLoadBalancerIfNeeded(). The dirty object came from the informer, not the serviceCache. We never actually read from the serviceCache in this code path, only write to it.

mdbooth commented 2 years ago

I've thrown up a placeholder PR which hopefully demonstrates what I think the problem is. It's completely untested! I'll try to work on it properly tomorrow.

mdbooth commented 2 years ago

I have manually verified that https://github.com/kubernetes/kubernetes/pull/109601 fixes the issue: I'm now more confident that the issue is that we're modifying the object returned by the informer when we should not. I'll explain how I tested it in the PR, and look for a practical way to add an automated test.

mdbooth commented 2 years ago

I decided to quickly audit the other cloud-provider controllers looking for similar problematic uses of objects returned from an informer:

Node controller looks ok. Looks like it always copies or re-fetches before modifying.

Node lifecycle controller passes shallow-copied Node from informer to:

Route controller looks ok. Looks like it always copies or re-fetches before modifying.

This is quite hard to audit manually. It feels ripe for a 'safe' wrapper used consistently across the codebase. Without anything like rust's ability to separate our mutable from our immutable references automatically, it might be safer to simply always copy these objects before use.

andrewsykim commented 2 years ago

I think up to this point it was implied that the Service object passed from the controller should be read-only but agreed that passing a deep copy is probably the safer thing to do since it's not always obvious

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

kubernetes / cloud-provider

Service Controller can call provider EnsureLoadBalancer with a dirty Service object #59