kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle
https://cluster-api.sigs.k8s.io
Apache License 2.0
3.59k stars 1.32k forks source link

Watches on remote cluster expires every 10s #8893

Open fabriziopandini opened 1 year ago

fabriziopandini commented 1 year ago

What steps did you take and what happened?

When we create watches on the remote cluster using the cluster cache tracker, watches are created using the same rest config that the cache uses, and currently this rest config has a timeout of 10s.

This leads to having watches on remote clusters expire every 10s, and even if this doesn't seem an issue thanks to watch bookmarks, this isn't ideal and we should figure out a way to have a longer timeout for the cache than the one used for the remote clients.

What did you expect to happen?

Cluster API to create watches on remote machines with longer timeouts

Cluster API version

main, old versions

Kubernetes version

No response

Anything else you would like to add?

The tricky part of this is that we want to keep the 10-second/short timeout for the remote client, and we also want to use it while creating a new remote client because this timeout is blocking a reconcile loop in a synchronous way.

However, unfortunately, when creating a new remote client we are also creating indexes on the cache, and thus a longer timeout on the cache will block unless we find a way to initiate indexes asyc

Label(s) to be applied

/kind bug One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

sbueringer commented 1 year ago

I think the problem is bigger than just us adding indexes on client creation. The same can happen for every Get/List call in our reconcilers. All of them can trigger the creation of an informer.

I think we have to take a closer look how all of this works in CR. I think there could be a way to use a different http client for the synchronous calls during informer creation vs ListWatch. Code is roughly here: https://github.com/kubernetes-sigs/controller-runtime/blob/main/pkg/cache/internal/informers.go#L377-L503

This could e.g. work by passing into the cache:

POC: https://github.com/sbueringer/cluster-api/pull/151/files#

But not sure if that covers everything relevant. It should be definitely cleaner to implement this in CR. I think my current POC is just exploting the current behavior which is not guaranteed to stay the same.

sbueringer commented 1 year ago

/triage accepted

fabriziopandini commented 7 months ago

/priority important-soon

fabriziopandini commented 6 months ago

/assign @sbueringer please ping me when you do the investigation

k8s-triage-robot commented 3 months ago

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged. Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

sbueringer commented 3 months ago

/triage accepted

k8s-triage-robot commented 3 weeks ago

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged. Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

sbueringer commented 3 weeks ago

/triage accepted