Closed hzproe closed 3 weeks ago
Normally the package related to creating new sessions should create a new session / login when a client gets requested.
As this is just a more informal message from the keepalive handler, are there other messages which indicate that CAPV does not work anymore?
Do you still get reconciliation log messages?
Just to be sure: are you using v1.9.1 or v1.9.0? (The issue states both.)
Hi, we have no context around it; we just see "REST client session expired, clearing session" twice and that's it.
Afterward we see the message below repeated, but no further activity in the log that contacts the vCenter.
In a TCP dump we can see the keepalive is still happening, but the session to the vCenter is already dead and can't be reestablished.
2024-02-28 08:33:01.870 | {"message":"I0228 07:33:01.870579 1 vimmachine.go:385] \"Updated VSphereVM\" controller=\"vspheremachine\" controllerGroup=\"infrastructure.cluster.x-k8s.io\" controllerKind=\"VSphereMachine
We are currently running on 1.9.1.
I face a similar issue: when I try to delete a cluster, nothing happens until I use clusterctl to delete the vSphere provider and reinstall it. I use v1.9.1.
For us it is good enough to restart the CAPV controller, as this reestablishes the connection.
Hi, thanks for your fast response. So you just delete the pod?
Yes, just delete the pod or restart the deployment.
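For anyone else landing here: the restart can be scripted. This is a minimal sketch assuming the default clusterctl install names (namespace capv-system, deployment capv-controller-manager); adjust if your install uses different names.

```shell
#!/bin/sh
# Assumed names for a default clusterctl-managed CAPV install; adjust as needed.
NS=capv-system
DEPLOY=capv-controller-manager

restart_capv() {
  # A rolling restart replaces the pod; the new pod opens a fresh vCenter session.
  kubectl -n "$NS" rollout restart "deployment/$DEPLOY" &&
    kubectl -n "$NS" rollout status "deployment/$DEPLOY" --timeout=120s
}
```

Calling restart_capv once the "REST client session expired" message shows up restores reconciliation until the session expires again.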
I will try that. I'm going to dig through the logs and check if I gain more insights.
To figure out whether it is the keepalive that breaks the functionality for both of you: you could try to disable the keepalive handler by adding the flag --enable-keep-alive=false.
Probably related change:
Which got backported to >= v1.8.5 and >= v1.7.5
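For reference, the flag goes into the controller Deployment's container args. A hypothetical fragment (container and deployment names assume the default clusterctl-managed install; your manifest may differ):

```yaml
# Fragment of the CAPV controller Deployment (namespace capv-system, assumed defaults)
spec:
  template:
    spec:
      containers:
        - name: manager
          args:
            - --enable-keep-alive=false   # disable the keepalive handler
```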
Thanks, will try that. Seems obvious, but we were going in the other direction of a more aggressive keepalive (5 minutes down to 3 minutes), which didn't help. Disabling it might be a better solution.
We are trying it and will come back after 24h + 1 minute.
Thank you so much
The session logout issue is caused by the underlying govmomi package; more details here: https://github.com/vmware/govmomi/issues/3240
In CAPV, fixes to mitigate the impact are discussed in an ongoing PR: https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/pull/2601
The current workaround is to restart CAPV when it gets stuck, or to disable keepalive.
Totally forgot about that one. Thanks for linking it!
Hi @chrischdi @zhanggbj you were correct, this workaround fixed the issue for us. Thank you very much.
Works for me too, thank you very much!
Still curious how this happened, though (to reproduce it).
The local installation over here does not hit that issue (it only has a single workload cluster, running CAPV v1.9.1).
Hi, we did try a demo install: a kind cluster + CAPV 1.9.1, and we observed the same issue when the --enable-keep-alive flag is set in the deployment.
You get the session expired message and can no longer create or delete machines; they will be stuck indefinitely until someone reestablishes the session by restarting the CAPV controller.
BR Heinz
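For others checking whether they are hitting the same failure mode, grepping the controller logs for the session message is enough. A sketch, assuming the default install names (capv-system / capv-controller-manager):

```shell
#!/bin/sh
# Assumed default install names; adjust for your environment.
NS=capv-system
DEPLOY=capv-controller-manager

check_session_expired() {
  # Returns success if the symptom message appears in the recent controller logs.
  kubectl -n "$NS" logs "deployment/$DEPLOY" --since=24h \
    | grep -q "REST client session expired, clearing session"
}
```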
--enable-keep-alive=false solves this issue.
Keepalive is true by default in 1.9.3 and the previous versions. See https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/blob/v1.9.3/pkg/constants/constants.go#L56
It is false by default in 1.10.0. See https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/blob/v1.10.0/pkg/constants/constants.go#L56 and https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/blob/v1.10.0/main.go#L151
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten

Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
/kind bug
What steps did you take and what happened: We upgraded CAPV from 1.6.1 to 1.9.0. The connection to vCenter fails after 24 hours (86400s); the only hint we can find is in the informational logs of the CAPV provider.
Our assumption is that the session handler doesn't reestablish a connection after it fails after 1 day.
2024-02-27 08:56:59.635 | {"logtag":"F","logstash_prefix":"logstream-*********-capv-system","message":"I0227 07:56:59.635697 1 session.go:298] \"REST client session expired, clearing session\" controller=\"vspherevm\" controllerGroup=\"infrastructure.cluster.x-k8s.io\" controllerKind=\"VSphereVM\" VSphereVM=\"z1
What did you expect to happen: Reconcile for all vSphere machines stops, as the connection to the vCenter does not get reestablished anymore.
Anything else you would like to add: We upgraded from 1.6.1, where this was working, to 1.9.0, where we observed the first error. We tried downgrading to 1.8.7 with no improvement. We tried upgrading to 1.9.1 with no improvement.
Environment:
- Kubernetes version (use kubectl version): Server Version: v1.28.6
- OS (e.g. from /etc/os-release): DISTRIB_ID=Ubuntu DISTRIB_RELEASE=22.04 DISTRIB_CODENAME=jammy DISTRIB_DESCRIPTION="Ubuntu 22.04.3 LTS" PRETTY_NAME="Ubuntu 22.04.3 LTS" NAME="Ubuntu"