Closed · @MPV closed this issue 3 years ago
502 is a server-side error code. Please check the Load Balancer service, or 'the proxy server' recorded in the log.
Yes, it's likely a CAPI issue, possibly related to the OpenStack services.
I assume we should be able to handle a single 502, because I would have expected that we retry the cluster reconciliation. Of course we cannot progress further when retries are not able to create the load balancer, as the following steps depend on a working LB.
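The retry behaviour described above can be sketched with a plain-Go backoff loop. This is only an illustration of the idea, not CAPO's actual mechanism (controller-runtime handles requeues with its own backoff); the step function and attempt counts here are hypothetical:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient stands in for a 502 from the OpenStack LB API.
var errTransient = errors.New("502 Proxy Error")

// retryReconcile retries a reconciliation step with exponential backoff,
// mirroring how repeated controller requeues would eventually succeed once
// a transient 502 clears.
func retryReconcile(step func(attempt int) error, maxAttempts int) error {
	delay := 100 * time.Millisecond
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err := step(attempt)
		if err == nil {
			return nil
		}
		if attempt == maxAttempts {
			return fmt.Errorf("giving up after %d attempts: %w", attempt, err)
		}
		time.Sleep(delay)
		delay *= 2 // exponential backoff between requeues
	}
	return nil
}

func main() {
	// Hypothetical LB step that returns a 502 twice, then succeeds.
	err := retryReconcile(func(attempt int) error {
		if attempt < 3 {
			return errTransient
		}
		return nil
	}, 5)
	fmt.Println("reconcile result:", err) // reconcile result: <nil>
}
```

The key point is that a single 502 only costs one requeue cycle; only an error that persists through every retry should block the cluster.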
It's worth noting that we think we have a workaround: deleting each OpenStackMachine
for the control plane so it recreates a new one (only one at a time). Then we don't see this error again, and reconciliation isn't paused.
After doing the above, we haven't seen this issue again (it showed up directly after creating new clusters, which we haven't done since). Soon we will create new clusters again and can then report back to confirm whether this issue keeps happening over time, or was just a temporary fluke on our supplier's side.
I agree. 👍
I see the error is being logged from here: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/1358307c47d97b5dd502b3317ddbc689c4b0983e/controllers/openstackmachine_controller.go#L328
Any ideas on what would need changing to keep retrying reconciliation?
@MPV Is that the first error you get? I would have guessed that the first error leads to a machine in error state, and from there it hits this line.
So the question might be: what should we do the first time we hit this issue? Right now I think the machine gets into the error state, and then there will be no further retries.
Ah, my bad: in my last comment I was referring to the newest line (newest on top in the issue description).
The oldest log line was the one with controller-runtime/controller "msg"="Reconciler error" "error"="failed to reconcile load balancer: Expected HTTP response code [200 204 300] when accessing, so from here instead: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/1358307c47d97b5dd502b3317ddbc689c4b0983e/controllers/openstackcluster_controller.go#L433
Here's an overview of the lines we get in the capo-system namespace before these errors:
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closing this issue.
@MPV Just so I get the timeline correct:
- the cluster controller was able to create the LB
- the machine controller tried to add an LB member for a new machine, got a 502 from OpenStack, and the machine went into "error state"

The problematic thing is:
- the machine error state is a terminal state (afaik we're triggering it by setting FailureReason and FailureMessage)
- that means once the machine is in this state there is no way it will recover: even when the controller reconciles again, it skips the machine

What should actually happen (imho):
- we only enter the terminal error state when there is no way we can fix the machine by retrying, e.g. in my experience when the server in OpenStack gets into state ERROR

Proposed fix:
- go over the code and ensure we're only setting FailureReason and FailureMessage when there is no way to recover via simple retries
- recovering VMs from the real terminal error state should then be possible via the MachineHealthChecks
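The proposed classification could look roughly like the sketch below. This is a minimal illustration under the assumptions of this thread, not CAPO's real code: the helper name, the status values, and the 5xx cutoff are all hypothetical. The actual fix would go through each place where FailureReason/FailureMessage are set.

```go
package main

import "fmt"

// shouldSetFailure reports whether an error condition is genuinely
// unrecoverable, i.e. whether setting FailureReason/FailureMessage
// (the terminal machine error state) is appropriate.
// The classification is illustrative only.
func shouldSetFailure(serverStatus string, httpStatus int) bool {
	// Transient API failures (HTTP 5xx, like the 502 in this issue) must
	// not mark the machine failed; returning an ordinary error instead
	// lets the controller requeue and retry.
	if httpStatus >= 500 && httpStatus <= 599 {
		return false
	}
	// A server stuck in OpenStack's ERROR state cannot be fixed by
	// retrying, so the terminal state is appropriate there.
	return serverStatus == "ERROR"
}

func main() {
	fmt.Println(shouldSetFailure("ACTIVE", 502)) // false: requeue and retry
	fmt.Println(shouldSetFailure("ERROR", 0))    // true: terminal, set FailureReason
}
```

Machines that do land in the real terminal state would then be recycled by MachineHealthChecks, as suggested above.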
@sbueringer Yes, that sounds like a sensible solution to me. Looks like the bot closed this; still worth looking into, though?
/reopen
this close is just due to no activity for a long time :)
@jichenjc: Reopened this issue.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closing this issue.
/kind bug
What steps did you take and what happened:
Creating a cluster. It got stuck in reconciliation, with errors like this:
controllers/OpenStackMachine "msg"="Error state detected, skipping reconciliation" "cluster"="prod-1-cluster-1" "machine"="prod-1-kubeadm-control-plane-1-zglwk" "namespace"="prod-1" "openStackCluster"="prod-1-os-cluster-1" "openStackMachine"="prod-1-machine-tmpl-control-plane-1-sn99q"
controllers/OpenStackMachine "msg"="LoadBalancerMember cannot be reconciled: network.APIServerLoadBalancer is not yet available in openStackCluster.Status" "error"="UpdateError" "cluster"="prod-1-cluster-1" "machine"="prod-1-kubeadm-control-plane-1-zglwk" "namespace"="prod-1" "openStackCluster"="prod-1-os-cluster-1" "openStackMachine"="prod-1-machine-tmpl-control-plane-1-sn99q" "
controller-runtime/controller "msg"="Reconciler error" "error"="failed to reconcile load balancer: Expected HTTP response code [200 204 300] when accessing [GET https://load-balancer-xyz.our-provider.cloud:9876/v2.0/lbaas/listeners?name=k8s-clusterapi-cluster-prod-1-prod-1-cluster-1-kubeapi-6443], but got 502 instead\n\u003c!DOCTYPE HTML PUBLIC \"-//IETF//DTD HTML 2.0//EN\"\u003e\n\u003chtml\u003e\u003chead\u003e\n\u003ctitle\u003e502 Proxy Error\u003c/title\u003e\n\u003c/head\u003e\u003cbody\u003e\n\u003ch1\u003eProxy Error\u003c/h1\u003e\n\u003cp\u003eThe proxy server received an invalid\r\nresponse from an upstream server.\u003cbr /\u003e\r\nThe proxy server could not handle the request\u003cp\u003eReason: \u003cstrong\u003eError reading from remote server\u003c/strong\u003e\u003c/p\u003e\u003c/p\u003e\n\u003chr\u003e\n\u003caddress\u003eApache/2.4.29 (Ubuntu) Server at load-balancer-xyz.our-provider.cloud Port 9876\u003c/address\u003e\n\u003c/body\u003e\u003c/html\u003e\n" "controller"="openstackcluster" "name"="prod-1-os-cluster-1" "namespace"="prod-1""
What did you expect to happen:
I expected CAPO reconciliation for the cluster not to become blocked by a (temporary?) 502 from the load-balancer API.
Anything else you would like to add:
Environment:
- Cluster API Provider OpenStack version (git rev-parse HEAD if manually built): TBC
- Kubernetes version (use kubectl version):
- OS (e.g. from /etc/os-release):