kubernetes-sigs / cluster-api-provider-openstack

Cluster API implementation for OpenStack
https://cluster-api-openstack.sigs.k8s.io/
Apache License 2.0

LoadBalancerMember cannot be reconciled: network.APIServerLoadBalancer is not yet available in openStackCluster.Status #742

Closed MPV closed 3 years ago

MPV commented 3 years ago

/kind bug

What steps did you take and what happened:

Creating a cluster. It got stuck in reconciliation, with errors like this:

controllers/OpenStackMachine "msg"="Error state detected, skipping reconciliation" "cluster"="prod-1-cluster-1" "machine"="prod-1-kubeadm-control-plane-1-zglwk" "namespace"="prod-1" "openStackCluster"="prod-1-os-cluster-1" "openStackMachine"="prod-1-machine-tmpl-control-plane-1-sn99q"

controllers/OpenStackMachine "msg"="LoadBalancerMember cannot be reconciled: network.APIServerLoadBalancer is not yet available in openStackCluster.Status" "error"="UpdateError" "cluster"="prod-1-cluster-1" "machine"="prod-1-kubeadm-control-plane-1-zglwk" "namespace"="prod-1" "openStackCluster"="prod-1-os-cluster-1" "openStackMachine"="prod-1-machine-tmpl-control-plane-1-sn99q" "

controller-runtime/controller "msg"="Reconciler error" "error"="failed to reconcile load balancer: Expected HTTP response code [200 204 300] when accessing [GET https://load-balancer-xyz.our-provider.cloud:9876/v2.0/lbaas/listeners?name=k8s-clusterapi-cluster-prod-1-prod-1-cluster-1-kubeapi-6443], but got 502 instead\n\u003c!DOCTYPE HTML PUBLIC \"-//IETF//DTD HTML 2.0//EN\"\u003e\n\u003chtml\u003e\u003chead\u003e\n\u003ctitle\u003e502 Proxy Error\u003c/title\u003e\n\u003c/head\u003e\u003cbody\u003e\n\u003ch1\u003eProxy Error\u003c/h1\u003e\n\u003cp\u003eThe proxy server received an invalid\r\nresponse from an upstream server.\u003cbr /\u003e\r\nThe proxy server could not handle the request\u003cp\u003eReason: \u003cstrong\u003eError reading from remote server\u003c/strong\u003e\u003c/p\u003e\u003c/p\u003e\n\u003chr\u003e\n\u003caddress\u003eApache/2.4.29 (Ubuntu) Server at load-balancer-xyz.our-provider.cloud Port 9876\u003c/address\u003e\n\u003c/body\u003e\u003c/html\u003e\n" "controller"="openstackcluster" "name"="prod-1-os-cluster-1" "namespace"="prod-1""

What did you expect to happen:

I expected CAPO reconciliation for the cluster not to become blocked by a (temporary?) 502 from the load-balancer API.

Anything else you would like to add:

Environment:

hidekazuna commented 3 years ago

502 is a server-side error code. Please check the Load Balancer service or "the proxy server" recorded in the log.

jichenjc commented 3 years ago

Yes, it's likely not a CAPI issue; it might be related to the OpenStack services.

sbueringer commented 3 years ago

I assume we should be able to handle a single 502, because I would have expected that we retry the cluster reconciliation. Of course we cannot progress further if retries are not able to create the load balancer, as the following steps depend on a working LB.
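
Roughly what I have in mind, as a minimal sketch (not the actual CAPO code, names are illustrative): returning the error from `Reconcile` makes controller-runtime requeue the request with exponential backoff, so a temporary 502 only delays reconciliation instead of blocking it permanently.

```go
// Minimal sketch, not the actual CAPO code.
package controllers

import (
	"context"
	"errors"
	"fmt"

	ctrl "sigs.k8s.io/controller-runtime"
)

// errTransient stands in for a 502/503-style failure from the OpenStack API.
var errTransient = errors.New("502 proxy error from the load balancer API")

type clusterReconciler struct{}

func (r *clusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	if err := reconcileLoadBalancer(ctx); err != nil {
		// No terminal failure is recorded here; returning the error simply
		// triggers another reconcile attempt later, with backoff.
		return ctrl.Result{}, fmt.Errorf("failed to reconcile load balancer: %w", err)
	}
	return ctrl.Result{}, nil
}

// reconcileLoadBalancer is a stand-in for the real Octavia calls.
func reconcileLoadBalancer(ctx context.Context) error {
	return errTransient
}
```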

MPV commented 3 years ago

It's worth noting that we think we have a workaround: deleting each control-plane OpenStackMachine (one at a time) so a new one is recreated. After that we don't see this error again, and reconciliation isn't paused.

After doing the above we haven't seen this issue again yet (it showed up directly after creating new clusters, which we haven't done since). We will soon create new clusters again and can then report back to confirm whether this issue keeps happening over time (and wasn't just a temporary fluke on our provider's side).

MPV commented 3 years ago

I assume we should be able to handle a single 502, because I would have expected that we retry the cluster reconciliation. Of course we cannot progress further if retries are not able to create the load balancer, as the following steps depend on a working LB.

I agree. 👍

I see the error is being logged from here: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/1358307c47d97b5dd502b3317ddbc689c4b0983e/controllers/openstackmachine_controller.go#L328

Any ideas on what would need changing to keep retrying reconciliation?

sbueringer commented 3 years ago

@MPV Is that the first error you get? I would have guessed that the first error puts the machine into the error state, and from then on it hits this line.

So the question might be: what should we do the first time we hit this issue? Right now I think the machine gets into the error state and then there are no further retries.
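
To illustrate what I mean by "error state" (field names follow the Cluster API conventions; this is a sketch, not a copy of the CAPO controller code):

```go
// Illustrative sketch of the terminal "error state" mechanism.
package controllers

import capierrors "sigs.k8s.io/cluster-api/errors"

// machineStatus mimics the relevant part of an infrastructure machine's status.
type machineStatus struct {
	FailureReason  *capierrors.MachineStatusError
	FailureMessage *string
}

// hasTerminalFailure reports whether the machine was already marked as failed.
// Once this is true the controller logs "Error state detected, skipping
// reconciliation" and returns without retrying, which is why a single 502
// that sets these fields blocks the machine permanently.
func hasTerminalFailure(s machineStatus) bool {
	return s.FailureReason != nil || s.FailureMessage != nil
}
```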

MPV commented 3 years ago

@MPV Is that the first error you get? I would have guessed that the first error puts the machine into the error state, and from then on it hits this line.

So the question might be: what should we do the first time we hit this issue? Right now I think the machine gets into the error state and then there are no further retries.

Ah, my bad, in my last comment I was referring to the newest line (newest on top in the issue description).

The oldest log line was the one with `controller-runtime/controller "msg"="Reconciler error" "error"="failed to reconcile load balancer: Expected HTTP response code [200 204 300] when accessing ...`, so it comes from here instead: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/1358307c47d97b5dd502b3317ddbc689c4b0983e/controllers/openstackcluster_controller.go#L433
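
If it helps, here is a hedged sketch of how such a 5xx could be classified as retryable. It assumes gophercloud's `ErrUnexpectedResponseCode` error type (which carries the actual status code); a real implementation would also need to handle the wrapped `ErrDefaultXXX` variants.

```go
// Hedged sketch: treat a 5xx from the OpenStack API (like the 502 "Proxy
// Error" above) as transient, so the caller retries instead of recording a
// terminal failure.
package controllers

import "github.com/gophercloud/gophercloud"

func isTransientOpenStackError(err error) bool {
	if respErr, ok := err.(gophercloud.ErrUnexpectedResponseCode); ok {
		return respErr.Actual >= 500
	}
	return false
}
```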

Here's an overview of the lines we get in the capo-system namespace before these errors:

[Screenshot ("Skärmavbild 2021-02-22 kl 10 09 16", i.e. "Screenshot 2021-02-22 at 10:09:16"): overview of the log lines in the capo-system namespace preceding the errors]

sbueringer commented 3 years ago

@MPV Just that I get the timeline correctly:

  • The cluster controller was able to create the LB.

  • The machine controller tried to add an LB member for a new machine, got a 502 from OpenStack, and the machine went into "error state".

The problematic thing is:

  • The machine error state is a terminal state (afaik we trigger it by setting FailureReason and FailureMessage).

  • That means once the machine is in this state there is no way it will recover, i.e. the controller will not reconcile it again.

What should actually happen (imho):

  • We only enter the terminal error state when there is no way we can fix the machine by retrying, e.g. in my experience when the server in OpenStack gets into state ERROR.

Proposed fix:

  • Go over the code and ensure we only set FailureReason and FailureMessage when there is no way to recover via simple retries.

Recovering VMs from the real terminal error state should then be possible via the MachineHealthChecks
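
A rough sketch of what I mean, assuming the controller can ask OpenStack for the server status; type and function names here are illustrative, not the actual CAPO code:

```go
// Rough sketch of the proposed fix: only a genuinely unrecoverable condition
// (such as a Nova server in status ERROR) marks the machine as failed;
// everything else is returned as an ordinary error so the reconcile is
// retried with backoff.
package controllers

import capierrors "sigs.k8s.io/cluster-api/errors"

type openStackMachineStatus struct {
	FailureReason  *capierrors.MachineStatusError
	FailureMessage *string
}

// handleReconcileError is called with a non-nil err from a reconcile step.
func handleReconcileError(status *openStackMachineStatus, serverStatus string, err error) error {
	if serverStatus == "ERROR" {
		// Terminal: retrying will not fix a server stuck in ERROR, so record
		// the failure and let a MachineHealthCheck remediate the Machine.
		reason := capierrors.UpdateMachineError
		msg := err.Error()
		status.FailureReason = &reason
		status.FailureMessage = &msg
		return nil
	}
	// Transient (e.g. a 502 from the load balancer API): return the error so
	// controller-runtime requeues and retries.
	return err
}
```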

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

fejta-bot commented 3 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten

k8s-triage-robot commented 3 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community. /close

k8s-ci-robot commented 3 years ago

@k8s-triage-robot: Closing this issue.

In response to [this](https://github.com/kubernetes-sigs/cluster-api-provider-openstack/issues/742#issuecomment-886996633):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> Send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

MPV commented 3 years ago

@MPV

Just that I get the timeline correctly:

  • cluster controller was able to create the lb

  • machine controller tried to add lb member for a new machine, got a 502 from OpenStack and the machine went into "error state"

The problematic thing is:

  • machine error state is a terminal state (afaik we're triggering it by setting FailureReason and FailureMessage)

  • that means once the machine is in this state there is no way it will recover, i.e. the controller will not reconcile it again

What should actually happen (imho):

  • we only enter the terminal error state when there is no way we can fix the machine by retrying, e.g. in my experience when the server in OpenStack gets into state ERROR

Proposed fix:

  • go over the code and ensure we're only setting FailureReason and FailureMessage when there is no way to recover via simple retries

Recovering VMs from the real terminal error state should then be possible via the MachineHealthChecks

@sbueringer Yes, that makes sense and sounds like a sensible solution to me. Looks like the bot closed this. Still worth looking into though?

jichenjc commented 3 years ago

/reopen

This close was due to no activity for a long time :)

k8s-ci-robot commented 3 years ago

@jichenjc: Reopened this issue.

In response to [this](https://github.com/kubernetes-sigs/cluster-api-provider-openstack/issues/742#issuecomment-902384579):

> /reopen
>
> This close was due to no activity for a long time :)

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

k8s-triage-robot commented 3 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, `lifecycle/stale` is applied

  • After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied

  • After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

  • Reopen this issue or PR with `/reopen`

  • Mark this issue or PR as fresh with `/remove-lifecycle rotten`

  • Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot commented 3 years ago

@k8s-triage-robot: Closing this issue.

In response to [this](https://github.com/kubernetes-sigs/cluster-api-provider-openstack/issues/742#issuecomment-922406186):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues and PRs according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue or PR with `/reopen`
> - Mark this issue or PR as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.