openshift-kni / lifecycle-agent

Local agent for orchestration of SNO Image Based Upgrade
Apache License 2.0

OCPBUGS-32493: improve prep stage error handling to allow for unexpected apiserver or network issues #541

Closed · pixelsoccupied closed this 1 month ago

pixelsoccupied commented 1 month ago

Background / Context

The cluster can run into unexpected network issues, and it's up to the controller to eventually reach the desired state.

Issue / Requirement / Reason for change

We ran into a network outage around the time the Prep stage was trying to Get a deployment CR for the first time. Because it was the first Get, the controller needed to reach the API server to initialize a new informer, at which point we see:

```
failed to get API group resources: unable to retrieve the complete list of server APIs: apps/v1: Get "https://[fd02::1]:443/apis/apps/v1": dial tcp [fd02::1]:443: connect: connection refuse
```

Solution / Feature Overview

The controller should allow time for the API server outage to heal, retrying until it does.

Implementation Details

- We now requeue with an error for functions that need to reach the API server. For `validateIBUSpec` we are selective about requeueing, since an error there is more likely caused by the user than by the network.
- Health checks have been moved out so they run on every reconcile. During normal operation reconciles stay fast, and while the cluster is healing the controller simply requeues before checking the `Prep` stage state.
- Increased logging to make the flow easier to trace.
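The diff itself isn't shown in this thread; the sketch below only illustrates the requeue pattern described above, using controller-runtime semantics (a returned error requeues the request with backoff, an empty result with no error stops retrying). `handlePrep`, `specValid`, and the object key are hypothetical names, not the PR's actual code.

```go
package controllers

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// handlePrep is a hypothetical helper, not the repo's actual Prep handler.
func handlePrep(ctx context.Context, c client.Client, specValid bool) (ctrl.Result, error) {
	// Terminal, user-caused problem (e.g. an invalid IBU spec): surface it in
	// status and stop, rather than retrying forever.
	if !specValid {
		return ctrl.Result{}, nil
	}

	// A call that needs the API server. If the server is unreachable (the
	// "connection refused" case from this bug), return the error so
	// controller-runtime requeues the request with exponential backoff,
	// giving the outage time to heal.
	deploy := &appsv1.Deployment{}
	key := types.NamespacedName{Namespace: "example-ns", Name: "example"} // placeholder object
	if err := c.Get(ctx, key, deploy); err != nil {
		return ctrl.Result{}, fmt.Errorf("failed to get deployment %s, requeueing: %w", key, err)
	}

	return ctrl.Result{}, nil
}
```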

Other Information

- As part of the increased logging, I added a way to update the logger name, which should let us quickly identify stage-specific logs (this could go in another PR, but it really helped with this bug).
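The thread doesn't show how the logger name is actually updated; as a rough illustration of the idea, controller-runtime's logr-based logger can be tagged per stage with `WithName`, so `Prep`-specific messages are easy to pick out (the names below are hypothetical):

```go
package main

import (
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func main() {
	logger := zap.New()                 // controller-runtime's zap-backed logr.Logger
	prepLog := logger.WithName("Prep")  // every message now carries the "Prep" logger name
	prepLog.Info("validating IBU spec") // e.g. shows up with logger=Prep in the output
}
```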

Tip

To force an API outage

```
# bring down
mv /etc/kubernetes/manifests/kube-apiserver-pod.yaml /tmp/kube-apiserver-pod.yaml

# bring up
mv /tmp/kube-apiserver-pod.yaml /etc/kubernetes/manifests/kube-apiserver-pod.yaml
```

(At this point you need to check the logs with crictl.)
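For example (exact container IDs vary per node; this is just one way to find the logs):

```
# with the static pod manifest removed, oc/kubectl can't reach the cluster,
# so inspect the container runtime directly
sudo crictl ps -a | grep kube-apiserver
sudo crictl logs <kube-apiserver-container-id>
```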

/cc @donpenney

openshift-ci-robot commented 1 month ago

@pixelsoccupied: This pull request references Jira Issue OCPBUGS-32493, which is invalid:

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to [this](https://github.com/openshift-kni/lifecycle-agent/pull/541) (an earlier revision of the PR description above).

Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift-kni%2Flifecycle-agent). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
openshift-ci-robot commented 1 month ago

@pixelsoccupied: This pull request references Jira Issue OCPBUGS-32493, which is invalid:

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to [this](https://github.com/openshift-kni/lifecycle-agent/pull/541):

> # Background / Context
>
> Cluster can expect to run into unexpected network issues and it's up to the controller to eventually reach desired state
>
> # Issue / Requirement / Reason for change
>
> We ran into network outage around the time when `Prep` was trying to `Get` a `deployment` CR for the first time. Because it was the first time...controller needed to reach out to the API server in order to init a new informer. At which point we see
> ```
> failed to get API group resources: unable to retrieve the complete list of server APIs: apps/v1: Get "https://[fd02::1]:443/apis/apps/v1": dial tcp [fd02::1]:443: connect: connection refuse
> ```
>
> # Solution / Feature Overview
>
> Controller should allow time for the API server outage to heal and retry until then
>
> # Implementation Details
>
> - We are now requeue with error for functions that needs to reach to API server. For `validateIBUSpec` we are being selective with the requeue since it's more likely from user rather than network
> - Health checks are now moved out to be called at every reconcile. During normal operations reconcile still fast and when the cluster is healing it will simply requeue before checking the `Prep` stage state.
> - Increase the logs to make it easier to trace
>
> # Other Information
>
> - As part of the increased logging...I added a way to update the logger name which should now allow us quickly identify stage specific logs (can push to it another PR but this really helped me with this bug)
>
> Tip
>
> To force an API outage
> ```
> # bring down
> mv /etc/kubernetes/manifests/kube-apiserver-pod.yaml /tmp/kube-apiserver-pod.yaml
>
> # bring up
> mv /tmp/kube-apiserver-pod.yaml /etc/kubernetes/manifests/kube-apiserver-pod.yaml
> ```
>
> (At this you need to check with crictl for logs)
>
> /cc @donpenney

Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift-kni%2Flifecycle-agent). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
pixelsoccupied commented 1 month ago

/cc @Missxiaoguo

As discussed, we are being a bit more selective with `validateIBUSpec`, since an error from it is most likely caused by user error.

But we are still requeueing on known network errors (about 4 types), and those of course cover the error we are focusing on here, i.e. connection refused because the API server is down.
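The exact four error types aren't listed in the thread, so the helper below is only an assumption of what such a check could look like; the function name and the specific checks are illustrative, not the PR's code.

```go
package controllers

import (
	"errors"
	"os"
	"syscall"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// isRetriableAPIError is a hypothetical classifier: transient API-server or
// network failures are worth a requeue, anything else is treated as permanent.
func isRetriableAPIError(err error) bool {
	switch {
	case errors.Is(err, syscall.ECONNREFUSED): // connection refused, e.g. API server down
		return true
	case os.IsTimeout(err): // dial or request timeout
		return true
	case apierrors.IsServiceUnavailable(err): // HTTP 503 from the API server
		return true
	case apierrors.IsServerTimeout(err) || apierrors.IsTooManyRequests(err): // overloaded / throttled
		return true
	default:
		return false
	}
}
```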

pixelsoccupied commented 1 month ago

/hold

doing a bit of testing with the new change

pixelsoccupied commented 1 month ago

/unhold

pixelsoccupied commented 1 month ago

/jira refresh

openshift-ci-robot commented 1 month ago

@pixelsoccupied: This pull request references Jira Issue OCPBUGS-32493, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug:

* bug is open, matching expected state (open)
* bug target version (4.17.0) matches configured target version for branch (4.17.0)
* bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @yliu127

In response to [this](https://github.com/openshift-kni/lifecycle-agent/pull/541#issuecomment-2134036091):

> /jira refresh

Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift-kni%2Flifecycle-agent). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
jc-rh commented 1 month ago

/retest

Missxiaoguo commented 1 month ago

/lgtm

jc-rh commented 1 month ago

/approve

openshift-ci[bot] commented 1 month ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jc-rh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/openshift-kni/lifecycle-agent/blob/main/OWNERS)~~ [jc-rh]

Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel approval by writing `/approve cancel` in a comment.
openshift-ci-robot commented 1 month ago

@pixelsoccupied: Jira Issue OCPBUGS-32493: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-32493 has been moved to the MODIFIED state.

In response to [this](https://github.com/openshift-kni/lifecycle-agent/pull/541) (the PR description quoted in full above).

Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift-kni%2Flifecycle-agent). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.