openshift / machine-config-operator

Apache License 2.0
245 stars 401 forks source link

OCPBUGS-33129: Panic when we remove an OCB infra MCP and we try to create new ones with different names #4396

Closed inesqyx closed 2 months ago

inesqyx commented 3 months ago

Avoid non-existing MOSC de-reference.

openshift-ci-robot commented 3 months ago

@inesqyx: This pull request references Jira Issue OCPBUGS-33129, which is invalid:

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to [this](https://github.com/openshift/machine-config-operator/pull/4396): >WIP Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fmachine-config-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
openshift-ci[bot] commented 3 months ago

Skipping CI for Draft Pull Request. If you want CI signal for your change, please convert it to an actual PR. You can still manually trigger a test run with /test all

inesqyx commented 3 months ago

/test e2e-gcp-op-techpreview

inesqyx commented 3 months ago

/retest-required

inesqyx commented 3 months ago

/test-required

inesqyx commented 3 months ago

/jira refresh

openshift-ci-robot commented 3 months ago

@inesqyx: This pull request references Jira Issue OCPBUGS-33129, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug * bug is open, matching expected state (open) * bug target version (4.17.0) matches configured target version for branch (4.17.0) * bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @sergiordlr

In response to [this](https://github.com/openshift/machine-config-operator/pull/4396#issuecomment-2155552558): >/jira refresh Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fmachine-config-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
inesqyx commented 3 months ago

/test unit

inesqyx commented 2 months ago

/test e2e-gcp-op-techpreview

inesqyx commented 2 months ago

/retest-required

openshift-ci-robot commented 2 months ago

@inesqyx: This pull request references Jira Issue OCPBUGS-33129, which is valid.

3 validation(s) were run on this bug * bug is open, matching expected state (open) * bug target version (4.17.0) matches configured target version for branch (4.17.0) * bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @sergiordlr

In response to [this](https://github.com/openshift/machine-config-operator/pull/4396): >Avoid non-existing MOSC de-reference. Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fmachine-config-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
sergiordlr commented 2 months ago

I've tested the previous commit 8ecaa8d . The panic error is not in the machine-config-controller pod anymore, but we can find a panic in the machine-os-builder pod.

This is the panic logs

E0610 14:17:43.884756       1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 92 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1fd8000?, 0x3962a90})
        /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x85
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc000489ea0?})
        /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b
panic({0x1fd8000?, 0x3962a90?})
        /usr/lib/golang/src/runtime/panic.go:914 +0x21f
github.com/openshift/machine-config-operator/pkg/controller/common.(*MachineOSBuildState).IsBuilding(...)
        /go/src/github.com/openshift/machine-config-operator/pkg/controller/common/mos_state.go:71
github.com/openshift/machine-config-operator/pkg/controller/build.(*Controller).customBuildPodUpdater(0xc0007adea0, 0xc0008c4000)
        /go/src/github.com/openshift/machine-config-operator/pkg/controller/build/build_controller.go:430 +0x4b2
github.com/openshift/machine-config-operator/pkg/controller/build.(*PodBuildController).syncPod(0xc00037fe30, {0xc000591980, 0x57})
        /go/src/github.com/openshift/machine-config-operator/pkg/controller/build/pod_build_controller.go:159 +0x610
github.com/openshift/machine-config-operator/pkg/controller/build.(*PodBuildController).processNextWorkItem(0xc00037fe30)
        /go/src/github.com/openshift/machine-config-operator/pkg/controller/build/pod_build_controller.go:332 +0xc4
github.com/openshift/machine-config-operator/pkg/controller/build.(*PodBuildController).worker(...)
        /go/src/github.com/openshift/machine-config-operator/pkg/controller/build/pod_build_controller.go:321
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
        /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x0?, {0x27024a0, 0xc0007811d0}, 0x1, 0xc000165380)
        /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x0?, 0x0?)
        /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f
k8s.io/apimachinery/pkg/util/wait.Until(0x0?, 0x0?, 0x0?)
        /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161 +0x1e
created by github.com/openshift/machine-config-operator/pkg/controller/build.(*PodBuildController).Run in goroutine 45
        /go/src/github.com/openshift/machine-config-operator/pkg/controller/build/pod_build_controller.go:183 +0x1e5
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x148 pc=0x1ceda32]

I was able to reproduce it with commit 0311f03 .

It seems that we can only reproduce itermittently though. It took me 2 tries to reproduce the panic.

Same steps

  1. Create infra pool
  2. Create MOSC for infra pool
  3. Wait for build pod to run
  4. Remove MSOC
  5. A panic happens in machine-os-builder pod (instead of MCC).
sinnykumari commented 2 months ago

LGTM pending QE verification

inesqyx commented 2 months ago

I've tested the previous commit 8ecaa8d . The panic error is not in the machine-config-controller pod anymore, but we can find a panic in the machine-os-builder pod.

This is the panic logs

E0610 14:17:43.884756       1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 92 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1fd8000?, 0x3962a90})
        /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x85
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc000489ea0?})
        /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b
panic({0x1fd8000?, 0x3962a90?})
        /usr/lib/golang/src/runtime/panic.go:914 +0x21f
github.com/openshift/machine-config-operator/pkg/controller/common.(*MachineOSBuildState).IsBuilding(...)
        /go/src/github.com/openshift/machine-config-operator/pkg/controller/common/mos_state.go:71
github.com/openshift/machine-config-operator/pkg/controller/build.(*Controller).customBuildPodUpdater(0xc0007adea0, 0xc0008c4000)
        /go/src/github.com/openshift/machine-config-operator/pkg/controller/build/build_controller.go:430 +0x4b2
github.com/openshift/machine-config-operator/pkg/controller/build.(*PodBuildController).syncPod(0xc00037fe30, {0xc000591980, 0x57})
        /go/src/github.com/openshift/machine-config-operator/pkg/controller/build/pod_build_controller.go:159 +0x610
github.com/openshift/machine-config-operator/pkg/controller/build.(*PodBuildController).processNextWorkItem(0xc00037fe30)
        /go/src/github.com/openshift/machine-config-operator/pkg/controller/build/pod_build_controller.go:332 +0xc4
github.com/openshift/machine-config-operator/pkg/controller/build.(*PodBuildController).worker(...)
        /go/src/github.com/openshift/machine-config-operator/pkg/controller/build/pod_build_controller.go:321
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
        /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x0?, {0x27024a0, 0xc0007811d0}, 0x1, 0xc000165380)
        /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x0?, 0x0?)
        /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f
k8s.io/apimachinery/pkg/util/wait.Until(0x0?, 0x0?, 0x0?)
        /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161 +0x1e
created by github.com/openshift/machine-config-operator/pkg/controller/build.(*PodBuildController).Run in goroutine 45
        /go/src/github.com/openshift/machine-config-operator/pkg/controller/build/pod_build_controller.go:183 +0x1e5
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x148 pc=0x1ceda32]

I was able to reproduce it with commit 0311f03 .

It seems that we can only reproduce itermittently though. It took me 2 tries to reproduce the panic.

Same steps

  1. Create infra pool
  2. Create MOSC for infra pool
  3. Wait for build pod to run
  4. Remove MSOC
  5. A panic happens in machine-os-builder pod (instead of MCC).

I think I figured out where went wrong here. So two panic cases comes from: (1) MOSC got deleted during build, de-reference the MOSC to find the MOSB for updateMachineOSBuild panics (2) MOSC got deleted during build and MOSB got garbage collected, de-reference the MOSB to read build status panics The reason behind why MOSC is deleted in the middle if not manual and intentional needs further investigation.

inesqyx commented 2 months ago

Current behaviour:

inesqyx commented 2 months ago

/retest-required

sergiordlr commented 2 months ago

Verified using IPI on AWS

  1. Create an infra pool
  2. Create a MOSC resource pointing to the infra pool
  3. Wait for the build pod to be running
  4. Remove the the MOSC resource

No panic happened in the controller pod or the machine-os-builder pod.

We can see this log in the machine-os-builder pod instead of the panic

I0611 10:21:16.101690 1 pod_build_controller.go:296] Error syncing pod openshift-machine-config-operator/build-rendered-infra-ad00d24655a1d28893279f65a6b807f1: unable to update with build pod status: Missing MOSC/MOSB for pool infra

/label qe-approved

sinnykumari commented 2 months ago

Thanks Ines for working on the fix and Sergio for verifying the fix! /lgtm

sinnykumari commented 2 months ago

/approve

openshift-ci[bot] commented 2 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: inesqyx, sinnykumari

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/openshift/machine-config-operator/blob/master/OWNERS)~~ [sinnykumari] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
openshift-ci[bot] commented 2 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: inesqyx, sinnykumari

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/openshift/machine-config-operator/blob/master/OWNERS)~~ [sinnykumari] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
openshift-ci-robot commented 2 months ago

@inesqyx: Jira Issue OCPBUGS-33129: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-33129 has been moved to the MODIFIED state.

In response to [this](https://github.com/openshift/machine-config-operator/pull/4396): >Avoid non-existing MOSC de-reference. Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fmachine-config-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
openshift-ci[bot] commented 2 months ago

@inesqyx: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-ovn-upgrade-out-of-change 1f53051a2c28f7d4cbdefeae20825dab938bd474 link false /test e2e-aws-ovn-upgrade-out-of-change

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).
openshift-bot commented 2 months ago

[ART PR BUILD NOTIFIER]

This PR has been included in build ose-machine-config-operator-container-v4.17.0-202406111641.p0.g0ec5cd1.assembly.stream.el9 for distgit ose-machine-config-operator. All builds following this will include this PR.

inesqyx commented 2 months ago

/cherry-pick release-4.16

openshift-cherrypick-robot commented 2 months ago

@inesqyx: new pull request created: #4403

In response to [this](https://github.com/openshift/machine-config-operator/pull/4396#issuecomment-2161386011): >/cherry-pick release-4.16 Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.