Closed inesqyx closed 2 months ago
@inesqyx: This pull request references Jira Issue OCPBUGS-33129, which is invalid:
Comment /jira refresh
to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.
The bug has been updated to refer to the pull request using the external bug tracker.
Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all
/test e2e-gcp-op-techpreview
/retest-required
/test-required
/jira refresh
@inesqyx: This pull request references Jira Issue OCPBUGS-33129, which is valid. The bug has been moved to the POST state.
Requesting review from QA contact: /cc @sergiordlr
/test unit
/test e2e-gcp-op-techpreview
/retest-required
@inesqyx: This pull request references Jira Issue OCPBUGS-33129, which is valid.
Requesting review from QA contact: /cc @sergiordlr
I've tested the previous commit 8ecaa8d. The panic is no longer in the machine-config-controller pod, but we can now find a panic in the machine-os-builder pod.
These are the panic logs:
E0610 14:17:43.884756 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 92 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1fd8000?, 0x3962a90})
/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x85
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc000489ea0?})
/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b
panic({0x1fd8000?, 0x3962a90?})
/usr/lib/golang/src/runtime/panic.go:914 +0x21f
github.com/openshift/machine-config-operator/pkg/controller/common.(*MachineOSBuildState).IsBuilding(...)
/go/src/github.com/openshift/machine-config-operator/pkg/controller/common/mos_state.go:71
github.com/openshift/machine-config-operator/pkg/controller/build.(*Controller).customBuildPodUpdater(0xc0007adea0, 0xc0008c4000)
/go/src/github.com/openshift/machine-config-operator/pkg/controller/build/build_controller.go:430 +0x4b2
github.com/openshift/machine-config-operator/pkg/controller/build.(*PodBuildController).syncPod(0xc00037fe30, {0xc000591980, 0x57})
/go/src/github.com/openshift/machine-config-operator/pkg/controller/build/pod_build_controller.go:159 +0x610
github.com/openshift/machine-config-operator/pkg/controller/build.(*PodBuildController).processNextWorkItem(0xc00037fe30)
/go/src/github.com/openshift/machine-config-operator/pkg/controller/build/pod_build_controller.go:332 +0xc4
github.com/openshift/machine-config-operator/pkg/controller/build.(*PodBuildController).worker(...)
/go/src/github.com/openshift/machine-config-operator/pkg/controller/build/pod_build_controller.go:321
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x0?, {0x27024a0, 0xc0007811d0}, 0x1, 0xc000165380)
/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x0?, 0x0?)
/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f
k8s.io/apimachinery/pkg/util/wait.Until(0x0?, 0x0?, 0x0?)
/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161 +0x1e
created by github.com/openshift/machine-config-operator/pkg/controller/build.(*PodBuildController).Run in goroutine 45
/go/src/github.com/openshift/machine-config-operator/pkg/controller/build/pod_build_controller.go:183 +0x1e5
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x148 pc=0x1ceda32]
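For context, a SIGSEGV like the one above is what Go produces when a method reads through a nil pointer. A minimal, self-contained sketch of the same failure mode (the types and field names here are hypothetical stand-ins, not the actual MCO definitions):

```go
package main

import "fmt"

// Hypothetical stand-ins for the MCO types -- not the real definitions.
type BuildStatus struct {
	Phase string
}

type MachineOSBuild struct {
	Status BuildStatus
}

type MachineOSBuildState struct {
	Build *MachineOSBuild
}

// IsBuilding dereferences s.Build without a nil check. If the build object
// has been deleted (Build == nil), reading Build.Status panics with
// "invalid memory address or nil pointer dereference", as in the trace above.
func (s *MachineOSBuildState) IsBuilding() bool {
	return s.Build.Status.Phase == "Building"
}

func main() {
	state := &MachineOSBuildState{Build: nil} // build object already gone

	// Recover so the demonstration prints the error instead of crashing,
	// mirroring what HandleCrash logs in the trace above.
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("observed a panic:", r)
		}
	}()
	fmt.Println(state.IsBuilding())
}
```

Running this prints the recovered runtime error rather than crashing, which is the same "Observed a panic" shape the k8s.io/apimachinery crash handler logs.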
I was able to reproduce it with commit 0311f03.
It seems that we can only reproduce it intermittently, though. It took me 2 tries to reproduce the panic.
LGTM pending QE verification
Same steps:
- Create infra pool
- Create MOSC for infra pool
- Wait for build pod to run
- Remove MOSC
- A panic happens in the machine-os-builder pod (instead of the MCC).
I think I figured out what went wrong here. The two panic cases come from:
- The MOSC got deleted during a build; de-referencing the MOSC to find the MOSB for updateMachineOSBuild panics.
- The MOSC got deleted during a build and the MOSB got garbage collected; de-referencing the MOSB to read the build status panics.
Why the MOSC gets deleted mid-build, when the deletion is not manual and intentional, still needs further investigation.
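A minimal sketch of the kind of nil guard that prevents both cases (all names here are hypothetical; the actual fix lives in the MCO build controller): check that the MOSC and MOSB still exist before dereferencing them, and surface a sync error the controller can log and retry on instead of panicking.

```go
package main

import "fmt"

// Hypothetical stand-ins for the MCO API objects -- not the real types.
type MachineOSConfig struct{ Name string }
type MachineOSBuild struct{ Phase string }

// syncBuildStatus guards against case (1), a MOSC deleted mid-build, and
// case (2), a MOSB garbage collected after the MOSC deletion. Instead of
// dereferencing a nil pointer, it returns an error for the sync loop.
func syncBuildStatus(pool string, mosc *MachineOSConfig, mosb *MachineOSBuild) error {
	if mosc == nil || mosb == nil {
		return fmt.Errorf("missing MOSC/MOSB for pool %s", pool)
	}
	// Both objects exist; dereferencing is now safe.
	fmt.Printf("pool %s: MOSC %s, build phase %s\n", pool, mosc.Name, mosb.Phase)
	return nil
}

func main() {
	// MOSC removed while the build pod was running: an error, not a panic.
	if err := syncBuildStatus("infra", nil, nil); err != nil {
		fmt.Println("error syncing pod:", err)
	}
}
```

With a guard like this, deleting the MOSC mid-build produces a retryable sync error (the same shape as the "Missing MOSC/MOSB for pool infra" log seen after the fix) rather than a crash of the machine-os-builder pod.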
Current behaviour:
/retest-required
Verified using IPI on AWS
No panic happened in the controller pod or the machine-os-builder pod.
Instead of the panic, we can see this log in the machine-os-builder pod:
I0611 10:21:16.101690 1 pod_build_controller.go:296] Error syncing pod openshift-machine-config-operator/build-rendered-infra-ad00d24655a1d28893279f65a6b807f1: unable to update with build pod status: Missing MOSC/MOSB for pool infra
/label qe-approved
Thanks Ines for working on the fix and Sergio for verifying the fix! /lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: inesqyx, sinnykumari
The full list of commands accepted by this bot can be found here.
The pull request process is described here
@inesqyx: Jira Issue OCPBUGS-33129: All pull requests linked via external trackers have merged:
Jira Issue OCPBUGS-33129 has been moved to the MODIFIED state.
@inesqyx: The following test failed, say /retest
to rerun all failed tests or /retest-required
to rerun all mandatory failed tests:
Test name | Commit | Details | Required | Rerun command
---|---|---|---|---
ci/prow/e2e-aws-ovn-upgrade-out-of-change | 1f53051a2c28f7d4cbdefeae20825dab938bd474 | link | false | /test e2e-aws-ovn-upgrade-out-of-change
Full PR test history. Your PR dashboard.
[ART PR BUILD NOTIFIER]
This PR has been included in build ose-machine-config-operator-container-v4.17.0-202406111641.p0.g0ec5cd1.assembly.stream.el9 for distgit ose-machine-config-operator. All builds following this will include this PR.
/cherry-pick release-4.16
@inesqyx: new pull request created: #4403
Avoid non-existing MOSC de-reference.