openshift / instaslice-operator

InstaSlice Operator facilitates slicing of accelerators using stable APIs
Apache License 2.0

Fix the daemonset crashing #205

Closed sairameshv closed 3 weeks ago

sairameshv commented 3 weeks ago

I observed the following logs when I tried to run the e2e tests without emulator mode. I added a fix to resolve the panic.

{"level":"info","ts":"2024-10-30T14:55:25.124049928Z","caller":"controller/instaslice_daemonset.go:437","msg":"classical resources obtained are ","cpu":16,"memory":65971478528}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x130 pc=0x19625f2]

goroutine 38 [running]:
github.com/openshift/instaslice-operator/internal/controller.(*InstaSliceDaemonsetReconciler).discoverMigEnabledGpuWithSlices(0xc00067a000)
    /workspace/internal/controller/instaslice_daemonset.go:444 +0x232
github.com/openshift/instaslice-operator/internal/controller.(*InstaSliceDaemonsetReconciler).SetupWithManager.func1({0x210e2d0, 0xc000159590})
    /workspace/internal/controller/instaslice_daemonset.go:361 +0x1d3
sigs.k8s.io/controller-runtime/pkg/manager.RunnableFunc.Start(0x210e2d0?, {0x210e2d0?, 0xc000159590?})
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/manager/manager.go:307 +0x26
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1(0xc000631760)
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/manager/runnable_group.go:226 +0xc8
created by sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile in goroutine 89
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/manager/runnable_group.go:210 +0x19d

/cc @asm582

sairameshv commented 3 weeks ago

/cc @harche

openshift-ci[bot] commented 3 weeks ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sairameshv

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/openshift/instaslice-operator/blob/main/OWNERS)~~ [sairameshv]

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.

sairameshv commented 3 weeks ago

The daemonset still seems to crash after this fix, with the log snippet below. That may be expected, though, since I don't have a real GPU or the related setup installed.

{"level":"info","ts":"2024-10-30T15:07:35.733845355Z","caller":"controller/instaslice_daemonset.go:151","msg":"creating allocation for ","controller":"InstaSliceDaemonSet","controllerGroup":"inference.redhat.com","controllerKind":"Instaslice","Instaslice":{"name":"kind-control-plane","namespace":"instaslice-system"},"namespace":"instaslice-system","name":"kind-control-plane","reconcileID":"ab775af5-e2e8-4e63-badf-740e95a48c0d","pod":"vectoradd-finalizer"}
{"level":"error","ts":"2024-10-30T15:07:35.735201984Z","caller":"controller/instaslice_daemonset.go:166","msg":"Unable to initialize NVML","controller":"InstaSliceDaemonSet","controllerGroup":"inference.redhat.com","controllerKind":"Instaslice","Instaslice":{"name":"kind-control-plane","namespace":"instaslice-system"},"namespace":"instaslice-system","name":"kind-control-plane","reconcileID":"ab775af5-e2e8-4e63-badf-740e95a48c0d","error":"ERROR_LIBRARY_NOT_FOUND","stacktrace":"github.com/openshift/instaslice-operator/internal/controller.(*InstaSliceDaemonsetReconciler).Reconcile\n\t/workspace/internal/controller/instaslice_daemonset.go:166\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:116\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:303\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:224"}
{"level":"error","ts":"2024-10-30T15:07:35.735422146Z","caller":"controller/instaslice_daemonset.go:179","msg":"Unable to get device count","controller":"InstaSliceDaemonSet","controllerGroup":"inference.redhat.com","controllerKind":"Instaslice","Instaslice":{"name":"kind-control-plane","namespace":"instaslice-system"},"namespace":"instaslice-system","name":"kind-control-plane","reconcileID":"ab775af5-e2e8-4e63-badf-740e95a48c0d","error":"ERROR_LIBRARY_NOT_FOUND","stacktrace":"github.com/openshift/instaslice-operator/internal/controller.(*InstaSliceDaemonsetReconciler).Reconcile\n\t/workspace/internal/controller/instaslice_daemonset.go:179\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:116\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:303\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:224"}
/daemonset: symbol lookup error: /daemonset: undefined symbol: nvmlDeviceGetHandleByUUID
harche commented 3 weeks ago

> The daemonset still seems to crash after this fix, with the log snippet below. That may be expected, though, since I don't have a real GPU or the related setup installed.

It should handle such scenarios gracefully. Under no circumstances is a crash justified.

asm582 commented 3 weeks ago

@sairameshv, do we know why emulator mode is making a GPU discovery call?

sairameshv commented 3 weeks ago

> @sairameshv, do we know why emulator mode is making a GPU discovery call?

@asm582, this is not emulator mode. I was running various targets while making code changes for this PR and unexpectedly hit this panic.

I know there is no GPU to run the tests on, but I thought that scenario should be handled, so I raised a PR.

rphillips commented 3 weeks ago

`undefined symbol: nvmlDeviceGetHandleByUUID` looks like a compilation or driver issue. Which image are you using?

asm582 commented 3 weeks ago

> > @sairameshv, do we know why emulator mode is making a GPU discovery call?
>
> @asm582, this is not emulator mode. I was running various targets while making code changes for this PR and unexpectedly hit this panic.
>
> I know there is no GPU to run the tests on, but I thought that scenario should be handled, so I raised a PR.

I understand you don't have MIG GPUs in your cluster. I don't think running GPU mode without any GPUs helps us; the code is bound to fail.

openshift-ci[bot] commented 3 weeks ago

@rphillips: /override requires failed status contexts, check run or a prowjob name to operate on. The following unknown contexts/checkruns were given:

Only the following failed contexts/checkruns were expected:

If you are trying to override a checkrun that has a space in it, you must put a double quote on the context.

In response to [this](https://github.com/openshift/instaslice-operator/pull/205#issuecomment-2447931444):

>/override Red Hat Konflux / dynamicacceleratorslicer-enterprise-contract / pr group panic-fix
>/override Red Hat Konflux / Red Hat Konflux / dynamicacceleratorslicer-enterprise-contract / instaslice-operator-daemonset
>/override Red Hat Konflux / Red Hat Konflux / dynamicacceleratorslicer-enterprise-contract / instaslice-operator-bundle
>/override Red Hat Konflux / Red Hat Konflux / dynamicacceleratorslicer-enterprise-contract / instaslice-operator

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

asm582 commented 3 weeks ago

/lgtm

openshift-ci[bot] commented 3 weeks ago

@sairameshv: all tests passed!
