Closed sairameshv closed 3 weeks ago
/cc @harche
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: sairameshv
The daemonset still seems to crash even after this fix, with the following log snippet. That may be expected, though, as I don't have a real GPU or the driver setup installed.
{"level":"info","ts":"2024-10-30T15:07:35.733845355Z","caller":"controller/instaslice_daemonset.go:151","msg":"creating allocation for ","controller":"InstaSliceDaemonSet","controllerGroup":"inference.redhat.com","controllerKind":"Instaslice","Instaslice":{"name":"kind-control-plane","namespace":"instaslice-system"},"namespace":"instaslice-system","name":"kind-control-plane","reconcileID":"ab775af5-e2e8-4e63-badf-740e95a48c0d","pod":"vectoradd-finalizer"}
{"level":"error","ts":"2024-10-30T15:07:35.735201984Z","caller":"controller/instaslice_daemonset.go:166","msg":"Unable to initialize NVML","controller":"InstaSliceDaemonSet","controllerGroup":"inference.redhat.com","controllerKind":"Instaslice","Instaslice":{"name":"kind-control-plane","namespace":"instaslice-system"},"namespace":"instaslice-system","name":"kind-control-plane","reconcileID":"ab775af5-e2e8-4e63-badf-740e95a48c0d","error":"ERROR_LIBRARY_NOT_FOUND","stacktrace":"github.com/openshift/instaslice-operator/internal/controller.(*InstaSliceDaemonsetReconciler).Reconcile\n\t/workspace/internal/controller/instaslice_daemonset.go:166\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:116\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:303\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:224"}
{"level":"error","ts":"2024-10-30T15:07:35.735422146Z","caller":"controller/instaslice_daemonset.go:179","msg":"Unable to get device count","controller":"InstaSliceDaemonSet","controllerGroup":"inference.redhat.com","controllerKind":"Instaslice","Instaslice":{"name":"kind-control-plane","namespace":"instaslice-system"},"namespace":"instaslice-system","name":"kind-control-plane","reconcileID":"ab775af5-e2e8-4e63-badf-740e95a48c0d","error":"ERROR_LIBRARY_NOT_FOUND","stacktrace":"github.com/openshift/instaslice-operator/internal/controller.(*InstaSliceDaemonsetReconciler).Reconcile\n\t/workspace/internal/controller/instaslice_daemonset.go:179\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:116\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:303\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:224"}
/daemonset: symbol lookup error: /daemonset: undefined symbol: nvmlDeviceGetHandleByUUID
It should handle such scenarios gracefully. Under no circumstances is a crash justified.
@sairameshv do we know why emulator mode is doing a gpu discovery call?
@asm582, this is not emulator mode. I was running various targets while adding code changes as part of this PR and suddenly encountered this panic.
I know there is no GPU to run the tests on, but I thought that scenario should be handled, so I raised a PR.
The "undefined symbol: nvmlDeviceGetHandleByUUID" error seems like a compilation or driver issue. Which image are you using?
I understand you don't have MIG GPUs in your cluster. I don't think running GPU mode without any GPUs helps us; the code is bound to fail.
@rphillips: /override requires failed status contexts, check run or a prowjob name to operate on. The following unknown contexts/checkruns were given:
/
Hat
Konflux
Red
dynamicacceleratorslicer-enterprise-contract
group
instaslice-operator
instaslice-operator-bundle
instaslice-operator-daemonset
panic-fix
pr
Only the following failed contexts/checkruns were expected:
Red Hat Konflux / dynamicacceleratorslicer-enterprise-contract / instaslice-operator
Red Hat Konflux / dynamicacceleratorslicer-enterprise-contract / instaslice-operator-bundle
Red Hat Konflux / dynamicacceleratorslicer-enterprise-contract / instaslice-operator-daemonset
Red Hat Konflux / dynamicacceleratorslicer-enterprise-contract / pr group panic-fix
ci/prow/lint
ci/prow/unit
pull-ci-openshift-instaslice-operator-main-lint
pull-ci-openshift-instaslice-operator-main-unit
tide
If you are trying to override a checkrun that has a space in it, you must put a double quote on the context.
/lgtm
@sairameshv: all tests passed!
I observed the following logs when I tried to run the e2e tests without emulator mode. I added a fix to resolve the panic.
/cc @asm582