Open bilbilmyc opened 8 months ago
I entered the pod and ran the following commands:
kubectl exec -it gpu-exporter-nvidia-gpu-exporter-2lh74 -n mayunchao bash
root@gpu-exporter-nvidia-gpu-exporter-2lh74:/# ll /usr/lib/x86_64-linux-gnu/libnvidia-ml.so
-rwxr-xr-x 1 root root 1784524 Nov 4 01:14 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so*
root@gpu-exporter-nvidia-gpu-exporter-2lh74:/# nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.
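The file listed above exists, yet nvidia-smi still reports that it cannot find the library. One way to narrow this down is to ask the dynamic loader directly. The sketch below is a hypothetical helper (`try_load` is not part of nvidia_gpu_exporter) that tries to open each candidate name through Python's `ctypes`. The candidate list, including the versioned `libnvidia-ml.so.1` name, is an assumption about what the driver installs, and the script assumes a Python interpreter is available wherever the same library paths are mounted.

```python
import ctypes

# Candidate names are assumptions; adjust them to match the driver install.
CANDIDATES = [
    "libnvidia-ml.so.1",
    "libnvidia-ml.so",
    "/usr/lib/x86_64-linux-gnu/libnvidia-ml.so",
]


def try_load(name):
    """Return (True, "") if the dynamic loader can open `name`, else (False, reason)."""
    try:
        ctypes.CDLL(name)
        return True, ""
    except OSError as exc:
        # The loader's own message usually says why the open failed.
        return False, str(exc)


if __name__ == "__main__":
    for name in CANDIDATES:
        ok, reason = try_load(name)
        print(f"{name}: {'loaded' if ok else 'FAILED: ' + reason}")
```

If a path that shows up in `ll` still fails here, the loader's error message is usually more specific than the generic nvidia-smi text.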
Describe the bug
I installed nvidia_gpu_exporter with Helm and changed only values.yml. The pod is in the Running state, but its log reports errors: error="command failed. stderr: err: exit status 12"
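For reference, the nvidia-smi documentation lists return code 12 as "NVML Shared Library couldn't be found or loaded", which is consistent with the console output above. The sketch below maps a few documented return codes to their meaning. `describe_exit_code` and `run_nvidia_smi` are hypothetical helper names, not part of the exporter, and the table is deliberately partial.

```python
import subprocess

# Partial table of return codes taken from the nvidia-smi documentation.
NVIDIA_SMI_EXIT_CODES = {
    0: "Success",
    9: "NVIDIA driver is not loaded",
    12: "NVML Shared Library couldn't be found or loaded",
    255: "Other error or internal driver error occurred",
}


def describe_exit_code(code):
    """Translate an nvidia-smi return code into a short description."""
    return NVIDIA_SMI_EXIT_CODES.get(code, f"Unrecognized return code {code}")


def run_nvidia_smi(command="nvidia-smi"):
    """Run `command` and return (exit_code, description).

    exit_code is None when the command could not be started at all.
    """
    try:
        result = subprocess.run([command], capture_output=True, text=True)
    except OSError as exc:
        return None, f"{command!r} could not be executed: {exc}"
    return result.returncode, describe_exit_code(result.returncode)


if __name__ == "__main__":
    code, description = run_nvidia_smi()
    print(f"exit code: {code} ({description})")
```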
To Reproduce
Steps to reproduce the behavior: install the chart from https://artifacthub.io/packages/helm/utkuozdemir/nvidia-gpu-exporter with the following values.yml:
imagePullSecrets: []
nameOverride: ""
fullnameOverride: ""
serviceAccount:
  create: true
  annotations: {}
  name: ""
podAnnotations: {}
podSecurityContext: {}
securityContext:
  privileged: true
service:
  type: NodePort
  port: 9835
  nodePort: 30235
ingress:
  enabled: false
  className: ""
  annotations: {}
  hosts:
resources: {}
nodeSelector: {}
tolerations: {}
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
port: 9835
hostPort:
  enabled: false
  port: 9835
log:
  level: info
  format: logfmt
queryFieldNames:
nvidiaSmiCommand: nvidia-smi
telemetryPath: /metrics
volumes:
volumeMounts:
serviceMonitor:
  enabled: false
  additionalLabels: {}
  scheme: http
  bearerTokenFile:
  interval:
  tlsConfig: {}
  proxyUrl: ""
  relabelings: []
  metricRelabelings: []
  scrapeTimeout: 10s
Expected behavior
I expect the pod to run properly and collect data.
Console output
Model and Version
NVIDIA GeForce RTX 4090
appVersion: 0.3.0, helm chart
helm
CentOS Linux release 7.9.2009 (Core)
NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1