Closed kangzemin closed 1 month ago
@kangzemin The GPU resource information depends on metrics in Prometheus, requiring a Prometheus service in the cluster and providing metrics such as "nvidia_gpu_duty_cycle." For reference, see: https://github.com/kubeflow/arena/blob/master/pkg/apis/types/gpu_metric.go
Thank you for your guidance!
When I deployed the exporter following https://github.com/kubeflow/arena/blob/master/docs/top/prometheus.md, I got an error:
So I edited kubernetes-artifacts/prometheus/gpu-exporter.yaml and deleted lines 26 and 30 (type: FileOrCreate).
Now the pod is running.
But the exporter pod log shows:
time="2024-05-11T03:28:50Z" level=info msg="runtime is docker"
{"level":"error","msg":"GetDriverVersion(): 535.161.07","time":"2024-05-11T03:28:50Z"}
Is there something wrong?
Kubernetes version:
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.25.4-1", GitCommit:"f23e643ebd790a62a54b376116d094a732f28263", GitTreeState:"archive", BuildDate:"2023-02-02T00:35:13Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.25.4-1", GitCommit:"f23e643ebd790a62a54b376116d094a732f28263", GitTreeState:"archive", BuildDate:"2023-02-01T01:11:45Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}
Note: my Kubernetes runtime is containerd. nvidia-smi output:
NVIDIA-SMI 535.161.07 Driver Version: 535.161.07 CUDA Version: 12.2
@Syulin7
@kangzemin You need to mount the node's containerd.sock to /run/containerd/containerd.sock inside the container.
@Syulin7 OK, I mounted the node's /run/containerd/containerd.sock to /run/containerd/containerd.sock inside the container, and the exporter pod is running.
But the exporter pod log still shows an error:
time="2024-05-14T06:36:07Z" level=info msg="runtime is containerd"
{"level":"error","msg":"GetDriverVersion(): 535.161.07","time":"2024-05-14T06:36:07Z"}
The query against Prometheus returns an empty result:
kubectl get --raw '/api/v1/namespaces/arena-system/services/prometheus-svc:prometheus/proxy/api/v1/query?query=nvidia_gpu_num_devices'
{"status":"success","data":{"resultType":"vector","result":[]}}
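A response with "status":"success" and an empty "result" list means the query itself ran but Prometheus holds no samples for that metric, which points at scraping or the exporter rather than at the query. A minimal Python sketch of that check, assuming the JSON was saved from the kubectl command above; the function name `has_samples` is hypothetical:

```python
import json


def has_samples(response_text: str) -> bool:
    """Return True if a Prometheus instant-query response contains samples."""
    body = json.loads(response_text)
    if body.get("status") != "success":
        # Prometheus reports failed queries with status "error" and an "error" message.
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")
    return len(body["data"]["result"]) > 0


# The response reported above: the query succeeded but matched no series.
empty = '{"status":"success","data":{"resultType":"vector","result":[]}}'
print(has_samples(empty))  # False
```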
@kangzemin Execute the following command to check if node-gpu-exporter exposes metrics.
kubectl get --raw '/api/v1/namespaces/arena-system/services/node-gpu-exporter:http-metrics/proxy/'
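Both kubectl get --raw commands in this thread go through the API server's service proxy path. A small Python sketch of how that path is assembled, in case other metrics need to be queried the same way; the helper name `service_proxy_path` is hypothetical, while the namespace, service, and port names are the ones used above:

```python
from urllib.parse import urlencode


def service_proxy_path(namespace, service, port, subpath="", params=None):
    """Build the API-server proxy path for a named service port."""
    path = f"/api/v1/namespaces/{namespace}/services/{service}:{port}/proxy/{subpath}"
    if params:
        path += "?" + urlencode(params)
    return path


# Prometheus instant query for a GPU metric
print(service_proxy_path("arena-system", "prometheus-svc", "prometheus",
                         "api/v1/query", {"query": "nvidia_gpu_num_devices"}))
# Root of the exporter's http-metrics port
print(service_proxy_path("arena-system", "node-gpu-exporter", "http-metrics"))
```

The printed path can be passed as the argument of kubectl get --raw.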
This is the result:
@Syulin7
https://github.com/kubeflow/arena/pull/1087
@kangzemin I submitted a PR to fix this issue. Please refer to this PR to redeploy the service.
The Prometheus deployed here is for testing only. In a production environment, you should deploy your own Prometheus service and ensure data persistence.
@Syulin7 OK, thank you!
@kangzemin Does it work after trying again? Are there any other issues?
The problem still exists.
But Prometheus looks normal:
kubectl get --raw '/api/v1/namespaces/arena-system/services/node-gpu-exporter:http-metrics/proxy/'
In Grafana, only the gpunode dashboard has data; the others are empty.
Can you give me some advice?
@kangzemin It seems that the metrics collected by node-gpu-exporter do not include the pod_name. Have you updated the node-gpu-exporter image and modified the resource limit value according to https://github.com/kubeflow/arena/pull/1087?
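One way to check this is to save the exporter output from the kubectl command above and look for a non-empty pod_name label on the GPU metrics. A rough Python sketch, assuming the exporter serves the Prometheus text format; the function name, the sample lines, and the `minor_number` label are made up for illustration and are not taken from a real exporter:

```python
import re

# Matches 'metric_name{labels} value' or 'metric_name value' sample lines.
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+\S+')
# Matches individual label="value" pairs inside the braces.
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def count_labelled(text, metric, label):
    """Count samples of `metric` whose `label` is present and non-empty."""
    count = 0
    for line in text.splitlines():
        m = SAMPLE_RE.match(line.strip())
        if not m or m.group(1) != metric:
            continue  # comment line, blank line, or a different metric
        labels = dict(LABEL_RE.findall(m.group(2) or ""))
        if labels.get(label):
            count += 1
    return count


# Hypothetical exporter output, for illustration only.
sample = """\
# TYPE nvidia_gpu_duty_cycle gauge
nvidia_gpu_duty_cycle{minor_number="0",pod_name=""} 87
nvidia_gpu_duty_cycle{minor_number="1",pod_name="trainer-0"} 42
"""
print(count_labelled(sample, "nvidia_gpu_duty_cycle", "pod_name"))  # 1
```

A count of zero on real exporter output would be consistent with the diagnosis above, since GPU usage could then not be attributed to any pod.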
@Syulin7 Yes, I updated the deployment to use the image gpu-prometheus-exporter:v1.0.1-b2c2f9b and set the limits to cpu 1, memory 2000Mi. In arena top job, the GPU info is still N/A.
@kangzemin This should be related to your cluster configuration, please contact me via email.
arena top job lost resource information
![image](https://github.com/kubeflow/arena/assets/40269690/54018c7e-2bf0-4c68-897c-94655d89c21d)
arena: v0.9.14
BuildDate: 2024-04-10T12:54:22Z
GitCommit: adb43b8d7490adc613f3d0762ffe9a8ee9f10552
GitTreeState: clean
GitTag: v0.9.14
GoVersion: go1.20.12
Compiler: gc
Platform: linux/amd64