kubeflow / arena

A CLI for Kubeflow.
Apache License 2.0

arena top job lost resource information #1082

Closed · kangzemin closed this issue 1 month ago

kangzemin commented 2 months ago

arena top job loses resource information (see screenshot).

arena: v0.9.14
BuildDate: 2024-04-10T12:54:22Z
GitCommit: adb43b8d7490adc613f3d0762ffe9a8ee9f10552
GitTreeState: clean
GitTag: v0.9.14
GoVersion: go1.20.12
Compiler: gc
Platform: linux/amd64

Syulin7 commented 2 months ago

@kangzemin The GPU resource information depends on metrics in Prometheus: it requires a Prometheus service in the cluster that provides metrics such as nvidia_gpu_duty_cycle. For reference, see: https://github.com/kubeflow/arena/blob/master/pkg/apis/types/gpu_metric.go
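As a quick check, you can query one of those metrics through the API server proxy. This is a minimal sketch assuming the test Prometheus from the arena docs is exposed as the service prometheus-svc in the arena-system namespace (adjust the names if your setup differs):

kubectl get --raw '/api/v1/namespaces/arena-system/services/prometheus-svc:prometheus/proxy/api/v1/query?query=nvidia_gpu_duty_cycle'

An empty result vector means Prometheus is not receiving the GPU metrics, so arena top job has nothing to display.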

kangzemin commented 2 months ago

Thank you for your guidance! When I deployed the exporter following https://github.com/kubeflow/arena/blob/master/docs/top/prometheus.md, I got an error (see screenshot), so I modified kubernetes-artifacts/prometheus/gpu-exporter.yaml and deleted lines 26 and 30 (type: FileOrCreate); now the pod is running.

But the exporter pod log shows:

time="2024-05-11T03:28:50Z" level=info msg="runtime is docker"
{"level":"error","msg":"GetDriverVersion(): 535.161.07","time":"2024-05-11T03:28:50Z"}

Is there something wrong?

kubernetes version:

Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.25.4-1", GitCommit:"f23e643ebd790a62a54b376116d094a732f28263", GitTreeState:"archive", BuildDate:"2023-02-02T00:35:13Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.25.4-1", GitCommit:"f23e643ebd790a62a54b376116d094a732f28263", GitTreeState:"archive", BuildDate:"2023-02-01T01:11:45Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}

Note: my Kubernetes runtime is containerd. nvidia-smi output:

 NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2

@Syulin7

Syulin7 commented 2 months ago

so I modified kubernetes-artifacts/prometheus/gpu-exporter.yaml and deleted lines 26 and 30 (type: FileOrCreate)

@kangzemin You need to mount the node's containerd.sock to /run/containerd/containerd.sock inside the container.
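A minimal sketch of that change, assuming the exporter runs as a DaemonSet named node-gpu-exporter in arena-system and that the manifest already defines volumes and volumeMounts arrays (the names here are assumptions; adjust them to your gpu-exporter.yaml):

kubectl -n arena-system patch daemonset node-gpu-exporter --type=json -p='[
  {"op":"add","path":"/spec/template/spec/volumes/-","value":{"name":"containerd-sock","hostPath":{"path":"/run/containerd/containerd.sock"}}},
  {"op":"add","path":"/spec/template/spec/containers/0/volumeMounts/-","value":{"name":"containerd-sock","mountPath":"/run/containerd/containerd.sock"}}
]'

Equivalently, you can add the same hostPath volume and volumeMount directly in gpu-exporter.yaml and re-apply it.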

kangzemin commented 2 months ago

@kangzemin You need to mount the node's containerd.sock to /run/containerd/containerd.sock inside the container.

@Syulin7 OK, I mounted the node's /run/containerd/containerd.sock to /run/containerd/containerd.sock inside the container, and the exporter pod is running.

But the exporter pod log still shows an error:

time="2024-05-14T06:36:07Z" level=info msg="runtime is containerd"
{"level":"error","msg":"GetDriverVersion(): 535.161.07","time":"2024-05-14T06:36:07Z"}

The query result from Prometheus is empty:

kubectl get --raw '/api/v1/namespaces/arena-system/services/prometheus-svc:prometheus/proxy/api/v1/query?query=nvidia_gpu_num_devices' 
{"status":"success","data":{"resultType":"vector","result":[]}}

(screenshot: gpu-exporter)

Syulin7 commented 2 months ago

@kangzemin Execute the following command to check if node-gpu-exporter exposes metrics.

kubectl get --raw '/api/v1/namespaces/arena-system/services/node-gpu-exporter:http-metrics/proxy/'
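If the exporter is healthy, that output should contain the nvidia_gpu_* metric families (for example nvidia_gpu_num_devices and nvidia_gpu_duty_cycle), with pod labels on the per-pod series. A quick filter, assuming the same service path:

kubectl get --raw '/api/v1/namespaces/arena-system/services/node-gpu-exporter:http-metrics/proxy/' | grep nvidia_gpu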

kangzemin commented 2 months ago

@kangzemin Execute the following command to check if node-gpu-exporter exposes metrics.

kubectl get --raw '/api/v1/namespaces/arena-system/services/node-gpu-exporter:http-metrics/proxy/'

Here is the result (screenshots attached).

@Syulin7

Syulin7 commented 2 months ago

https://github.com/kubeflow/arena/pull/1087

@kangzemin I submitted a PR to fix this issue. Please refer to this PR to redeploy the service.

The Prometheus deployed here is for testing only. In a production environment, you should deploy your own Prometheus service and ensure data persistence.
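A minimal redeploy sketch, assuming you apply the updated manifest from a checkout of the arena repository and that the exporter is a DaemonSet named node-gpu-exporter (paths and names are assumptions; adjust to your environment):

kubectl apply -f kubernetes-artifacts/prometheus/gpu-exporter.yaml
kubectl -n arena-system rollout status daemonset/node-gpu-exporter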

kangzemin commented 2 months ago

The Prometheus deployed here is for testing only. In a production environment, you should deploy your own Prometheus service and ensure data persistence.

@Syulin7 OK, thank you!

Syulin7 commented 2 months ago

@kangzemin Does it work after trying again? Are there any other issues?

kangzemin commented 1 month ago

@kangzemin Does it work after trying again? Are there any other issues?

The problem still exists (see screenshot), but Prometheus looks normal.

kubectl get --raw '/api/v1/namespaces/arena-system/services/node-gpu-exporter:http-metrics/proxy/'

(screenshot: prometheus)

In Grafana, only the GPU node dashboard has data; the other dashboards are empty.

Can you give me some advice?

Syulin7 commented 1 month ago

@kangzemin It seems that the metrics collected by node-gpu-exporter do not include the pod_name. Have you updated the node-gpu-exporter image and modified the resource limit value according to https://github.com/kubeflow/arena/pull/1087?
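A quick way to verify both points, assuming the exporter runs as a DaemonSet named node-gpu-exporter in arena-system (the names are assumptions; adjust to your manifest):

kubectl -n arena-system get daemonset node-gpu-exporter -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}{.spec.template.spec.containers[0].resources}{"\n"}'
kubectl get --raw '/api/v1/namespaces/arena-system/services/node-gpu-exporter:http-metrics/proxy/' | grep pod_name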

kangzemin commented 1 month ago

It seems that the metrics collected by node-gpu-exporter do not include the pod_name. Have you updated the node-gpu-exporter image and modified the resource limit value according to #1087?

@Syulin7 Yes, I updated the deployment to use the image gpu-prometheus-exporter:v1.0.1-b2c2f9b and set the limits to cpu: 1 and memory: 2000Mi, but arena top job still shows N/A for the GPU information.

Syulin7 commented 1 month ago

@kangzemin This should be related to your cluster configuration, please contact me via email.