volcano-sh / volcano

A Cloud Native Batch System (Project under CNCF)
https://volcano.sh
Apache License 2.0

volcano vgpu metrics not update properly #3605

Closed archlitchi closed 1 month ago

archlitchi commented 1 month ago

If you submit a vgpu job, you can see the corresponding metrics in the scheduler's metrics output, as follows:

Task YAML:

apiVersion: v1
kind: Pod
metadata:
  name: pod1
spec:
  restartPolicy: OnFailure
  schedulerName: volcano
  containers:
  - image: ubuntu:20.04
    name: pod1-ctr
    command: ["sleep"]
    args: ["100000"]
    resources:
      limits:
        volcano.sh/vgpu-memory: 1024
        volcano.sh/vgpu-number: 1

Then, by querying the scheduler's metrics endpoint, you can get the vgpu overview from vc-scheduler:

curl {vc-scheduler}:8080/metrics
# HELP volcano_vgpu_device_allocated_cores The percentage of gpu compute cores allocated in this card
# TYPE volcano_vgpu_device_allocated_cores gauge
volcano_vgpu_device_allocated_cores{devID="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec"} 0
volcano_vgpu_device_allocated_cores{devID="GPU-0fc3eda5-e98b-a25b-5b0d-cf5c855d1448"} 0
# HELP volcano_vgpu_device_allocated_memory The number of vgpu memory allocated in this card
# TYPE volcano_vgpu_device_allocated_memory gauge
volcano_vgpu_device_allocated_memory{devID="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec"} 0
volcano_vgpu_device_allocated_memory{devID="GPU-0fc3eda5-e98b-a25b-5b0d-cf5c855d1448"} 1024
# HELP volcano_vgpu_device_memory_limit The number of total device memory allocated in this card
# TYPE volcano_vgpu_device_memory_limit gauge
volcano_vgpu_device_memory_limit{devID="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec"} 32768
volcano_vgpu_device_memory_limit{devID="GPU-0fc3eda5-e98b-a25b-5b0d-cf5c855d1448"} 32768
# HELP volcano_vgpu_device_shared_number The number of vgpu tasks sharing this card
# TYPE volcano_vgpu_device_shared_number gauge
volcano_vgpu_device_shared_number{devID="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec"} 0
volcano_vgpu_device_shared_number{devID="GPU-0fc3eda5-e98b-a25b-5b0d-cf5c855d1448"} 1

But these metrics are not cleaned up after the pod ends. The values shown above remain unchanged even after the pod is deleted.
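One way to confirm this from the endpoint output is to parse the metrics text and list the devices that still report an allocation once the pod is gone. The following is only a minimal sketch: the function names `parse_metrics` and `devices_still_allocated` are hypothetical, and it assumes every sample line carries a single `devID` label, as in the output above. Pass it only the response body, not the `curl` command line.

```python
from collections import defaultdict

# Two sample lines copied from the metrics output in this issue.
SAMPLE = '''
# TYPE volcano_vgpu_device_allocated_memory gauge
volcano_vgpu_device_allocated_memory{devID="GPU-00552014-5c87-89ac-b1a6-7b53aa24b0ec"} 0
volcano_vgpu_device_allocated_memory{devID="GPU-0fc3eda5-e98b-a25b-5b0d-cf5c855d1448"} 1024
'''


def parse_metrics(text):
    """Parse Prometheus text exposition lines into {metric: {devID: value}}."""
    metrics = defaultdict(dict)
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # Sample lines look like: name{devID="GPU-..."} 1024
        head, _, value = line.rpartition(" ")
        name, _, labels = head.partition("{")
        # Assumption: devID is the only label on the line.
        dev_id = labels.rstrip("}").partition('devID="')[2].rstrip('"')
        metrics[name][dev_id] = float(value)
    return metrics


def devices_still_allocated(text):
    """Return the devIDs whose allocated memory or shared count is non-zero."""
    metrics = parse_metrics(text)
    watched = (
        "volcano_vgpu_device_allocated_memory",
        "volcano_vgpu_device_shared_number",
    )
    stale = set()
    for name in watched:
        for dev_id, value in metrics.get(name, {}).items():
            if value != 0:
                stale.add(dev_id)
    return sorted(stale)


if __name__ == "__main__":
    print(devices_still_allocated(SAMPLE))
```

If no vgpu pods are running and this still returns a device, the gauges for that device were not reset.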

Monokaix commented 1 month ago

/good-first-issue

volcano-sh-bot commented 1 month ago

@Monokaix: This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-good-first-issue command.

googs1025 commented 1 month ago

/assign

googs1025 commented 1 month ago

Let me understand the problem first. When our pod is updated, the indicator is not updated accordingly, right? @archlitchi

Monokaix commented 1 month ago

I think the problem is that the metrics are not updated when the pod is deleted :) The core code is in pkg/scheduler/api/devices/nvidia/vgpu/metrics.go and pkg/scheduler/api/devices/nvidia/vgpu/device_info.go.
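To illustrate the kind of bookkeeping being discussed, here is a toy model in Python. It is not Volcano's code and not the patch for this issue: the class `VGPUDeviceGauges` and its methods are hypothetical. It only shows the general idea that per-device gauges need to be recomputed on the pod removal path as well as on the add path, otherwise the last written value stays visible.

```python
class VGPUDeviceGauges:
    """Toy stand-in for one device's gauges (hypothetical, not Volcano's API)."""

    def __init__(self, memory_limit):
        self.memory_limit = memory_limit
        self.allocated_memory = 0
        self.shared_number = 0
        self._pods = {}  # pod name -> vgpu memory held by that pod

    def add_pod(self, pod, memory):
        """Record an allocation and refresh the gauges."""
        self._pods[pod] = memory
        self._refresh()

    def remove_pod(self, pod):
        """Drop the allocation and refresh the gauges.

        Skipping the refresh here would reproduce the stale values
        described in this issue.
        """
        self._pods.pop(pod, None)
        self._refresh()

    def _refresh(self):
        # Recompute both gauges from the pods currently on the device.
        self.allocated_memory = sum(self._pods.values())
        self.shared_number = len(self._pods)


if __name__ == "__main__":
    gauges = VGPUDeviceGauges(memory_limit=32768)
    gauges.add_pod("pod1", 1024)
    gauges.remove_pod("pod1")
    print(gauges.allocated_memory, gauges.shared_number)
```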

archlitchi commented 1 month ago

> Let me understand the problem first. When our pod is updated, the indicator is not updated accordingly, right? @archlitchi

Yes, I will submit a patch to fix that.

Monokaix commented 1 month ago

/close

volcano-sh-bot commented 1 month ago

@Monokaix: Closing this issue.
