koordinator-sh / koordinator

A QoS-based scheduling system brings optimal layout and status to workloads such as microservices, web services, big data jobs, AI jobs, etc.
https://koordinator.sh
Apache License 2.0
1.36k stars, 331 forks

[BUG] DCGM Metrics Not Supported When Using Koordinator GPU Allocation Logic #2171

Open ZiMengSheng opened 3 months ago

ZiMengSheng commented 3 months ago

What happened:

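DCGM exposes per-Pod GPU metrics through the PodResources interface, which relies on the kubelet's GPU allocation results. With Koordinator, however, GPU allocation is decided by the scheduler rather than the kubelet, so DCGM runs into problems here.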
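To make the mismatch concrete, here is a minimal sketch of the Pod-to-GPU mapping an exporter would need when the kubelet does not hold the allocation. It assumes the scheduler's decision is recorded on the Pod as a JSON annotation. The annotation key `scheduling.koordinator.sh/device-allocated` and the payload shape (`{"gpu": [{"minor": 0, ...}]}`) are assumptions for illustration and should be checked against the Koordinator version in use. The function name `gpu_minor_to_pods` is hypothetical.

```python
import json

# Assumed annotation key and payload shape; verify against your Koordinator version.
DEVICE_ALLOCATED_ANNOTATION = "scheduling.koordinator.sh/device-allocated"


def gpu_minor_to_pods(pods):
    """Build a map of GPU minor number -> [(namespace, pod name), ...].

    `pods` is a list of Pod objects as plain dicts (for example, the "items"
    of a Pod list already decoded from JSON). Pods without the annotation,
    or with an unparsable annotation, are skipped.
    """
    mapping = {}
    for pod in pods:
        meta = pod.get("metadata", {})
        raw = (meta.get("annotations") or {}).get(DEVICE_ALLOCATED_ANNOTATION)
        if not raw:
            continue
        try:
            allocations = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if not isinstance(allocations, dict):
            continue
        for gpu in allocations.get("gpu", []):
            minor = gpu.get("minor")
            if minor is None:
                continue
            owner = (meta.get("namespace"), meta.get("name"))
            mapping.setdefault(minor, []).append(owner)
    return mapping
```

A metrics pipeline could join a table like this with per-device DCGM metrics (keyed by GPU minor number) to attach Pod labels, instead of relying on the PodResources response.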

What you expected to happen:

Users should be able to see, in some way, the same metrics that DCGM provides.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

stale[bot] commented 10 hours ago

This issue has been automatically marked as stale because it has not had recent activity. This bot triages issues and PRs according to the following rules: