heyfey / vodascheduler

GPU scheduler for elastic/distributed deep learning workloads in Kubernetes cluster (IC2E'23)
Apache License 2.0

Feat: discover nodes and number of GPUs in a cluster #3

Closed. bsraya closed this 2 years ago

bsraya commented 2 years ago

Discover nodes and the number of GPUs on each node at runtime.

Changes made in:

  1. pkg/placement/placement_manager.go
  2. pkg/scheduler/scheduler.go
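
To illustrate the counting step described above, here is a minimal sketch, not the code from this PR. It uses plain Go types in place of the Kubernetes client so that it runs standalone. The `Node` type, the `discoverGPUs` function, and the sample node names are hypothetical, and the resource name `nvidia.com/gpu` is an assumption about how the cluster advertises its GPUs.

```go
package main

import (
	"fmt"
	"sort"
)

// gpuResourceName is the extended resource name used by the NVIDIA device
// plugin. Assumption: the cluster advertises GPUs under this name.
const gpuResourceName = "nvidia.com/gpu"

// Node is a hypothetical stand-in for the parts of a Kubernetes node object
// that matter here: its name and its allocatable resources.
type Node struct {
	Name        string
	Allocatable map[string]int
}

// discoverGPUs returns the GPU count per node and the cluster-wide total.
// Nodes that advertise no GPUs are left out of the per-node map.
func discoverGPUs(nodes []Node) (map[string]int, int) {
	perNode := make(map[string]int)
	total := 0
	for _, n := range nodes {
		gpus := n.Allocatable[gpuResourceName]
		if gpus <= 0 {
			continue
		}
		perNode[n.Name] = gpus
		total += gpus
	}
	return perNode, total
}

func main() {
	// Sample data standing in for the result of listing nodes at runtime.
	nodes := []Node{
		{Name: "node-a", Allocatable: map[string]int{gpuResourceName: 4}},
		{Name: "node-b", Allocatable: map[string]int{"cpu": 16}},
		{Name: "node-c", Allocatable: map[string]int{gpuResourceName: 2}},
	}

	perNode, total := discoverGPUs(nodes)

	// Sort node names so the printed order is stable.
	names := make([]string, 0, len(perNode))
	for name := range perNode {
		names = append(names, name)
	}
	sort.Strings(names)

	for _, name := range names {
		fmt.Printf("%s: %d GPUs\n", name, perNode[name])
	}
	fmt.Println("total GPUs:", total)
}
```

In a real scheduler the slice of nodes would come from listing nodes through the Kubernetes API, which is why the service account needs permission to list nodes.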

heyfey commented 2 years ago

Also, I think we need to modify the ClusterRole in deploy/vodascheduler/vodascheduler.yaml to give vodascheduler permission to list nodes.

bsraya commented 2 years ago

Grant vodascheduler permission to list nodes

heyfey commented 2 years ago

LGTM

Can you log GPUAvailable in the scheduler and nodeStates in the placement manager? You can add klog.InfoS(...) in Run() so that the log appears right after "Starting scheduler", ... or "Starting placement manager", ...
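
The request above can be sketched as follows. This is a hedged illustration and not the code from this PR. To stay runnable with only the standard library, it swaps klog for `log/slog`, whose `Info(msg, key, value, ...)` call has the same message-plus-key/value shape as `klog.InfoS(msg, key, value, ...)`. The `startupLog` and `dropTime` helpers and the sample values are hypothetical. With klog, the equivalent call would look like `klog.InfoS("Starting scheduler", "GPUAvailable", n)`.

```go
package main

import (
	"bytes"
	"fmt"
	"log/slog"
)

// dropTime removes the top-level timestamp attribute so that the rendered
// line is reproducible.
func dropTime(groups []string, a slog.Attr) slog.Attr {
	if len(groups) == 0 && a.Key == slog.TimeKey {
		return slog.Attr{}
	}
	return a
}

// startupLog renders one structured log line made of a message followed by
// alternating keys and values, and returns it as a string.
func startupLog(msg string, keysAndValues ...any) string {
	var buf bytes.Buffer
	handler := slog.NewTextHandler(&buf, &slog.HandlerOptions{ReplaceAttr: dropTime})
	slog.New(handler).Info(msg, keysAndValues...)
	return buf.String()
}

func main() {
	// Hypothetical values standing in for the scheduler's GPUAvailable
	// field and the placement manager's nodeStates.
	gpuAvailable := 6
	nodeStates := map[string]int{"node-a": 4}

	fmt.Print(startupLog("Starting scheduler", "GPUAvailable", gpuAvailable))
	fmt.Print(startupLog("Starting placement manager", "nodeStates", nodeStates))
}
```

Logging the discovered values as key/value pairs next to the startup message makes it easy to confirm from the logs what the scheduler saw when it started.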

bsraya commented 2 years ago

Add klog.InfoS(...)