kubernetes-sigs / jobset

JobSet: a k8s native API for distributed ML training and HPC workloads
https://jobset.sigs.k8s.io/
Apache License 2.0

Add Job name label to pods #578

Closed: danielvegamyhre closed this 4 months ago

danielvegamyhre commented 4 months ago

What would you like to be added: Add Job name label to pods

Why is this needed: In dynamically provisioned clusters, this would help identify which Job's leader pod triggered the node pool creation.

googs1025 commented 4 months ago

/assign

danielvegamyhre commented 4 months ago

job-name label is here: https://github.com/kubernetes-sigs/jobset/blob/0b8c19c14159302048b7fd5bf50a1c2d819193e2/api/jobset/v1alpha2/jobset_types.go#L29

danielvegamyhre commented 4 months ago

Thanks for working on this @googs1025!

googs1025 commented 4 months ago

> job-name label is here: https://github.com/kubernetes-sigs/jobset/blob/0b8c19c14159302048b7fd5bf50a1c2d819193e2/api/jobset/v1alpha2/jobset_types.go#L29

Okay, thanks for the reminder. I will try it out in the next few days!

googs1025 commented 4 months ago

root@VM-0-17-ubuntu:/home/ubuntu# kubectl describe pods network-jobset-leader-0-0-2qphg
Name:             network-jobset-leader-0-0-2qphg
Namespace:        default
Priority:         0
Service Account:  default
Node:             cluster1-worker2/172.18.0.4
Start Time:       Sat, 25 May 2024 14:44:56 +0800
Labels:           batch.kubernetes.io/controller-uid=88683b1c-7d3b-4108-a28c-a5382e876e01
                  batch.kubernetes.io/job-completion-index=0
                  batch.kubernetes.io/job-name=network-jobset-leader-0
                  controller-uid=88683b1c-7d3b-4108-a28c-a5382e876e01
                  job-name=network-jobset-leader-0
                  jobset.sigs.k8s.io/job-index=0
                  jobset.sigs.k8s.io/job-key=f355a63a96e9dba5c05d688eac605a6c8c1f379f
                  jobset.sigs.k8s.io/jobset-name=network-jobset
                  jobset.sigs.k8s.io/replicatedjob-name=leader
                  jobset.sigs.k8s.io/replicatedjob-replicas=1
                  jobset.sigs.k8s.io/restart-attempt=0
Annotations:      batch.kubernetes.io/job-completion-index: 0
                  jobset.sigs.k8s.io/job-index: 0
                  jobset.sigs.k8s.io/job-key: f355a63a96e9dba5c05d688eac605a6c8c1f379f
                  jobset.sigs.k8s.io/jobset-name: network-jobset
                  jobset.sigs.k8s.io/replicatedjob-name: leader
                  jobset.sigs.k8s.io/replicatedjob-replicas: 1
                  jobset.sigs.k8s.io/restart-attempt: 0
Status:           Running
IP:               10.6.2.19
IPs:

...

root@VM-0-17-ubuntu:/home/ubuntu# kubectl describe pods pi-26c2f
Name:             pi-26c2f
Namespace:        default
Priority:         0
Service Account:  default
Node:             cluster1-worker2/172.18.0.4
Start Time:       Sat, 25 May 2024 12:56:50 +0800
Labels:           app=my-app
                  batch.kubernetes.io/controller-uid=ce3589d6-a8fe-4975-a292-e369cbc48b88
                  batch.kubernetes.io/job-name=pi
                  controller-uid=ce3589d6-a8fe-4975-a292-e369cbc48b88
                  job-name=pi
...

I launched a JobSet and a standalone Job in the cluster, and noticed that the pods in both cases already have a 'job-name' label.
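
For anyone who wants to recover the Job name from a pod programmatically, here is a minimal sketch based only on the label keys visible in the `kubectl describe` output above. The function name `job_name_from_labels` and the sample dictionary are hypothetical, and the labels are passed in as a plain dict rather than fetched from a cluster.

```python
# Label keys copied from the `kubectl describe` output above.
JOB_NAME_LABEL = "batch.kubernetes.io/job-name"
UNPREFIXED_JOB_NAME_LABEL = "job-name"


def job_name_from_labels(labels):
    """Return the owning Job's name from a pod's labels, or None if absent.

    Hypothetical helper for illustration; not part of JobSet or Kubernetes.
    """
    # Prefer the prefixed key and fall back to the unprefixed one.
    return labels.get(JOB_NAME_LABEL) or labels.get(UNPREFIXED_JOB_NAME_LABEL)


# A subset of the labels on the leader pod shown above.
leader_pod_labels = {
    "batch.kubernetes.io/job-name": "network-jobset-leader-0",
    "job-name": "network-jobset-leader-0",
    "jobset.sigs.k8s.io/jobset-name": "network-jobset",
    "jobset.sigs.k8s.io/replicatedjob-name": "leader",
}

print(job_name_from_labels(leader_pod_labels))  # network-jobset-leader-0
```

The same lookup works for the `pi` pod, since it carries both label keys as well.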

danielvegamyhre commented 4 months ago

Oh, I forgot the Job controller already adds this. Never mind, we can close this.