cc @mimowo @trasc for ideas
Can you tell me how many characters the job name had? I wonder if it's something related to that.
EDIT:
I suppose it's ml-job-annotator-gt-webface-gpu, which is 31 characters. This is pretty low compared to the size limit of 63 for CRD names.
Also, it might be helpful if you could show us the output of kubectl describe workloads/<workload_name> and kubectl describe jobs/<job_name>.
EDIT: these commands will also give us events.
Also, it might be helpful if you could show us the output of kubectl describe workloads/<workload_name> and kubectl describe jobs/<job_name>.
Maybe the full yaml of the workload is better
Maybe the full yaml of the workload is better
This can be handy as well: kubectl get workloads/<workload_name> -oyaml
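For reference, with the names from this report (namespace and workload name taken from the yaml shared below, job name assumed), the commands would be roughly:

```sh
kubectl -n pgrunt describe jobs/ml-job-annotator-gt-webface-gpu
kubectl -n pgrunt describe workloads/job-ml-job-annotator-gt-webface-gpu-ff6d8
kubectl -n pgrunt get workloads/job-ml-job-annotator-gt-webface-gpu-ff6d8 -oyaml
```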
Here's the workload, but unfortunately I no longer have the job and our cluster is fully booked atm. :(
apiVersion: v1
items:
- apiVersion: kueue.x-k8s.io/v1beta1
kind: Workload
metadata:
creationTimestamp: "2024-02-11T10:42:40Z"
generation: 3
labels:
kueue.x-k8s.io/job-uid: a6f3d317-9ab2-44ee-917e-690fd4d29668
name: job-ml-job-annotator-gt-webface-gpu-ff6d8
namespace: pgrunt
resourceVersion: "336381385"
uid: feb1af8b-abe6-4b35-a6f2-29a808a3416f
spec:
active: true
podSets:
- count: 1
name: main
template:
metadata:
name: ml-job-annotator-gt-webface-gpu
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- hawking
containers:
- command:
- bash
- -c
- |
export PATH="/opt/venv/bin/:${PATH}" &&
cd /data/workspace/facekit &&
pip install -e . &&
facekit face_annotator multiface ${WORKLOAD_ARGS} --processes 20
env:
- name: GIT_REPO
value: "redacted"
- name: GIT_CHECKOUT_BRANCH
value: "redacted"
- name: WORKLOAD_ARGS
value: ' --input-path /mnt/data/downloader/wikidata/crawled_identities/gt_images.h5
--output-path /mnt/data/downloader/wikidata/crawled_identities/gt_images_annotated_webface.h5 '
- name: OMP_NUM_THREADS
value: "1"
- name: NUMEXPR_NUM_THREADS
value: "1"
- name: AWS_ACCESS_KEY_ID
valueFrom:
secretKeyRef:
key: AWS_ACCESS_KEY_ID
name: user-credentials
- name: AWS_SECRET_ACCESS_KEY
valueFrom:
secretKeyRef:
key: AWS_SECRET_ACCESS_KEY
name: user-credentials
- name: CLEARML_API_ACCESS_KEY
valueFrom:
secretKeyRef:
key: CLEARML_API_ACCESS_KEY
name: user-credentials
- name: CLEARML_API_SECRET_KEY
valueFrom:
secretKeyRef:
key: CLEARML_API_SECRET_KEY
name: user-credentials
image: "redacted"
imagePullPolicy: Always
name: ml-main
resources:
limits:
cpu: "20"
ephemeral-storage: 12Gi
memory: 80Gi
nvidia.com/gpu: "2"
requests:
cpu: "10"
ephemeral-storage: 12Gi
memory: 45Gi
securityContext:
privileged: true
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /data/user
name: user-configuration
- mountPath: /data/workspace
name: workspace
- mountPath: /srv/dvc_cache/
name: dvc-cache
- mountPath: /mnt/nas.brno/
name: nas-brno
- mountPath: /dev/shm
name: dev-shm
- mountPath: /mnt/data/downloader
name: downloader
- mountPath: /encrypted/INC-12499
name: wiki-hdd
workingDir: /data/workspace
dnsPolicy: ClusterFirst
imagePullSecrets:
- name: registry-credentials
initContainers:
- command:
- bash
- -c
- |
if [[ ! -d "/root/.ssh" ]]; then
mkdir ~/.ssh &&
mkdir ~/.kube &&
cp /data/user/id_* ~/.ssh &&
cp /data/user/.git* ~/ &&
cp /data/user/kubeconfig ~/.kube/config &&
ssh-keyscan -p 7999 -t rsa "redacted" > ~/.ssh/known_hosts &&
cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys &&
chmod 600 ~/.ssh/id_rsa &&
git clone $GIT_REPO &&
cd "$(basename "$GIT_REPO" .git)" &&
git checkout $GIT_CHECKOUT_BRANCH
fi
env:
- name: GIT_REPO
value: "redacted"
- name: GIT_CHECKOUT_BRANCH
value: "redacted"
image: bitnami/git:latest
imagePullPolicy: Always
name: init-git-repo
resources:
limits:
cpu: 250m
memory: 512Mi
requests:
cpu: 100m
memory: 512Mi
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /data/user
name: user-configuration
- mountPath: /data/workspace
name: workspace
workingDir: /data/workspace
priorityClassName: ml-low
restartPolicy: Never
runtimeClassName: nvidia
schedulerName: default-scheduler
securityContext: {}
terminationGracePeriodSeconds: 30
volumes:
- name: user-configuration
projected:
defaultMode: 420
sources:
- secret:
items:
- key: id_rsa
path: id_rsa
- key: id_rsa.pub
path: id_rsa.pub
- key: .git-credentials
path: .git-credentials
- key: kubeconfig
path: kubeconfig
name: user-credentials
- configMap:
items:
- key: .gitconfig
path: .gitconfig
name: gitconfig
- configMap:
items:
- key: clearml.conf
path: clearml.conf
name: clearml.conf
- hostPath:
path: /srv/dvc_cache/
type: ""
name: dvc-cache
- emptyDir: {}
name: workspace
- hostPath:
path: /mnt/nas.brno
type: ""
name: nas-brno
- configMap:
defaultMode: 420
name: iengine.lic
name: iengine-lic
- emptyDir:
medium: Memory
sizeLimit: 1Gi
name: dev-shm
- hostPath:
path: /mnt/data/downloader
type: ""
name: downloader
- hostPath:
path: /encrypted/INC-12499
type: ""
name: wiki-hdd
priority: 10
priorityClassName: ml-low
priorityClassSource: kueue.x-k8s.io/workloadpriorityclass
queueName: lq-all-resources
status:
admission:
clusterQueue: cq-all-resources
podSetAssignments:
- count: 1
flavors:
cpu: default-flavor
ephemeral-storage: default-flavor
memory: default-flavor
nvidia.com/gpu: gpu-a40
pods: default-flavor
name: main
resourceUsage:
cpu: "10"
ephemeral-storage: 12Gi
memory: 45Gi
nvidia.com/gpu: "2"
pods: "1"
conditions:
- lastTransitionTime: "2024-02-11T10:42:40Z"
message: Quota reserved in ClusterQueue cq-all-resources
reason: QuotaReserved
status: "True"
type: QuotaReserved
- lastTransitionTime: "2024-02-11T10:42:40Z"
message: The workload is admitted
reason: Admitted
status: "True"
type: Admitted
kind: List
metadata:
resourceVersion: ""
The pod template in the workload differs from the pod template for the job, for example in this line: facekit face_annotator multiface ${WORKLOAD_ARGS} --processes (20 vs 30).
But Kueue should be deleting such non-matching Workloads and creating a new one. There is no deletion timestamp in the shared yaml (assuming it's the latest state).
The Workload does not have a finalizer, meaning that Kueue removed it, but it somehow didn't call Delete? Or it did and it failed?
Do you see any log lines like deleting not matching workload or Deleted not matching workload?
The pod template in the workload differs from the pod template for the job, for example in this line: facekit face_annotator multiface ${WORKLOAD_ARGS} --processes (20 vs 30).
Sorry, there were some edits afterwards from the developer, but nothing major should have changed.
The Workload does not have a finalizer, meaning that Kueue removed it, but it somehow didn't call Delete? Or it did and it failed?
Do you see any log lines like deleting not matching workload or Deleted not matching workload?
No logs containing not matching.
Indeed, the workload does not match the job and it should be deleted, but it is not, since it has no owner ref pointing to the job... Maybe the workload is left over from an older job with the same name, potentially deleted with --cascade=orphan.
I will reply with fresh job+workload definitions, so they are 100% matching. Sorry if I caused any confusion so far.
Unfortunately, if you don't specify a cascade option for Job, orphan is the default :( Which could lead to this situation.
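For concreteness, the two deletion modes discussed here can be reproduced with kubectl (job name is illustrative):

```sh
# background (kubectl's default): the dependent Workload is garbage collected with the Job
kubectl delete job ml-job-annotator-gt-webface-gpu --cascade=background

# orphan: the Job is removed but its Workload is left behind,
# which can then clash with a re-created Job of the same name
kubectl delete job ml-job-annotator-gt-webface-gpu --cascade=orphan
```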
I wonder if we should just delete orphan Workloads... or is it a user error to orphan a Workload, and we shouldn't remove the object because the user has not chosen to delete it (by not using a different cascade)?
But let's leave this investigation for now.
/triage needs-information
Unfortunately, if you don't specify a cascade option for Job, orphan is the default :(
I think it should be background.
Job is the only k8s API that uses orphan as the default, instead of background... I know, pretty bad, but we can't change it because of backwards compatibility.
kubectl delete uses background by default. How did you (or your user) delete the old Job?
Via k9s ctrl+d, which defaults to Background. I will post the new job+workload tomorrow, sorry for the delay.
I will reply with fresh job+workload definitions, so they are 100% matching. Sorry if I caused any confusion so far.
Thanks, this will be great. IIUC the current hypothesis is that the job got deleted, the workload stayed, and the job was recreated by the user with a slightly different pod spec, causing the issue. It would be good to confirm the scenario.
I got Slack confirmation from @trasc that deleting a job with --cascade=orphan and recreating it (I assume with a slightly different spec) leads to the observed behavior. In that case I'm wondering if, as a fix, we could detect this situation (the target workload has a non-existing job owner) and recreate the workload.
@xmolitann Could you confirm if that happened? And if so, what was the motivation to use --cascade=orphan? We are not sure what the correct behavior should be. Options are:
How about not inheriting but recreating the workload, so that we still maintain equivalence between the job and the workload?
I got Slack confirmation from @trasc that deleting a job with --cascade=orphan and recreating it (I assume with a slightly different spec) leads to the observed behavior. In that case I'm wondering if, as a fix, we could detect this situation (the target workload has a non-existing job owner) and recreate the workload.
I can confirm this: when I delete the job with background, the workload gets deleted as well, and I can then re-submit the job and the pod gets admitted. When I delete it with orphan, what you are mentioning happens. So this is rather a user error than a bug, sorry for the confusion. What to do with this behavior I will leave up to you. I guess a more verbose error message would be fine.
In case I wasn't clear, I think the issue is rather narrow in scope, as it requires the following scenario:
cascade=orphan
My proposal is to fix it in the least intrusive manner: when we get the "Already exists" error (step 4.), fetch the workload and verify whether it is orphaned. If the workload is orphaned, delete it, which will allow the new workload to be created in its place. This does not necessarily entail inheriting an existing workload. As a performance optimization we could inherit such an orphan workload, but I think only if we detect that the job template is still equivalent to the workload.
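A rough sketch of what that flow could look like with a controller-runtime client (hypothetical helper and package names, not Kueue's actual implementation):

```go
package sketch

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"

	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// handleWorkloadCreateConflict is a hypothetical helper: after creating the
// desired Workload fails with "already exists", fetch the existing object and,
// if nothing owns it anymore (e.g. its Job was deleted with --cascade=orphan),
// delete it so the next reconcile can create a fresh, matching Workload.
func handleWorkloadCreateConflict(ctx context.Context, c client.Client, key client.ObjectKey) error {
	var existing kueue.Workload
	if err := c.Get(ctx, key, &existing); err != nil {
		return client.IgnoreNotFound(err)
	}
	if len(existing.OwnerReferences) == 0 {
		// Orphaned leftover: delete it rather than trying to inherit it.
		return client.IgnoreNotFound(c.Delete(ctx, &existing))
	}
	// Still owned by a live object; leave it to the normal equivalence checks.
	return nil
}
```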
In my opinion a workload should be dedicated to a job (k8s object). By deleting and creating a new job with the same name you get another object.
I kind of like the idea of just using the UID. However @trasc can you verify what happens when you upgrade kueue? Do we make assumptions about the workload name?
The other point is that if the orphan Workload is not deleted and is not finished, it's consuming quota on the kueue. So maybe that's an argument for garbage collecting it. Or at least marking it Finished.
By deleting and creating a new job with the same name you get another object.
The proposal https://github.com/kubernetes-sigs/kueue/issues/1726#issuecomment-1944206289 is compatible with this. We don't reuse the workload, but recreate.
I agree that it's compatible, but it's more steps.
I'm also hesitant about using UIDs because it makes the workload names virtually random. Having predictable workload names is a nice feature which can be useful for some users, even for us when documenting features.
I agree that it's compatible, but it's more steps.
Yes, but it is more steps only in this situation, which should be rare anyway.
The other point is that if the orphan Workload is not deleted and is not finished, it's consuming quota on the kueue. So maybe that's an argument for garbage collecting it. Or at least marking it Finished.
Yes, I like the idea of marking orphan workloads as finished (or even deleting them) by some form of garbage collection. The fix for leftover orphan workloads is needed regardless of whether we go with the UID or the workload-recreation solution.
In my opinion the first step is
then add a garbage collection scheme.
I'm also hesitant about using UIDs because it makes the workload names virtually random
We already add a hash at the end of the name, so it's not immediately obvious how to get from the job name to the workload name.
We already add a hash at the end of the name, so it's not immediately obvious how to get from the job name to the workload name.
Right, but if you repeat some script multiple times, it is enough to check the workload name once. For example, in Kueue periodic tests we can find the same workload names when comparing different runs.
I think the following concerns are still valid:
Thus I think it is reasonable to rethink whether there are alternatives. I admit that in the other approach we need 3 requests (one to detect the conflict, one to fetch the workload, one to check there is no owner reference in the workload), so it's not ideal either.
How about just deleting a workload from workload_controller in reaction to its metadata.ownerReferences being removed (by the kube garbage collector, which removes them in the "orphan" mode)? I think this would also solve the issue. One downside I see is that we would still delete even though "--cascade=orphan" was used.
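A sketch of how that reaction could be expressed as a controller-runtime predicate on the Workload watch (illustrative only; the wiring is an assumption, not Kueue's actual code):

```go
package sketch

import (
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// ownerRefsRemoved fires when a Workload that used to have owner references
// loses all of them, which is what the kube garbage collector does to
// dependents when the delete uses the "orphan" propagation policy.
var ownerRefsRemoved = predicate.Funcs{
	UpdateFunc: func(e event.UpdateEvent) bool {
		return len(e.ObjectOld.GetOwnerReferences()) > 0 &&
			len(e.ObjectNew.GetOwnerReferences()) == 0
	},
}
```

The workload controller could then enqueue such objects and delete them (or mark them Finished, as discussed above).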
this will cause all running jobs to stop as the workloads will not match during upgrade, all pending jobs are requeued
Actually, I'm not sure about this (didn't test). Now I think there actually may not be any impact here, because the workload with the old scheme can still support the job, based on the ownerReference. Sorry for the confusion; in that case the remaining two points probably aren't that relevant.
EDIT: I now tested by changing the workload name scheme, and upgraded in-place. The old job continues to run. Once again, sorry for the confusion. Given that (2.) and (3.) are not big concerns, I'm OK with https://github.com/kubernetes-sigs/kueue/issues/1726#issuecomment-1944319301.
However @trasc can you verify what happens when you upgrade kueue? Do we make assumptions about the workload name?
@trasc This is my main concern against generating a hash with the UID. If the workload is recreated and readmission or termination happens against existing Jobs, including the uid in the hash would be a barrier to updating the kueue version.
However @trasc can you verify what happens when you upgrade kueue? Do we make assumptions about the workload name?
@trasc This is my main concern against generating a hash with the UID. If the workload is recreated and readmission or termination happens against existing Jobs, including the uid in the hash would be a barrier to updating the kueue version.
The name is only used when creating the workload; after that we only use the owner references to "connect" the wl to its job. In case of a kueue upgrade, the old wl will be paired with its job based on owner refs (regardless of names), and newer workloads are created using the new naming method.
However @trasc can you verify what happens when you upgrade kueue? Do we make assumptions about the workload name?
@trasc This is my main concern against generating a hash with the UID. If the workload is recreated and readmission or termination happens against existing Jobs, including the uid in the hash would be a barrier to updating the kueue version.
The name is only used when creating the workload; after that we only use the owner references to "connect" the wl to its job. In case of a kueue upgrade, the old wl will be paired with its job based on owner refs (regardless of names), and newer workloads are created using the new naming method.
That makes sense. I'm fine with including a UID in the hash.
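For illustration, a scheme along these lines could fold the Job UID into the generated name (a sketch under my own assumptions, not necessarily the exact scheme Kueue adopted):

```go
package sketch

import (
	"crypto/sha256"
	"fmt"

	"k8s.io/apimachinery/pkg/types"
)

// workloadNameWithUID derives a Workload name from the owning Job's name and
// UID, so a re-created Job (same name, new UID) maps to a different Workload.
// Real code would also have to truncate long job names to stay within the
// name length limits mentioned earlier in the thread.
func workloadNameWithUID(jobName string, jobUID types.UID) string {
	sum := sha256.Sum256([]byte(jobName + "/" + string(jobUID)))
	return fmt.Sprintf("job-%s-%x", jobName, sum[:3])
}
```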
What happened:
Job submitted, the workload got created and admitted for that job. The Job was never un-suspended. I tried to patch the Job to suspend: false, but it immediately got suspended again. There are multiple identical errors from kueue:
Cluster, local queue
Job.yaml
The Workload object has status:
What you expected to happen:
Job gets un-suspended, Pod is created and running.
How to reproduce it (as minimally and precisely as possible):
Create a cluster + local queue and try submitting the job mentioned above.
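A hypothetical outline of those steps (manifest names are placeholders; the actual yamls are not included in this report):

```sh
# create the queues the job will be submitted to
kubectl apply -f cluster-queue.yaml -f local-queue.yaml
# submit the job, labeled with kueue.x-k8s.io/queue-name pointing at the local queue
kubectl apply -f job.yaml
# the Workload is created and admitted, but the Job stays suspended
kubectl -n <namespace> get workloads,jobs
```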
Anything else we need to know?:
I guess that's all; let me know if you need any more information and I will happily provide it.
Environment:
Kubernetes version (kubectl version): Server Version: v1.26.10+rke2r2
Kueue version (git describe --tags --dirty --always): 0.5.2, installed via the 0.1.0 Helm chart
OS (cat /etc/os-release): Ubuntu 22.04.3 LTS
Kernel (uname -a): 5.15.0-79-generic