kubernetes-sigs / kwok

Kubernetes WithOut Kubelet - Simulates thousands of Nodes and Clusters.
https://kwok.sigs.k8s.io
Apache License 2.0

delay.durationMilliseconds in pod-complete does not take effect #1146

Open dmitsh opened 1 week ago

dmitsh commented 1 week ago

How to use it?

What happened?

I'm running KWOK v0.5.2 in a cluster. I deployed the set of stages from stage-fast.yaml, then updated pod-complete by copying kustomize/stage/pod/general/pod-complete.yaml, setting delay.durationMilliseconds to 10000, and applying it to the cluster. Then I deployed a job. The job does not have the pod-complete.stage.kwok.x-k8s.io/delay annotation. When the job starts, the pods are marked Completed right away:

% k get po
NAME           READY   STATUS      RESTARTS   AGE
job1-0-prkbn   0/1     Completed   0          3s
job1-1-zg6tc   0/1     Completed   0          3s

What did you expect to happen?

IIUC, in the absence of the pod-complete.stage.kwok.x-k8s.io/delay annotation, delay.durationMilliseconds specifies how long the pods should run before their status changes to Completed.
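
For context, this is roughly the delay stanza of the modified Stage; it is a sketch based on the upstream pod-complete.yaml (surrounding fields such as resourceRef, selector, and next are left as in that file), with durationMilliseconds set to 10000 as described above:

```yaml
# Sketch of the relevant part of the modified pod-complete Stage;
# only the delay stanza is shown, the rest follows the upstream file.
apiVersion: kwok.x-k8s.io/v1alpha1
kind: Stage
metadata:
  name: pod-complete
spec:
  delay:
    # Fixed delay I expected to apply when no annotation overrides it.
    durationMilliseconds: 10000
    # Per-pod override read from the annotation, when present.
    durationFrom:
      expressionFrom: '.metadata.annotations["pod-complete.stage.kwok.x-k8s.io/delay"]'
```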

How can we reproduce it (as minimally and precisely as possible)?

  1. Deploy kwok and stages in the cluster
    kubectl apply -f https://github.com/kubernetes-sigs/kwok/releases/download/v0.5.2/kwok.yaml
    kubectl apply -f https://github.com/kubernetes-sigs/kwok/releases/download/v0.5.2/stage-fast.yaml
  2. Get a copy of pod-complete.yaml
    wget https://github.com/kubernetes-sigs/kwok/raw/main/kustomize/stage/pod/general/pod-complete.yaml
  3. Set delay.durationMilliseconds to 10000 in pod-complete.yaml
  4. Update the stage
    kubectl apply -f pod-complete.yaml
  5. Create job spec
    cat <<EOF > job1.yaml
    apiVersion: batch/v1
    kind: Job
    metadata:
      labels:
        batch.kubernetes.io/job-name: job1
        job-name: job1
      name: job1
      namespace: default
    spec:
      backoffLimit: 0
      completionMode: Indexed
      completions: 2
      manualSelector: false
      parallelism: 2
      podReplacementPolicy: TerminatingOrFailed
      suspend: false
      template:
        metadata:
          creationTimestamp: null
          labels:
            batch.kubernetes.io/job-name: job1
            job-name: job1
        spec:
          containers:
          - image: ubuntu
            imagePullPolicy: IfNotPresent
            name: test
            resources:
              limits:
                cpu: 100m
                memory: 512M
                nvidia.com/gpu: "8"
              requests:
                cpu: 100m
                memory: 512M
                nvidia.com/gpu: "8"
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
          dnsPolicy: ClusterFirst
          restartPolicy: OnFailure
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
    EOF
  6. Deploy the job and check pod status
    kubectl apply -f job1.yaml; sleep 1; kubectl get po
  7. Observe that the pods are marked Completed immediately (and not Running, as I would expect).
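
To double-check that the updated stage actually reached the cluster, the delay stanza can be read back (this assumes the Stage CRD installed by kwok.yaml and standard kubectl JSONPath output):

```console
# Inspect the delay configuration of the applied pod-complete Stage.
$ kubectl get stage pod-complete -o jsonpath='{.spec.delay}'
```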

Anything else we need to know?

No response

Kwok version

kwok version 0.5.2

OS version

```console
# On Darwin:
$ uname -a
Darwin ds-mlt 23.5.0 Darwin Kernel Version 23.5.0: Wed May 1 20:12:58 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6000 arm64
```
wzshiming commented 1 week ago

Yes, this bug was fixed in #1108, but no release contains it yet.

You can work around it by deleting delay.durationFrom and delay.jitterDurationFrom.
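
For illustration, a sketch of the delay stanza with that workaround applied (only the fields relevant here are shown; 10000 is the value from the report above):

```yaml
# Workaround sketch: drop the *From fields so the fixed duration applies to
# every pod, regardless of the pod-complete.stage.kwok.x-k8s.io/delay annotation.
delay:
  durationMilliseconds: 10000
```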

dmitsh commented 1 week ago

I removed delay.durationFrom and delay.jitterDurationFrom from pod-complete.yaml, but it didn't help: the pods still get marked Completed right away.

dmitsh commented 1 week ago

I guess I should wait for the next KWOK release and test again then.