kubernetes-sigs / kwok

Kubernetes WithOut Kubelet - Simulates thousands of Nodes and Clusters.
https://kwok.sigs.k8s.io
Apache License 2.0

delay.durationMilliseconds in pod-complete does not take effect #1146

Open dmitsh opened 4 months ago

dmitsh commented 4 months ago

What happened?

I'm running KWOK v0.5.2 in a cluster. I deployed the set of stages from stage-fast.yaml. Then I updated pod-complete by copying kustomize/stage/pod/general/pod-complete.yaml, setting delay.durationMilliseconds to 10000, and applying it in the cluster. Then I deployed a job. The job does not have the pod-complete.stage.kwok.x-k8s.io/delay annotation. When the job starts, the pods are marked Completed right away:

```console
% k get po
NAME           READY   STATUS      RESTARTS   AGE
job1-0-prkbn   0/1     Completed   0          3s
job1-1-zg6tc   0/1     Completed   0          3s
```

What did you expect to happen?

IIUC, in the absence of the pod-complete.stage.kwok.x-k8s.io/delay annotation, delay.durationMilliseconds specifies how long the pods should run before the status changes to Completed.
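
For reference, the delay stanza being edited looks roughly like this (a sketch from memory of the upstream pod-complete.yaml; the jitter-delay annotation key and surrounding layout are assumptions and may differ in detail):

```yaml
apiVersion: kwok.x-k8s.io/v1alpha1
kind: Stage
metadata:
  name: pod-complete
spec:
  # resourceRef, selector, and next.statusTemplate unchanged from upstream
  delay:
    durationMilliseconds: 10000   # changed from the upstream default
    durationFrom:
      expressionFrom: '.metadata.annotations["pod-complete.stage.kwok.x-k8s.io/delay"]'
    jitterDurationFrom:
      expressionFrom: '.metadata.annotations["pod-complete.stage.kwok.x-k8s.io/jitter-delay"]'
```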

How can we reproduce it (as minimally and precisely as possible)?

  1. Deploy kwok and stages in the cluster
    kubectl apply -f https://github.com/kubernetes-sigs/kwok/releases/download/v0.5.2/kwok.yaml
    kubectl apply -f https://github.com/kubernetes-sigs/kwok/releases/download/v0.5.2/stage-fast.yaml
  2. Get a copy of pod-complete.yaml
    wget https://github.com/kubernetes-sigs/kwok/raw/main/kustomize/stage/pod/general/pod-complete.yaml
  3. Set delay.durationMilliseconds to 10000 in pod-complete.yaml
  4. Update the stage (see the sanity check after this list)
    kubectl apply -f pod-complete.yaml
  5. Create job spec
    cat <<EOF > job1.yaml
    apiVersion: batch/v1
    kind: Job
    metadata:
      labels:
        batch.kubernetes.io/job-name: job1
        job-name: job1
      name: job1
      namespace: default
    spec:
      backoffLimit: 0
      completionMode: Indexed
      completions: 2
      manualSelector: false
      parallelism: 2
      podReplacementPolicy: TerminatingOrFailed
      suspend: false
      template:
        metadata:
          creationTimestamp: null
          labels:
            batch.kubernetes.io/job-name: job1
            job-name: job1
        spec:
          containers:
          - image: ubuntu
            imagePullPolicy: IfNotPresent
            name: test
            resources:
              limits:
                cpu: 100m
                memory: 512M
                nvidia.com/gpu: "8"
              requests:
                cpu: 100m
                memory: 512M
                nvidia.com/gpu: "8"
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
          dnsPolicy: ClusterFirst
          restartPolicy: OnFailure
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
    EOF
  6. Deploy the job and check pod status
    kubectl apply -f job1.yaml; sleep 1; kubectl get po
  7. Observe that the pods are completed (and not running as I would expect).
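
As the sanity check mentioned in step 4: before creating the job, you can confirm the edited stage actually landed in the cluster (Stage is a cluster-scoped CRD installed by the kwok.yaml above; this is plain kubectl, nothing kwok-specific):

```console
# Confirm the stage update was applied before creating the job
$ kubectl get stage pod-complete -o jsonpath='{.spec.delay}{"\n"}'
```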

Anything else we need to know?

No response

Kwok version

kwok version 0.5.2

OS version

```console
$ uname -a
Darwin ds-mlt 23.5.0 Darwin Kernel Version 23.5.0: Wed May 1 20:12:58 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6000 arm64
```

wzshiming commented 4 months ago

Yes, this bug was fixed in #1108, but no release contains it yet.

You can work around it by deleting delay.durationFrom and delay.jitterDurationFrom.
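
That is, the delay stanza would be trimmed to something like this (a sketch; all other fields left as upstream):

```yaml
spec:
  delay:
    # Workaround for the bug fixed in #1108: drop the annotation-driven
    # sources and keep only the static delay.
    durationMilliseconds: 10000
```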

dmitsh commented 4 months ago

I removed delay.durationFrom and delay.jitterDurationFrom from pod-complete.yaml, but it didn't help. The pod gets completed right away.

dmitsh commented 4 months ago

I guess I should wait for the next KWOK release and test then.

dmitsh commented 4 months ago

I tested this scenario with the latest v0.6.0 release and got the same outcome.
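
For the re-test I applied the v0.6.0 manifests (same URLs as step 1 with the version bumped; the release asset names are assumed unchanged):

```console
kubectl apply -f https://github.com/kubernetes-sigs/kwok/releases/download/v0.6.0/kwok.yaml
kubectl apply -f https://github.com/kubernetes-sigs/kwok/releases/download/v0.6.0/stage-fast.yaml
```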

wzshiming commented 4 months ago

I tested it and it worked as expected.

```console
kind create cluster
helm repo add kwok https://kwok.sigs.k8s.io/charts/
helm upgrade --namespace kube-system --install kwok kwok/kwok
helm upgrade --install kwok kwok/stage-fast

wget https://github.com/kubernetes-sigs/kwok/raw/main/kustomize/stage/pod/general/pod-complete.yaml
# Set delay.durationMilliseconds to 10000 in pod-complete.yaml
# Set delay.jitterDurationMilliseconds to 20000 in pod-complete.yaml

kubectl apply -f pod-complete.yaml

# Create a node and a job, same as yours (node sketch below)

# Observe that the pods complete between 10 and 20 seconds
```
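
For completeness, a fake node able to fit one of the job's 8-GPU pods might look like the sketch below (adapted from the kwok docs' node example; the docs' kwok.x-k8s.io/node taint is omitted here because the job above carries no matching toleration, and with parallelism: 2 you would create two such nodes):

```yaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    kwok.x-k8s.io/node: fake   # lets kwok manage this node
  labels:
    type: kwok
  name: kwok-node-0
status:
  allocatable:
    cpu: "32"
    memory: 256Gi
    nvidia.com/gpu: "8"
    pods: "110"
  capacity:
    cpu: "32"
    memory: 256Gi
    nvidia.com/gpu: "8"
    pods: "110"
```
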
dmitsh commented 4 months ago

Could you explain the difference between durationMilliseconds and jitterDurationMilliseconds? Is this documented somewhere? I thought durationMilliseconds is the time the pod is in the Running state before switching to Completed, and the jitter is a random value (between 0 and jitterDurationMilliseconds) added to durationMilliseconds.

wzshiming commented 4 months ago

> I thought durationMilliseconds is the time the pod is in the Running state before switching to Completed, and the jitter is a random value (between 0 and jitterDurationMilliseconds) added to durationMilliseconds.

Yes, the initial definition was the same as what you said. But the need to set an absolute time for forced deletion is the reason it behaves differently now: jitterDurationMilliseconds is treated as the upper bound of the total delay (hence the pods completing between 10 and 20 seconds above), not as a random offset added on top of durationMilliseconds.

https://github.com/kubernetes-sigs/kwok/blob/35bdf70f89e28b89ace34e4921e590c1169f5cc5/kustomize/stage/pod/general/pod-delete.yaml#L22-L23
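
Those lines set the jitter from the pod's deletionTimestamp, i.e. an absolute point in time rather than an offset (excerpt paraphrased from memory of pod-delete.yaml; the exact lines at the pinned commit may differ):

```yaml
# the forced-deletion deadline comes straight from the pod itself
jitterDurationFrom:
  expressionFrom: '.metadata.deletionTimestamp'
```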

FYI, .metadata.deletionGracePeriodSeconds only covers the graceful deletion period and does not work for this case.

In fact, the definition of the API leaves a lot to be desired; it is planned to introduce CEL as a supplement to JQ.

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 week ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten