kubernetes-sigs / scheduler-plugins

Repository for out-of-tree scheduler plugins based on the scheduler framework.
Apache License 2.0

Rejection due to timeout / unreserve #722

Closed: vsoch closed this issue 3 weeks ago

vsoch commented 6 months ago

Hi! I want to make sure I'm not doing anything wrong. I bring up a new cluster on GKE:

gcloud container clusters create test-cluster \
    --threads-per-core=1 \
    --placement-type=COMPACT \
    --num-nodes=8 \
    --no-enable-autorepair \
    --no-enable-autoupgrade \
    --region=us-central1-a \
    --project=${GOOGLE_PROJECT} \
    --machine-type=c2d-standard-8

And then install the scheduler plugin as a custom scheduler:

git clone --depth 1 https://github.com/kubernetes-sigs/scheduler-plugins /tmp/sp
cd /tmp/sp/manifests/install/charts
helm install coscheduling as-a-second-scheduler/
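
A quick sanity check that the second scheduler and the PodGroup controller came up (a rough sketch; exact deployment names depend on the chart version):

# the chart deploys a second scheduler plus a controller that reconciles PodGroups
kubectl get deploy -A | grep scheduler-plugins
kubectl get pods -A | grep scheduler-plugins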

And then I run jobs, about 190 total, that look like variants of this (note each has a PodGroup, Job, and Service):

apiVersion: v1
kind: Service
metadata:
  name: s906
spec:
  clusterIP: None
  selector:
    job-name: job-0-9-size-6
---
apiVersion: batch/v1
kind: Job
metadata:
  # name will be derived based on iteration
  name: job-0-9-size-6
spec:
  completions: 6
  parallelism: 6
  completionMode: Indexed
  # alpha in 1.30 so not supported yet
  # successPolicy:
  #  - succeededIndexes: "0"
  template:
    metadata:
      labels:
        app: job-0-9-size-6
        scheduling.x-k8s.io/pod-group: job-0-9-size-6

    spec:
      subdomain: s906
      schedulerName: scheduler-plugins-scheduler
      restartPolicy: Never
      containers:
      - name: example-workload
        image: bash:latest
        resources:
          limits:
            cpu: "2"
          requests:
            cpu: "2"
        command:
        - bash
        - -c
        - |
          if [ $JOB_COMPLETION_INDEX -ne "0" ]
            then
              sleep infinity
          fi
          echo "START: $(date +%s%N | cut -b1-13)"
          for i in 0 1 2 3 4 5
          do
            gotStatus="-1"
            wantStatus="0"             
            while [ $gotStatus -ne $wantStatus ]
            do                                       
              ping -c 1 job-0-9-size-6-${i}.s906 > /dev/null 2>&1
              gotStatus=$?                
              if [ $gotStatus -ne $wantStatus ]; then
                echo "Failed to ping pod job-0-9-size-6-${i}.s906, retrying in 1 second..."
                sleep 1
              fi
            done                                                         
            echo "Successfully pinged pod: job-0-9-size-6-${i}.s906"
          done
          echo "DONE: $(date +%s%N | cut -b1-13)"
          # echo "DONE: $(date +%s)"

---
# PodGroup CRD spec
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: job-0-9-size-6
spec:
  scheduleTimeoutSeconds: 10
  minMember: 6

I think that logic is sane because the first few out of the gate (only 3) run to completion and I have logs:


===
Output: recorded-at: 2024-04-17 12:17:50.257757
START: 1713356244
Failed to ping pod job-0-0-size-2-0.s002, retrying in 1 second...
Failed to ping pod job-0-0-size-2-0.s002, retrying in 1 second...
Successfully pinged pod: job-0-0-size-2-0.s002
Successfully pinged pod: job-0-0-size-2-1.s002
DONE: 1713356246

===
Times: recorded-at: 2024-04-17 12:17:50.257899
{"end_time": "2024-04-17 12:17:50.257629", "start_time": "2024-04-17 12:17:50.118119", "batch_done_submit_time": "2024-04-17 12:17:49.217638", "submit_time": "2024-04-17 12:17:21.270262", "submit_to_completion": 28.987367, "total_time": 0.13951, "uid": "scheduler-plugins-scheduler-batch-0-iter-0-size-2"}

I can also verify that other plugins we are testing can run all jobs to completion, so it's not an issue (as far as I can see) with the script that collects the logs, which basically just submits and then watches for completion and saves the log with one request. I get three jobs total that run, and then it loops like this forever:

I0417 12:22:39.317221       1 coscheduling.go:215] "Pod is waiting to be scheduled to node" pod="default/job-0-0-size-4-0-gkv47" nodeName="gke-test-cluster-default-pool-b4ebeb32-trmp"
E0417 12:22:40.700252       1 schedule_one.go:1004] "Error scheduling pod; retrying" err="rejected due to timeout after waiting 10s at plugin Coscheduling" pod="default/job-0-0-size-4-2-h4mv8"
E0417 12:22:40.768269       1 schedule_one.go:1004] "Error scheduling pod; retrying" err="rejection in Unreserve" pod="default/job-0-0-size-4-0-gkv47"
I0417 12:22:43.604752       1 trace.go:236] Trace[927737765]: "Scheduling" namespace:default,name:job-0-0-size-4-1-gpp59 (17-Apr-2024 12:22:43.015) (total time: 589ms):
Trace[927737765]: ---"Computing predicates done" 588ms (12:22:43.604)
Trace[927737765]: [589.250812ms] [589.250812ms] END

What we are doing that is non-standard is submitting all of the jobs in bulk at once - do you see any potential gotchas there, or something else? Thanks for the help!

Huang-Wei commented 6 months ago

To ensure I can fully reproduce the problem, a few questions about the testing procedure:

May I know each Job's replica count? And I suppose each PodGroup's spec is basically the same (w/ scheduleTimeoutSeconds=10 and minMember equal to the Job's replica count)?

Also, are you running on an 8-node cluster? And what is the CPU capacity of each node?

Lastly, we recently introduced a perf fix for coscheduling; the latest master and Helm image (v0.28.9) should contain it.
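
Upgrading an existing release in place might look roughly like this (just a sketch; it assumes the chart exposes `scheduler.image` and `controller.image` values and uses the registry.k8s.io image path, so please double-check against your chart's values.yaml):

helm upgrade coscheduling as-a-second-scheduler/ \
  --set scheduler.image=registry.k8s.io/scheduler-plugins/kube-scheduler:v0.28.9 \
  --set controller.image=registry.k8s.io/scheduler-plugins/controller:v0.28.9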

vsoch commented 6 months ago

Hey @Huang-Wei! I figured this out - the default pluginConfig in the Helm chart's values.yaml sets the wait to 10 seconds:

pluginConfig:
- name: Coscheduling
  args:
    permitWaitingTimeSeconds: 10

And for the experiments I was running, even the default of 60 was too low. I bumped this up to 300 seconds, the errors resolved, and I was able to get it working. I should have read the error more closely (at the time of posting this I did not):

rejected due to timeout after waiting 10s at plugin Coscheduling
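
For reference, the override I ended up with looked roughly like this (a sketch; the layout mirrors the chart's default values.yaml shown above, and 300 is simply what worked for my experiment size):

# my-values.yaml, passed via `helm install coscheduling as-a-second-scheduler/ -f my-values.yaml`
pluginConfig:
- name: Coscheduling
  args:
    # raise the permit wait so pods from a large bulk submission are not
    # rejected while waiting for the rest of their pod group to be schedulable
    permitWaitingTimeSeconds: 300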

I'm wondering - should that default maybe be upped to something like 120? It would still serve as an example of how to customize the argument, but without limiting a test case someone might have that is more extensive than a small hello-world case (at least for me it was).

Huang-Wei commented 6 months ago

And for the experiments I was running, even the default of 60 was too low.

May I know the size (minMember) of the PodGroup in your experiment?

should that default maybe be upped to something like 120?

120 seems too much as a general default value IMO. Actually, in addition to the plugin-level config, it also honors a PodGroup-level config, which can be specified in the PodGroup spec and takes precedence over the plugin-level one:

https://github.com/kubernetes-sigs/scheduler-plugins/blob/3f841b49a5d9e256ebaf3470daf9dcbf6581be24/pkg/util/podgroup.go#L64-L67
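
For example, reusing the PodGroup from the report above, a per-group override would look roughly like this (the 300s value is only illustrative):

apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: job-0-9-size-6
spec:
  minMember: 6
  # per-PodGroup wait; when set, this takes precedence over the
  # plugin-level permitWaitingTimeSeconds
  scheduleTimeoutSeconds: 300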

vsoch commented 6 months ago

My experiment had a few hundred jobs ranging in size from 2 to 6. So the groups weren't huge, but the size of the queue was. It would have been hugely helped by Kueue, but we wanted to test coscheduling on its own.

vsoch commented 6 months ago

120 seems too much as a general default value IMO. Actually in additional to plugin-level config, it also honors PodGroup-level config, which can be specified in the PodGroup spec, and it takes precedence over the plugin-level one:

Sure, agreed. And I do that as well - I was generating the PodGroup specs dynamically and arbitrarily decided to put the setting at the plugin config level. You are right, I could have done it the other way around. Anyhoo, we are good to close the issue if you don't see any need for follow-up or changes.

Huang-Wei commented 6 months ago

My experiment had a few hundred jobs ranging in sizes from 2 to 6. So the groups weren't huge, but the size of the queue was. It would have been hugely helped with kueue, but we wanted to test coscheduling on its own.

I will find time to simulate it locally. This could be a good test to verify the results of some ongoing work (e.g., #661).

vsoch commented 6 months ago

Great! Here is the automation for what we are running - I'm building a tool to collect data about scheduler decisions to add to this, but that should minimally reproduce it (you can change the timeout or look at earlier runs, the directory names, to find the initial bug): https://github.com/converged-computing/operator-experiments/tree/main/google/scheduler/run10#coscheduling

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 1 month ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/scheduler-plugins/issues/722#issuecomment-2354299311):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

vsoch commented 1 month ago

@Huang-Wei the bot closed the issue, but did you ever get to test this?

Huang-Wei commented 1 month ago

did you ever get to test this?

Not yet.

Let me re-open it in case anyone can look into it.

k8s-triage-robot commented 3 weeks ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 3 weeks ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/scheduler-plugins/issues/722#issuecomment-2421515489):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.