Closed: vsoch closed this issue 3 weeks ago.
To ensure I can fully reproduce the problem, let me confirm the testing procedure:
May I know each Job's replica count? And I suppose each PodGroup's spec is basically the same (with scheduleTimeoutSeconds=10 and minMember equal to the Job's replica count)?
Also, are you running on an 8-node cluster? And what's the CPU capacity of each node?
Lastly, we recently introduced a perf fix for coscheduling; the latest master and the Helm image (v0.28.9) should contain it.
hey @Huang-Wei! I figured this out - the default plugin config in the Helm chart's values.yaml is 10 seconds:
pluginConfig:
- name: Coscheduling
  args:
    permitWaitingTimeSeconds: 10
And for the experiments I was running, even the default of 60 was too low. I bumped this up to 300 seconds, the errors resolved, and I was able to get it working. I should have read the error more closely (at the time of posting this I did not):
rejected due to timeout after waiting 10s at plugin Coscheduling
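For reference, the fix on my end was just raising that value in the same pluginConfig block of the chart values (a sketch of what I changed; 300 is simply the value that worked for my runs, not a general recommendation):

pluginConfig:
- name: Coscheduling
  args:
    permitWaitingTimeSeconds: 300  # raised from the chart default of 10 for the larger bulk-submission runs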
I'm wondering - should that default maybe be upped to something like 120? It would still provide the example of customizing the argument, but without limiting a test case that someone might have, which might be more extensive than a small hello-world case (at least for me it was).
And for the experiments I was running, even the default of 60 was too low.
May I know the size (minMember) of the PodGroup in your experiment?
should that default maybe be upped to something like 120?
120 seems too much as a general default value IMO. Actually, in addition to the plugin-level config, it also honors a PodGroup-level config, which can be specified in the PodGroup spec and takes precedence over the plugin-level one:
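For example, something along these lines (a sketch; the name and values are illustrative, fields per the PodGroup CRD in scheduler-plugins):

apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: my-job-group             # illustrative name
spec:
  minMember: 4                   # hold scheduling until 4 pods of the group are present
  scheduleTimeoutSeconds: 300    # per-PodGroup timeout; takes precedence over permitWaitingTimeSeconds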
My experiment had a few hundred jobs ranging in sizes from 2 to 6. So the groups weren't huge, but the size of the queue was. It would have been hugely helped with kueue, but we wanted to test coscheduling on its own.
120 seems too much as a general default value IMO. Actually, in addition to the plugin-level config, it also honors a PodGroup-level config, which can be specified in the PodGroup spec and takes precedence over the plugin-level one:
Sure, and agreed. And I do that as well - I was generating the PodGroup specs dynamically and arbitrarily decided to put the setting at the config level. You are right that I could have done it the other way around. Anyhoo, we are good to close the issue if you don't see any need for follow-up or changes.
My experiment had a few hundred jobs ranging in sizes from 2 to 6. So the groups weren't huge, but the size of the queue was. It would have been hugely helped with kueue, but we wanted to test coscheduling on its own.
I will find time to simulate it locally. This could be a good test to verify the result of some ongoing work (e.g., #661).
Great! Here is the automation for what we are running - I'm building a tool to collect data about scheduler decisions to add to this, but that should minimally reproduce it (you can change the timeout, or look at earlier runs (the directory names) to find the initial bug): https://github.com/converged-computing/operator-experiments/tree/main/google/scheduler/run10#coscheduling
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After further inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After further inactivity once lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After further inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After further inactivity once lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After further inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After further inactivity once lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
@Huang-Wei the bot closed the issue, but did you ever get to test this?
did you ever get to test this?
Not yet.
Let me re-open it in case anyone can look into it.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After further inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After further inactivity once lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Hi! I want to make sure I'm not doing anything wrong. I bring up a new cluster on GKE:
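Something along these lines (a sketch; the cluster name, zone, node count, and machine type here are placeholders rather than the exact values from my runs):

# placeholder name, zone, size, and machine type
gcloud container clusters create test-cluster \
  --zone us-central1-a \
  --num-nodes 8 \
  --machine-type c2-standard-8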
And then install the scheduler plugin as a custom scheduler:
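Roughly the as-a-second-scheduler Helm chart from the scheduler-plugins repo (a sketch; chart path and namespace as described in the upstream install docs, with the pluginConfig shown above set through the chart values):

# clone the repo and install the chart as a second scheduler
git clone https://github.com/kubernetes-sigs/scheduler-plugins.git
cd scheduler-plugins/manifests/install/charts
helm install scheduler-plugins as-a-second-scheduler/ \
  --create-namespace --namespace scheduler-plugins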
And I am running jobs, about 190 total, that look like variants of this (note each has a PodGroup, Job, and Service):
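A trimmed sketch of one variant (names, sizes, and image are illustrative, the Service is left out, and the pod-group label key and scheduler name are the defaults used by the as-a-second-scheduler chart in recent releases):

apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: job-001
spec:
  minMember: 4                                  # matches the Job's parallelism below
---
apiVersion: batch/v1
kind: Job
metadata:
  name: job-001
spec:
  completions: 4
  parallelism: 4
  template:
    metadata:
      labels:
        scheduling.x-k8s.io/pod-group: job-001  # ties the pods to the PodGroup above
    spec:
      schedulerName: scheduler-plugins-scheduler  # route pods to the coscheduling-enabled scheduler
      restartPolicy: Never
      containers:
      - name: app
        image: busybox                          # placeholder workload image
        command: ["sleep", "30"]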
I think that logic is sane because the first few out of the gate (only 3) run to completion and I have logs:
I can also verify that the other plugins we are testing can run all jobs to completion, so it's not an issue (as far as I can see) with the script that gets the logs, which basically just submits, watches for completion, and saves the log with one request. I get three jobs total that run, then it loops like this forever:
What we are doing that is non-standard is bulk submission at once - do you see any potential gotchas there, or something else? Thanks for the help!