kubernetes-retired / kube-batch

A batch scheduler of kubernetes for high performance workload, e.g. AI/ML, BigData, HPC
Apache License 2.0

Preemptor lost in preemption action #949

Closed lowang-bh closed 1 year ago

lowang-bh commented 3 years ago

What this PR does / why we need it: Fixes issue #950. The preemption action first tries to preempt from other jobs, and in doing so it pops the preemptor from the preemptorTasks queue, which records the preemptor tasks. As a result, when no other job can be preempted and the action falls back to preempting other tasks in the same job, preemptorTasks is missing at least one preemptor task.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #950. When preempting between tasks of the same job, the action should compare the tasks' priorities within that job.

Special notes for your reviewer:

Release note:

k8s-ci-robot commented 3 years ago

Welcome @lowang-bh!

It looks like this is your first PR to kubernetes-sigs/kube-batch 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/kube-batch has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. :smiley:

k8s-ci-robot commented 3 years ago

Hi @lowang-bh. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
k82cn commented 3 years ago

@lowang-bh would you help to share more background of this PR?

k82cn commented 3 years ago

/ok-to-test

lowang-bh commented 3 years ago

> @lowang-bh would you help to share more background of this PR?

Consider preemption between tasks of the same job. The action first tries to preempt from other jobs and pops the preemptor from preemptorTasks, which records the preemptor tasks. So when no other job can be preempted and the action goes on to preempt other tasks in the same job, preemptorTasks is missing at least one preemptor task. For example:

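# Submit a default-priority job and, ten seconds later, a high-priority job
# whose pods use the same PodGroup (group1).
create(){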
    kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  annotations:
    useKubeBatch: "true"
  name: preeptee
  namespace: default
spec:
  backoffLimit: 2
  completions: 2
  parallelism: 2
  ttlSecondsAfterFinished: 600  
  template:
    metadata:
      annotations:
        scheduling.k8s.io/group-name: group1
    spec:
      containers:
      - image: busybox
        imagePullPolicy: IfNotPresent
        name: busybox
        command: ['/bin/sh']
        args: ['-c', 'sleep 160']        
        resources:
          requests:
            cpu: 1000m
          #limits:
          #  nvidia.com/gpu: 3             
      restartPolicy: Never
      terminationGracePeriodSeconds: 5
      schedulerName: kube-batch
---
apiVersion: scheduling.incubator.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: group1
  namespace: default
spec:
  minMember: 1
  queue: default
EOF

    # Give the first job time to be scheduled before submitting the preemptor.
    sleep 10
    kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: preeptor
  namespace: default
spec:
  backoffLimit: 2
  completions: 2
  parallelism: 2
  ttlSecondsAfterFinished: 600  
  template:
    metadata:
      annotations:
        scheduling.k8s.io/group-name: group1
    spec:
      containers:
      - image: busybox
        imagePullPolicy: IfNotPresent
        name: busybox
        command: ['/bin/sh']
        args: ['-c', 'sleep 120']        
        resources:
          requests:
            cpu: 1000m
      restartPolicy: Never
      terminationGracePeriodSeconds: 5
      schedulerName: kube-batch
      # Assumes a PriorityClass named high-priority already exists in the cluster.
      priorityClassName: high-priority
EOF
}
lowang-bh commented 3 years ago

/retest

k8s-triage-robot commented 3 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 3 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 3 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

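This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue or PR with `/reopen`
- Mark this issue or PR as fresh with `/remove-lifecycle rotten`
- Offer to help out with Issue Triage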

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot commented 3 years ago

@k8s-triage-robot: Closed this PR.

In response to [this](https://github.com/kubernetes-sigs/kube-batch/pull/949#issuecomment-958684460):

>The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
>This bot triages issues and PRs according to the following rules:
>- After 90d of inactivity, `lifecycle/stale` is applied
>- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
>- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
>You can:
>- Reopen this issue or PR with `/reopen`
>- Mark this issue or PR as fresh with `/remove-lifecycle rotten`
>- Offer to help out with [Issue Triage][1]
>
>Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
>/close
>
>[1]: https://www.kubernetes.dev/docs/guide/issue-triage/

lowang-bh commented 1 year ago

/reopen

k8s-ci-robot commented 1 year ago

@lowang-bh: Reopened this PR.

In response to [this](https://github.com/kubernetes-sigs/kube-batch/pull/949#issuecomment-1555905649):

>/reopen

k8s-ci-robot commented 1 year ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: lowang-bh

Once this PR has been reviewed and has the lgtm label, please assign k82cn for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

- **[OWNERS](https://github.com/kubernetes-sigs/kube-batch/blob/master/OWNERS)**

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.

k8s-ci-robot commented 1 year ago

@lowang-bh: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-kube-batch-verify | f022283c0d1714e0dae7c25f1a03ad3732bddd16 | link | true | `/test pull-kube-batch-verify` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

mrbobbytables commented 1 year ago

Repo is to be archived, for more information please see: https://github.com/kubernetes/org/issues/4200

/close

k8s-ci-robot commented 1 year ago

@mrbobbytables: Closed this PR.

In response to [this](https://github.com/kubernetes-sigs/kube-batch/pull/949#issuecomment-1563135390):

>Repo is to be archived, for more information please see: https://github.com/kubernetes/org/issues/4200
>
>/close