koordinator-sh / koordinator

A QoS-based scheduling system that brings optimal layout and status to workloads such as microservices, web services, big data jobs, AI jobs, etc.
https://koordinator.sh
Apache License 2.0

[BUG] Coscheduling Timeout Cannot exceed 15m #1919

Open · ls-2018 opened this issue 7 months ago

ls-2018 commented 7 months ago
The timeout has two settings:
1: the scheduler parameters
2: the pod declaration
Should we add logic so that the configured timeout period cannot exceed 15m?
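To make the two knobs concrete, here is a sketch of where each one lives. The apiVersion, field names, and values are illustrative and may differ across koordinator versions:

# 1) Scheduler-level default, via the Coscheduling plugin args
#    (field name assumed to be defaultTimeout)
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: koord-scheduler
    pluginConfig:
      - name: Coscheduling
        args:
          apiVersion: kubescheduler.config.k8s.io/v1beta2
          kind: CoschedulingArgs
          defaultTimeout: 600s
---
# 2) Per-pod override, via the annotation mentioned above; anything above 15m
#    runs into the kube-scheduler Permit cap described below
apiVersion: v1
kind: Pod
metadata:
  name: pod-example1
  annotations:
    gang.scheduling.koordinator.sh/waiting-time: "30m"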

What happened:

Coscheduling's default timeout can be set in the scheduler parameters, and it can also be set per pod with the gang.scheduling.koordinator.sh/waiting-time annotation; both determine how long coscheduling waits.

However, if that time exceeds 15 minutes, the upstream kube-scheduler removes the pod from waitingPods (the scheduler framework caps Permit waits at 15 minutes) and it is never scheduled again.

Suppose group a has a subgroup b, and a pod for a has been created while b's has not. The pod of a enters waitingPods, waits to be allowed, and then binds. But if b takes longer to become ready than the allowed wait, a's pod will never bind.

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

1.yaml

apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: a
  namespace: default
  annotations:
    "gang.scheduling.koordinator.sh/total-number": "10"
    "gang.scheduling.koordinator.sh/groups": '["b"]'
spec:
  scheduleTimeoutSeconds: 3000
  minMember: 1
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-example1
  namespace: default
  labels:
    pod-group.scheduling.sigs.k8s.io: a
spec:
  schedulerName: koord-scheduler
  containers:
  - command:
    - "sleep"
    - "365d"
    image: busybox
    name: curlimage

2.yaml

apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: b
  namespace: default
spec:
  scheduleTimeoutSeconds: 3000
  minMember: 1
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-example2
  namespace: default
  labels:
    pod-group.scheduling.sigs.k8s.io: b
spec:
  schedulerName: koord-scheduler
  containers:
  - command:
    - sleep
    - 365d
    image: busybox
    name: curlimage

Anything else we need to know?:

Environment:

ls-2018 commented 7 months ago

I look forward to your reply, and I would be happy to work with you to solve this problem.

ZiMengSheng commented 7 months ago

Can you run kubectl describe pod pod-example1 -n default and give me the message about why pod-example1 is unschedulable?

ZiMengSheng commented 7 months ago

In your example, PodGroup a has scheduleTimeoutSeconds configured as 10, so in theory the PodGroup would time out after 10 seconds. However, in our current implementation, the PodGroup's timeout configuration only means the maximum wait time since the first pod reaches the Permit stage; it is not persisted as PodGroup/pod status in the apiserver, and it does not block the pod scheduling process. So could you give me more detail about why the pod is unschedulable?

ls-2018 commented 7 months ago

Sorry, there was an error in the YAML I provided. I will fix it and provide more information.

ls-2018 commented 7 months ago
➜ /Users/acejilam/Desktop/koordinator/bug/case git:(cn) ✗ 🐥 k delete -f .              
podgroup.scheduling.sigs.k8s.io "a" deleted
pod "pod-example1" deleted
podgroup.scheduling.sigs.k8s.io "b" deleted
pod "pod-example2" deleted
➜ /Users/acejilam/Desktop/koordinator/bug/case git:(cn) ✗ 🐥 k apply -f 1.yaml          
podgroup.scheduling.sigs.k8s.io/a created
pod/pod-example1 created
➜ /Users/acejilam/Desktop/koordinator/bug/case git:(cn) ✗ 🐥 date                  
Thu Mar  7 13:38:29 CST 2024
➜ /Users/acejilam/Desktop/koordinator/bug/case git:(cn) ✗ 🐥 sleep 2000 && k apply -f 1.yaml && date
^Z
[1]  + 79592 suspended  sleep 2000
➜ /Users/acejilam/Desktop/koordinator/bug/case git:(cn) ✗ 🐥 date
Thu Mar  7 14:06:39 CST 2024
➜ /Users/acejilam/Desktop/koordinator/bug/case git:(cn) ✗ 🐥 k apply -f 1.yaml && date
podgroup.scheduling.sigs.k8s.io/a unchanged
pod/pod-example1 unchanged
Thu Mar  7 14:06:46 CST 2024
➜ /Users/acejilam/Desktop/koordinator/bug/case git:(cn) ✗ 🐥 k apply -f 2.yaml && date
podgroup.scheduling.sigs.k8s.io/b created
pod/pod-example2 created
Thu Mar  7 14:06:55 CST 2024
➜ /Users/acejilam/Desktop/koordinator/bug/case git:(cn) ✗ 🐥 k describe pod pod-example1
Name:             pod-example1
Namespace:        default
Priority:         0
Service Account:  default
Node:             <none>
Labels:           pod-group.scheduling.sigs.k8s.io=a
Annotations:      <none>
Status:           Pending
IP:               
IPs:              <none>
Containers:
  curlimage:
    Image:      busybox
    Port:       <none>
    Host Port:  <none>
    Command:
      sleep
      365d
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-c9nvx (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  kube-api-access-c9nvx:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age    From             Message
  ----     ------            ----   ----             -------
  Warning  FailedScheduling  13m    koord-scheduler  rejected due to timeout after waiting 15m0s at plugin Coscheduling
  Warning  FailedScheduling  13m    koord-scheduler  running PreFilter plugin "Coscheduling": %!!(MISSING)w(<nil>)
  Warning  FailedScheduling  8m16s  koord-scheduler  running PreFilter plugin "Coscheduling": %!!(MISSING)w(<nil>)
➜ /Users/acejilam/Desktop/koordinator/bug/case git:(cn) ✗ 🐥 
ls-2018 commented 7 months ago

As long as you sleep for a while in between the two applies, you can reproduce it.

[image attached]
ls-2018 commented 7 months ago

/cc @ZiMengSheng

ZiMengSheng commented 6 months ago

Can you give me the scheduler log about why pod-example1's Coscheduling PreFilter failed? The current PreFilter failure message is a little confusing due to a known kube-scheduler bug.

ZiMengSheng commented 6 months ago

I made a test and got the point. PodGroup default/a has a total number of 10 and a min member of 1.

With totalChildrenNum's help, when the last pod arrives and all of childrenScheduleRoundMap's values become equal to scheduleCycle, the gang's scheduleCycle is incremented by 1, which starts a new schedule cycle.

In our example, pod-example1 gets rejected due to the timeout while waiting for PodGroup b. pod-example1's own schedule cycle is advanced to 1 after PreFilter. The next time pod-example1 enters a scheduling cycle, the gang's scheduleCycle is not incremented, because the number of children whose schedule cycle equals the gang's scheduleCycle is one, which is less than totalChildrenNum; thus PreFilter fails (see the sketch below).

A new schedule cycle will never arrive until you submit enough children of PodGroup a. So could you just submit all the children?
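To make the cycle logic above concrete, here is a minimal, self-contained sketch. It is not the actual koordinator code; the names are simplified stand-ins for the totalChildrenNum, scheduleCycle, and childrenScheduleRoundMap fields mentioned above:

package main

import "fmt"

// gang is a simplified stand-in for the gang state discussed above.
type gang struct {
    totalChildrenNum   int            // from gang.scheduling.koordinator.sh/total-number (10 here)
    scheduleCycle      int            // the gang's current schedule cycle
    childScheduleCycle map[string]int // pod name -> last cycle the pod was tried in
}

// preFilter mimics the cycle check: a child whose recorded cycle has already
// caught up with the gang's cycle is rejected until the gang cycle advances.
func (g *gang) preFilter(pod string) error {
    if g.childScheduleCycle[pod] >= g.scheduleCycle {
        return fmt.Errorf("%s: child cycle %d has caught up with gang cycle %d",
            pod, g.childScheduleCycle[pod], g.scheduleCycle)
    }
    g.childScheduleCycle[pod] = g.scheduleCycle

    // The gang cycle only advances once all declared children have been tried
    // in the current cycle; with total-number=10 and a single child submitted,
    // that never happens, so the retried pod stays rejected forever.
    tried := 0
    for _, c := range g.childScheduleCycle {
        if c >= g.scheduleCycle {
            tried++
        }
    }
    if tried >= g.totalChildrenNum {
        g.scheduleCycle++
    }
    return nil
}

func main() {
    g := &gang{totalChildrenNum: 10, scheduleCycle: 1, childScheduleCycle: map[string]int{}}
    fmt.Println(g.preFilter("pod-example1")) // first attempt: <nil>
    fmt.Println(g.preFilter("pod-example1")) // retry after the Permit timeout: rejected
}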

ls-2018 commented 6 months ago

@ZiMengSheng In group a I specified minMember as 1. If I still need to increase the number of pods, that is not consistent with my expectation.

ZiMengSheng commented 6 months ago

OK, your opinion is right and welcome. There are some inconsistencies in the design; we need to fix them in the code and the design doc. Do you have the time and interest to fix it?

ls-2018 commented 5 months ago

I'd love to fix it, but I don't have a specific idea of how best to do it. I'd also like to hear from the community.

eahydra commented 5 months ago

I'd love to fix it, but I don't have a specific idea of how best to do it. I'd also like to hear from the community.

Welcome to contribute! Just do it!

jasonliu747 commented 4 months ago

@ls-2018 any updates? ;)