chauncey-77 opened this issue 3 years ago
@shivramsrivastava Please give me some feedback, thanks a lot.
@chauncey-77, here is a brief summary of CPU/Memory flow graph building. I hope this helps as far as a basic description of building our CPU/Memory flow graph is concerned. For the rest of the details, you are going to have to read through the code to get a more low-level understanding.
As a quick overview, the Firmament scheduler models the scheduling problem as a constraint-based optimization over a flow network graph. A min-cost flow algorithm is leveraged to derive the implied workload placements. A flow network is a directed graph whose arcs carry flow from source nodes (i.e., pod nodes) to a sink node; such a flow constitutes a feasible solution to the optimization problem. A cost and a capacity associated with each arc constrain the flow and specify preferential routes for it.
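For reference, this is the standard min-cost flow formulation that the description corresponds to (a textbook statement, not taken from the Firmament code):

$$\min_{f} \sum_{(u,v)\in E} c_{uv}\, f_{uv} \quad \text{s.t.} \quad 0 \le f_{uv} \le \mathrm{cap}_{uv}, \qquad \sum_{v:(u,v)\in E} f_{uv} - \sum_{v:(v,u)\in E} f_{vu} = b_u \quad \forall u \in V$$

where $b_u > 0$ at task (pod) nodes, $b_u < 0$ at the sink, and $b_u = 0$ at all intermediate nodes.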
In the Poseidon/Firmament CPU-Memory cost model, the task equivalence class (EC) gets created based on the task’s cpu and memory request. Each machine will have a predefined number of machine ECs (M0EC1, M0EC2, ..., M2EC2) in order to do load distribution across the filtered machines during each scheduling iteration.
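As an illustration of how such a task EC could be keyed off the resource request, here is a hypothetical hashing sketch; the real logic lives in cpu_cost_model.cc and may differ:

```cpp
#include <cstdint>
#include <functional>

// Illustrative only: derive an equivalence-class id by hashing the
// task's CPU and memory request, so that tasks with identical
// requests share one task EC node in the flow graph.
uint64_t TaskEquivClass(uint64_t cpu_millicores, uint64_t mem_bytes) {
  uint64_t h = std::hash<uint64_t>{}(cpu_millicores);
  // Standard hash-combine step (boost-style mixing constant).
  h ^= std::hash<uint64_t>{}(mem_bytes) + 0x9e3779b97f4a7c15ULL + (h << 6) + (h >> 2);
  return h;
}
```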
It is important to highlight that if we have only one arc from the task EC node to a machine node, then there is a chance that all the incoming flows (tasks) flow across that single arc, overloading a single machine with many tasks even though there are machines with a lot of available resources. So, to avoid scheduling many tasks on the same machine, we use multiple arcs between the task EC and machine nodes via intermediate machine EC nodes. We connect the task EC to multiple machine ECs, and these machine ECs are then connected to the corresponding machine node. The capacity on each arc (task EC to machine EC, and machine EC to machine node) is set to 1, and the costs on the arcs between the task EC and the machine ECs are assigned incrementally.
Let us take an example where there are three machines M0, M1 & M2, and each machine has a capacity of 2 flows. Load distribution is achieved by assigning two arcs via the corresponding machine ECs, with the cost on each arc increasing incrementally. In this case, the arcs connecting to the machine ECs for machine M0 have costs of 100 and 200, respectively. The capacity on these arcs would be 1 each. In a nutshell, for each unit of capacity there is a corresponding machine EC. The cost on these arcs increases incrementally in order to achieve load distribution across machines.
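To make the example concrete, here is a minimal sketch of building those arcs. The names (Arc, ConnectTaskEcToMachine) are hypothetical, not the Firmament API, and the zero cost on the machine EC to machine arc is an assumption for illustration only:

```cpp
#include <cstdint>
#include <vector>

struct Arc {
  int src, dst;
  uint64_t capacity, cost;
};

// Hypothetical illustration: connect one task EC to machine `machine`
// via `slots` intermediate machine-EC nodes. Every arc has capacity 1,
// and the cost grows by `cost_step` per slot (100, 200, ... in the
// example above), so the solver fills cheap slots on other machines
// before stacking further tasks onto one machine.
void ConnectTaskEcToMachine(int task_ec, int machine, int slots,
                            uint64_t base_cost, uint64_t cost_step,
                            int* next_node_id, std::vector<Arc>* graph) {
  for (int i = 0; i < slots; ++i) {
    int machine_ec = (*next_node_id)++;         // e.g. M0EC1, M0EC2, ...
    uint64_t cost = base_cost + i * cost_step;  // incrementally increasing
    graph->push_back({task_ec, machine_ec, 1, cost});  // task EC -> machine EC
    graph->push_back({machine_ec, machine, 1, 0});     // machine EC -> machine (cost 0 assumed)
  }
}
```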
So in summary, we essentially create an equivalence class called ‘Ec mem-cpu’ based on the job id hash value; this aggregates all the tasks from a job into one EC. As part of the step to connect the arc from the Job/Task to the EC, we use cpu and mem only for calculating the normalized cost.
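As a rough illustration of what a normalized CPU/memory cost could look like, the sketch below expresses the request as a fraction of machine capacity and scales it to an integer cost. The equal weighting of CPU and memory is an assumption; the actual formula is in cpu_cost_model.cc:

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative normalization: express the request as a fraction of
// the machine's capacity in each dimension, average the two fractions
// (an assumed equal weighting), and scale to an integer cost range.
uint64_t NormalizedCpuMemCost(double cpu_req, double cpu_cap,
                              double mem_req, double mem_cap,
                              uint64_t max_cost = 1000) {
  double cpu_frac = std::min(cpu_req / cpu_cap, 1.0);
  double mem_frac = std::min(mem_req / mem_cap, 1.0);
  return static_cast<uint64_t>(((cpu_frac + mem_frac) / 2.0) * max_cost);
}
```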
@deepak-vij Hi, thanks for your answer, but I have some questions:

1. What does the EC mean in the last sentence ("As part of the step to connect arc from Job/Task to the EC")? The task EC or the machine EC? I guess it is the machine EC, because there is a function in cpu_cost_model.cc called EquivClassToEquivClass that calculates the normalized cost between a task EC and a machine EC, but I am not sure.
2. As you say, Firmament aggregates all the tasks from a job into one EC (the task EC), but in my understanding each k8s pod corresponds to one Firmament job, so when using Firmament in a k8s cluster, is the task EC necessary?
3. Will successfully scheduled pods remain in the flow graph? If a pod/task has been scheduled, will it continue to exist, or are only the unscheduled pods/tasks retained in the flow graph?
4. What is the meaning of 'pu', which appears in many places? Is the capacity of each machine related to max_tasks_per_pu?
> What does the EC mean in the last sentence ("As part of the step to connect arc from Job/Task to the EC")? The task EC or the machine EC?

The EC (Equivalence Class) in the last sentence refers to the machine ECs. The normalized CPU/Mem cost is assigned to the arc between the Job EC and the machine ECs.
> As you say, Firmament aggregates all the tasks from a job into one EC (the task EC), but in my understanding each k8s pod corresponds to one Firmament job, so when using Firmament in a k8s cluster, is the task EC necessary?

Typically, service jobs have one task (or pod) per job. However, batch jobs have multiple tasks per job. The task EC is needed to support batch jobs (gang scheduling etc.).
> Will successfully scheduled pods remain in the flow graph? If a pod/task has been scheduled, will it continue to exist, or are only the unscheduled pods/tasks retained in the flow graph?

The whole cluster state stays in memory. Currently, the Firmament cluster state (flow graph) is not persistence-backed.
> What is the meaning of 'pu', which appears in many places? Is the capacity of each machine related to max_tasks_per_pu?

Yes, that is correct (see the sketch below).
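For concreteness, here is a minimal sketch of that relationship. A PU is a processing unit (roughly, a core/hyperthread leaf in the machine's resource topology); the formula below, a machine's schedulable capacity as its PU count times max_tasks_per_pu, is my reading of the relationship, not code quoted from Firmament:

```cpp
#include <cstdint>

// Hypothetical illustration of the relationship the answer confirms:
// a machine's schedulable task capacity scales with its number of PUs
// (processing units in the resource topology) times max_tasks_per_pu.
uint64_t MachineTaskCapacity(uint64_t num_pus, uint64_t max_tasks_per_pu) {
  return num_pus * max_tasks_per_pu;
}
```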
@deepak-vij Hi, thanks for your answer, but I still have doubts:
> The whole cluster state stays in memory. Currently, the Firmament cluster state (flow graph) is not persistence-backed.

If the pods which have been successfully scheduled still remain in the flow graph, will that affect the pods to be scheduled in the next scheduling rounds? I mean, can it happen that the solver returns a max flow through pods that have already been successfully scheduled?
> Yes, that is correct.

When I use the Poseidon scheduler to schedule a k8s deployment with 50 pods, some of the pods are placed on a machine which already has 114 pods, although the value of max_tasks_per_pu is 110. It seems something is wrong, but I haven't figured it out.
Is there a way to visualize the flow graph currently existing in memory?
> If the pods which have been successfully scheduled still remain in the flow graph, will that affect the pods to be scheduled in the next scheduling rounds? I mean, can it happen that the solver returns a max flow through pods that have already been successfully scheduled?
The reason we have all the pods (scheduled and unscheduled) in the flow graph is that Firmament has the capability of doing rescheduling in every run. However, rescheduling is turned off by default, and we have not tested the core Firmament rescheduling logic extensively at this time. It needs some work to smooth out the kinks, which is why we have left it alone for now.
> When I use the Poseidon scheduler to schedule a k8s deployment with 50 pods, some of the pods are placed on a machine which already has 114 pods, although the value of max_tasks_per_pu is 110. It seems something is wrong, but I haven't figured it out.
I am not sure about this; there may be a boundary-condition bug or an incorrect formula. You would have to look into the code to resolve this.
@deepak-vij Hi! These days I have learned a lot about the Poseidon/Firmament scheduler and formed some understanding, but I don't know whether it is correct. 1) After generating a graph based on the cpu_mem cost (specifically, the cpu_mem cost and cpu_mem balanced cost in the code), it cannot be guaranteed that the scheduling result is globally optimal when more than one pod is to be scheduled in a scheduling period.
> It is important to highlight that if we have only one arc from the task EC node to a machine node, then there is a chance that all the incoming flows (tasks) flow across that single arc, overloading a single machine with many tasks even though there are machines with a lot of available resources.
As you say, in order to avoid all pods being scheduled onto the node with the minimum cost, the machine EC is introduced. So the cost between a machine EC and its machine will affect the scheduling result. Are there any considerations when assigning the cost between a machine EC and its machine?
2) Given the situation mentioned in 1), what is the flow-based scheduler's advantage compared with a queue-based scheduler, given its lack of global optimality?
Hi, I am very interested in your work on modeling the scheduling problem using the flow graph method, especially the cpu_cost_model, and I am trying to understand the details of the modeling process, i.e., the specific process of building the flow graph from scratch based on the state of the cluster and the scheduling strategy (cpu_cost_model). Unfortunately, I did not find much documentation about this, so I spent much time reading the code. While reading it, I encountered some difficulties: for example, I can understand the meaning of some functions, but it is difficult for me to string them together, and I still don't know the meaning of the PU node. I think relevant documentation would be very helpful, so I opened this issue. I want to know whether there are closely related documents about the details of cpuCostModel. It would be even better if there were a simple but realistic modeling case for cpu_cost_model. I really want to figure out the specific process of building the flow graph from scratch based on the state of the cluster and the cpu_mem scheduling strategy. Or, if you have any better suggestions to help me understand these things, thank you a lot. Thanks again.