koordinator-sh / koordinator

A QoS-based scheduling system brings optimal layout and status to workloads such as microservices, web services, big data jobs, AI jobs, etc.
https://koordinator.sh
Apache License 2.0
1.33k stars, 328 forks

[proposal] How to Better Integrate Mid-Tier Runtime Hooks with Batch Resources? #2161

Open tan90github opened 2 months ago

tan90github commented 2 months ago

What is your proposal:

Following proposal #1762, we have recently been implementing the runtimeHook for mid-tier resources. During development we found that the runtimeHook implementations for mid-tier and batch resources are likely to overlap, and some of the details are hard to separate cleanly. We would like to hear from the community how compatibility between mid-tier and batch was handled when batch resources were first developed.

Why is this needed:

As new features such as CPU normalization are added to batch resources, integrating mid-tier resources may become increasingly complex. We would welcome any advice from the community on how the cgroup implementation should be developed for mid-tier and batch resources.

Is there a suggested solution, if so, please add it:

We are considering several development approaches and would appreciate any recommendations from the community:

Plan 1: Following the approach of the earlier PR #1984, reuse the existing batchResource code (hooks, reconcilers, and rules) as much as possible. This would likely lead to a single unified extended-resource model in which feature gates cannot distinguish batch from mid-tier resources, so rules shared by the two could cause problems, especially once mid-tier and batch resources are used together.

Plan 2: Copy all code from the batchresource directory and rename 'batch' to 'mid'. This keeps the two resource types distinguishable, but at the cost of heavy code duplication.

Plan 3: Refactor the batchresource code to extract the fundamental functions so that both mid-tier and batch resources can use them. This avoids duplication but carries greater risk, because it changes the existing batch code.

Other potential plans that we may not have considered yet...
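
To make Plan 3 more concrete, here is a rough Go sketch of what "extracting fundamental functions" could look like: one shared function computes the CPU cgroup values, and each resource type supplies only a small descriptor plus its own feature gate. Every name here (`Tier`, `Hook`, `ComputeCPU`, the gate names `BatchResource` and `MidResource`) is hypothetical and not taken from the Koordinator code base. The resource names and the milli-CPU to shares/quota conversion are assumptions as well, and CPU normalization is left out entirely.

```go
package main

import "fmt"

// Tier describes one extended-resource family served by a hook.
// Hypothetical type for illustration; not Koordinator's actual API.
type Tier struct {
	Name        string // "batch" or "mid"
	CPUResource string // assumed extended resource name, in milli-CPU
	FeatureGate string // hypothetical gate that enables this tier's hook
}

var (
	BatchTier = Tier{Name: "batch", CPUResource: "kubernetes.io/batch-cpu", FeatureGate: "BatchResource"}
	MidTier   = Tier{Name: "mid", CPUResource: "kubernetes.io/mid-cpu", FeatureGate: "MidResource"}
)

// CgroupCPU holds the values a hook would write to the CPU cgroup.
type CgroupCPU struct {
	Shares  int64
	QuotaUs int64 // -1 means no limit
}

const (
	cfsPeriodUs  = 100000 // assumed CFS period of 100ms
	sharesPerCPU = 1024
	minShares    = 2
)

// milliCPUToShares converts a milli-CPU request to cpu.shares.
func milliCPUToShares(milli int64) int64 {
	shares := milli * sharesPerCPU / 1000
	if shares < minShares {
		return minShares
	}
	return shares
}

// milliCPUToQuota converts a milli-CPU limit to a CFS quota in microseconds.
func milliCPUToQuota(milli int64) int64 {
	if milli <= 0 {
		return -1
	}
	return milli * cfsPeriodUs / 1000
}

// ComputeCPU is the shared core: it knows nothing about batch or mid
// beyond the resource name carried by the Tier descriptor.
func ComputeCPU(t Tier, requests, limits map[string]int64) (CgroupCPU, bool) {
	req, hasReq := requests[t.CPUResource]
	lim, hasLim := limits[t.CPUResource]
	if !hasReq && !hasLim {
		return CgroupCPU{}, false // pod does not use this tier
	}
	return CgroupCPU{Shares: milliCPUToShares(req), QuotaUs: milliCPUToQuota(lim)}, true
}

// Hook binds a tier to a feature-gate check, so batch and mid can be
// switched on and off independently while sharing ComputeCPU.
type Hook struct {
	tier    Tier
	enabled func(gate string) bool
}

// Apply returns the cgroup values only if this tier's gate is enabled.
func (h Hook) Apply(requests, limits map[string]int64) (CgroupCPU, bool) {
	if !h.enabled(h.tier.FeatureGate) {
		return CgroupCPU{}, false
	}
	return ComputeCPU(h.tier, requests, limits)
}

func main() {
	gates := map[string]bool{"BatchResource": true, "MidResource": false}
	enabled := func(g string) bool { return gates[g] }

	requests := map[string]int64{"kubernetes.io/batch-cpu": 500, "kubernetes.io/mid-cpu": 2000}
	limits := map[string]int64{"kubernetes.io/batch-cpu": 1000, "kubernetes.io/mid-cpu": 2000}

	hooks := []Hook{
		{tier: BatchTier, enabled: enabled},
		{tier: MidTier, enabled: enabled},
	}
	for _, h := range hooks {
		cfg, ok := h.Apply(requests, limits)
		fmt.Printf("%s: applied=%v shares=%d quota=%d\n", h.tier.Name, ok, cfg.Shares, cfg.QuotaUs)
	}
}
```

In a layout like this the per-tier feature gate stays outside the shared function, which is one possible way to keep the differentiation Plan 1 would lose without the duplication of Plan 2. Whether the real hooks, reconcilers, and rules can be split along such a line is exactly the open question.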

tan90github commented 2 months ago

cc @zwzhang0107 @j4ckstraw @yangfeiyu20102011