kubewharf / godel-scheduler

a unified scheduler for online and offline tasks
Apache License 2.0

How is shared-state implemented? #37

Open katoomegumi opened 3 months ago

katoomegumi commented 3 months ago

According to the paper, godel-scheduler is a shared-state scheduler. Where can I find the implementation in the code? In particular, how is the global cluster view synchronized?

NickrenREN commented 3 months ago

It is based on the list-watch mechanism.

katoomegumi commented 2 months ago

Sorry, I've learned about list-watch but still have difficulties. The list-watch mechanism mainly works by monitoring events such as create, delete, etc. What puzzles me is how multiple schedulers get the global cluster view. Does every scheduler watch all of these events, so they don't need synchronization? Or do they synchronize to a central global cluster view at a certain frequency (real-time synchronization)? And which struct in the code serves as the central global cluster view? I'm not sure about that. Is it the commoncache in the binder struct, or the generationstore, or some other struct?

NickrenREN commented 2 months ago

@katoomegumi Each scheduler instance watches all events from the apiserver (etcd); they don't need to sync up with each other.

katoomegumi commented 2 months ago

@NickrenREN Thanks. I think it's impossible to sync the scheduler's cache for every single event, so I assume the code defines a time interval for syncing the cache from events. Is that true?

// pkg/scheduler/scheduler.go
// func Run
if utilfeature.DefaultFeatureGate.Enabled(features.SchedulerCacheScrape) {
    // The metrics agent scrapes the endpoint every 5s and flushes them to the
    // metrics server every 30s. To be more precise, scrape cache metrics every 5s.
    go wait.Until(func() {
        sched.commonCache.ScrapeCollectable(sched.metricsRecorder)
        sched.metricsRecorder.UpdateMetrics()
    }, 5*time.Second, sched.StopEverything)
}

NickrenREN commented 2 months ago

@katoomegumi No, the scheduler receives every event and reacts to it immediately (updating the cache and queue based on events). The code you posted is for collecting metrics, not the cache-syncing logic. BTW, godel scheduler is built on the basis of Kubernetes; it's worth spending more time on Kubernetes and etcd.

Wang-Xinkai commented 2 months ago

@NickrenREN Thanks for the reply. Actually, we are interested in the "watch delay" in godel scheduler, which refers to the duration between an event occurring in etcd (e.g., a cluster resource change) and each scheduler actually watching the event (updating its cache). It is obvious that with higher QPS and larger clusters, the "watch delay" would be more severe... Admittedly, it is an inherent problem of K8s itself, but we wonder if godel has made characterizations or specific optimizations of the "watch delay"?

FYI, the related discussion in k8s repo: https://github.com/kubernetes/kubernetes/issues/108556

NickrenREN commented 2 months ago

@Wang-Xinkai hello, we optimize the "event latency" from two aspects.

Wang-Xinkai commented 2 months ago

Thanks. I have checked the client-side optimizations. According to my understanding, with the dual-side optimizations on "event delay", a realistic Godel deployment has a real-time resource view of its corresponding sub-cluster? In that case, the event delay would relate only to the network communication cost between the apiserver and the scheduler.

NickrenREN commented 2 months ago

@Wang-Xinkai My understanding is: event latency depends on three parts: 1. apiserver and etcd processing efficiency; 2. the network condition between the apiserver and the client; 3. client-side processing efficiency.

We are now optimizing 1 and 3 to accelerate the event processing flow. But we can't guarantee that everything (on both the server and client sides) will always be OK, so we can't say the event delay relates only to the network communication cost between the apiserver and the scheduler.

Network condition has nothing to do with the k8s ecosystem, but in the future we can explore whether we can simplify the interaction between godel scheduler components. E.g., for now all godel scheduler components get events from the apiserver; could we let them talk to each other directly? ...

Wang-Xinkai commented 2 months ago

Interesting idea lol. I agree with you on the event-delay decomposition! Do you have a cursory estimate of the scale of the event delay (tens of ms, hundreds of ms, second-scale?) under normal scenarios and extreme high-load scenarios?

I mean, if the event delay is huge, the scheduler would be blind to some "free resources" in the cluster during the delay, which causes great waste of cluster resources (if there are tasks waiting to be scheduled). That's why we are interested in this metric. Thanks.

NickrenREN commented 2 months ago

@Wang-Xinkai IIUC, you are worried about the latency of Node resource update events?

Wang-Xinkai commented 2 months ago

Right, do you have any thoughts about this issue, or experience with the actual latency in realistic clusters? We suspect it affects the resource visibility of schedulers…

NickrenREN commented 2 months ago

@Wang-Xinkai In Kubernetes, different resources (nodes, pods...) have different event transmission links. The number of nodes is not that large, so node resources are less likely to cause latency issues. At least at ByteDance, we have never met this kind of problem (our largest single cluster size: 20k nodes, 1000k pods).

Wang-Xinkai commented 2 months ago

okay, thanks for your generous replies. We will use Godel to study more about shared-state schedulers. Keep in touch!

NickrenREN commented 2 months ago

@Wang-Xinkai Cool, if you have any question, feel free to reach out to me