golang / go

The Go programming language
https://go.dev

runtime: investigate possible Go scheduler improvements inspired by Linux Kernel's CFS #51071

Open hnes opened 2 years ago

hnes commented 2 years ago

First, please let me pay tribute to your contributions. You guys are awesome! And Go is so marvelous!

It has been more than ten years, and Go has already been very successful. So I think it is time to give Go a better scheduler -- maybe one like the CFS scheduler of the Linux kernel.

The scheduling strategy currently implemented by the Go runtime scheduler is basically round-robin. But in practical application scenarios, priority relationships may need to exist between different types of tasks, or even between tasks of the same type. For example, if we use Go to implement a storage engine, I/O tasks should generally have higher priority than CPU-intensive tasks, because the system bottleneck is more likely to be disk I/O, not CPU. For another example, we may use Go to implement a network service that must keep the latency of certain lightweight tasks low even while other goroutines are quite CPU intensive. However, with the current Go runtime implementation, if some number of CPU-intensive tasks need to run for a long time, scheduling latency inevitably suffers. Therefore, we need a truly mature Go scheduler, like the CFS scheduler in the kernel.

From another perspective, we can compare the thread model provided by the kernel with the goroutine model provided by Go. Beyond the inherent low-cost and high-concurrency advantages of goroutines over threads, some very useful mechanisms in the kernel thread model cannot be found in Go's goroutine model, at least not yet: for example, dynamic modification of the scheduling policy, and a priority mechanism that includes adaptive priority adjustment.

The scheduler in the initial versions of the Linux kernel, e.g. v0.01, was quite primitive. But as the kernel developed, more and more applications placed higher and higher requirements on the scheduler, and the kernel scheduler kept evolving into today's CFS scheduler.

That should also apply to Go. A truly mature scheduler will make Go even greater.

I have already developed a customized scheduler on top of the Go runtime as a demo. It shows that this proposal is feasible and that such a scheduler would benefit many applications with strict latency and throughput requirements.

Terminology

Event intensive task: spends most of its life cycle waiting for events (call it off-CPU) and only a very small part of its time doing CPU computation.

CPU intensive task: spends most or even all of its life cycle doing CPU computation (call it on-CPU) and only a very small part of its time, or no time at all, waiting for events.

Description

Basically, the idea of CFS scheduling is to give every thread a logical clock that records the thread's accumulated on-CPU time. Different priority settings make a thread's logical clock advance at different speeds. The CFS scheduler prefers to run the thread whose logical clock lags furthest behind, because it considers it unfair to let that thread fall even further behind; after all, CFS stands for Completely Fair Scheduler. So if a thread is event intensive, it ends up with a higher de facto priority, and if a thread is CPU intensive, it ends up with a lower de facto priority. We could call that adaptive priority adjustment. It is an important feature that keeps the scheduling latency of event-intensive threads low even when the system is under high CPU load due to CPU-intensive threads.
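To make that mechanism concrete, here is a minimal, self-contained Go sketch of the logical-clock idea, assuming a toy task type with a weight and a virtual runtime; the names, weights, and time slices are illustrative only and are not how Linux or the Go runtime actually implement this.

package main

import (
    "fmt"
    "time"
)

type task struct {
    name     string
    weight   float64 // higher weight = higher configured priority
    vruntime float64 // logical clock: weighted on-CPU nanoseconds
}

// account charges a slice of real on-CPU time to the task's logical clock.
// A heavier (higher-priority) task's clock advances more slowly; a CPU-bound
// task quickly accumulates vruntime and naturally yields to event-intensive
// tasks that rarely run.
func (t *task) account(onCPU time.Duration) {
    t.vruntime += float64(onCPU.Nanoseconds()) / t.weight
}

// pickNext returns the runnable task with the smallest virtual runtime,
// i.e. the one that has been treated least fairly so far.
func pickNext(runnable []*task) *task {
    best := runnable[0]
    for _, t := range runnable[1:] {
        if t.vruntime < best.vruntime {
            best = t
        }
    }
    return best
}

func main() {
    tasks := []*task{
        {name: "event-intensive", weight: 1},
        {name: "cpu-intensive", weight: 1},
    }
    // Simulate: the CPU-bound task uses a full 1ms slice every time it runs,
    // while the event-driven task only uses 10µs before blocking again.
    for i := 0; i < 5; i++ {
        next := pickNext(tasks)
        used := time.Millisecond
        if next.name == "event-intensive" {
            used = 10 * time.Microsecond
        }
        next.account(used)
        fmt.Printf("ran %-16s vruntime now %.0f ns\n", next.name, next.vruntime)
    }
}

Because the CPU-bound task accumulates vruntime much faster, the event-driven task is picked almost every time it becomes runnable, which is exactly the adaptive behavior described above.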

Although this demo only implements some part of CFS scheduling, the result is quite promising.

Demo

cpuworker/example/demo.go:

package main

import (
    "crypto/rand"
    "fmt"
    "hash/crc32"
    "log"
    _ "net/http/pprof"

    mathrand "math/rand"
    "net/http"
    "runtime"
    "time"

    "github.com/hnes/cpuworker"
)

var glCrc32bs = make([]byte, 1024*256)

// cpuIntensiveTask burns CPU by checksumming a 256 KiB buffer amt times.
func cpuIntensiveTask(amt int) uint32 {
    var ck uint32
    for i := 0; i < amt; i++ {
        ck = crc32.ChecksumIEEE(glCrc32bs)
    }
    return ck
}

// cpuIntensiveTaskWithCheckpoint does the same work, but invokes the
// checkpoint callback supplied by cpuworker after every iteration, giving
// the user-level scheduler a chance to account the elapsed CPU time and
// reschedule the task.
func cpuIntensiveTaskWithCheckpoint(amt int, checkpointFp func()) uint32 {
    var ck uint32
    for i := 0; i < amt; i++ {
        ck = crc32.ChecksumIEEE(glCrc32bs)
        checkpointFp()
    }
    return ck
}

func handleChecksumWithCpuWorkerAndHasCheckpoint(w http.ResponseWriter, _ *http.Request) {
    ts := time.Now()
    var ck uint32
    cpuworker.Submit1(func(checkpointFp func()) {
        ck = cpuIntensiveTaskWithCheckpoint(10000+mathrand.Intn(10000), checkpointFp)
    }).Sync()
    w.Write([]byte(fmt.Sprintln("crc32 (with cpuworker and checkpoint):", ck, "time cost:", time.Now().Sub(ts))))
}

func handleChecksumSmallTaskWithCpuWorker(w http.ResponseWriter, _ *http.Request) {
    ts := time.Now()
    var ck uint32
    cpuworker.Submit(func() {
        ck = cpuIntensiveTask(10)
    }).Sync()
    w.Write([]byte(fmt.Sprintln("crc32 (with cpuworker and small task):", ck, "time cost:", time.Now().Sub(ts))))
}

func handleChecksumWithoutCpuWorker(w http.ResponseWriter, _ *http.Request) {
    ts := time.Now()
    var ck uint32
    ck = cpuIntensiveTask(10000 + mathrand.Intn(10000))
    w.Write([]byte(fmt.Sprintln("crc32 (without cpuworker):", ck, "time cost:", time.Now().Sub(ts))))
}

func handleDelay(w http.ResponseWriter, _ *http.Request) {
    t0 := time.Now()
    wCh := make(chan struct{})
    go func() {
        time.Sleep(time.Millisecond)
        wCh <- struct{}{}
    }()
    <-wCh
    w.Write([]byte(fmt.Sprintf("delayed 1ms, time cost %s :)\n", time.Now().Sub(t0))))
}

func main() {
    rand.Read(glCrc32bs)
    nCPU := runtime.GOMAXPROCS(0)
    cpuP := cpuworker.GetGlobalWorkers().GetMaxP()
    fmt.Println("GOMAXPROCS:", nCPU, "DefaultMaxTimeSlice:", cpuworker.DefaultMaxTimeSlice,
        "cpuWorkerMaxP:", cpuP, "length of crc32 bs:", len(glCrc32bs))
    http.HandleFunc("/checksumWithCpuWorker", handleChecksumWithCpuWorkerAndHasCheckpoint)
    http.HandleFunc("/checksumSmallTaskWithCpuWorker", handleChecksumSmallTaskWithCpuWorker)
    http.HandleFunc("/checksumWithoutCpuWorker", handleChecksumWithoutCpuWorker)
    http.HandleFunc("/delay1ms", handleDelay)
    log.Fatal(http.ListenAndServe(":8080", nil))
}

The server (cpuworker/example/demo.go) is running on an AWS c5d.12xlarge instance with the environment variable GOMAXPROCS set to 16.

$ GOMAXPROCS=16 ./cpuworker-demo

GOMAXPROCS: 16 cpuWorkerMaxP: 12 length of crc32 bs: 262144

The benchmark tool is running on an AWS c5d.4xlarge instance. The two machines are in the same cluster placement group.

# please fill in the server IP
SeverIP=x.x.x.x

# cmd 1
while true; do sleep 1;ab -s10000000 -c 100 -n 60000 http://$SeverIP:8080/delay1ms; done

# cmd 2
while true; do sleep 1;ab -s10000000 -c 100 -n 10000 http://$SeverIP:8080/checksumWithoutCpuWorker; done

# cmd 3
while true; do sleep 1;ab -s10000000 -c 100 -n 10000 http://$SeverIP:8080/checksumWithCpuWorker; done

# cmd 4
while true; do sleep 1;ab -s10000000 -c 100 -n 10000 http://$SeverIP:8080/checksumSmallTaskWithCpuWorker; done

Step 1: Kill any already-running cmd x, then run cmd 1 (the standalone benchmark of delay1ms).

$ ab -s10000000 -c 100 -n 60000 http://$SeverIP:8080/delay1ms
This is ApacheBench, Version 2.3 <$Revision: 1879490 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 172.31.47.63 (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Completed 10000 requests
Finished 10000 requests

Server Software:
Server Hostname:        172.31.47.63
Server Port:            8080

Document Path:          /delay1ms
Document Length:        37 bytes

Concurrency Level:      100
Time taken for tests:   0.225 seconds
Complete requests:      10000
Failed requests:        1066
   (Connect: 0, Receive: 0, Length: 1066, Exceptions: 0)
Total transferred:      1538813 bytes
HTML transferred:       368813 bytes
Requests per second:    44413.06 [#/sec] (mean)
Time per request:       2.252 [ms] (mean)
Time per request:       0.023 [ms] (mean, across all concurrent requests)
Transfer rate:          6674.16 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.2      0       1
Processing:     1    2   0.4      2       4
Waiting:        1    2   0.4      1       4
Total:          1    2   0.5      2       5
ERROR: The median and mean for the waiting time are more than twice the standard
       deviation apart. These results are NOT reliable.

Percentage of the requests served within a certain time (ms)
  50%      2
  66%      2
  75%      2
  80%      2
  90%      3
  95%      3
  98%      4
  99%      4
 100%      5 (longest request)

Step 2: Kill any already-running cmd x, then run cmd 1 and cmd 2 simultaneously (the benchmark of delay1ms together with a very heavy CPU load scheduled by the original Go scheduler).

Current CPU load on the server side (note that the load average has already reached GOMAXPROCS, i.e. 16 in this case):

[screenshot: step2-server-load]

$ ab -s10000000 -c 100 -n 60000 http://$SeverIP:8080/delay1ms
This is ApacheBench, Version 2.3 <$Revision: 1879490 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 172.31.47.63 (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Completed 10000 requests
Finished 10000 requests

Server Software:
Server Hostname:        172.31.47.63
Server Port:            8080

Document Path:          /delay1ms
Document Length:        38 bytes

Concurrency Level:      100
Time taken for tests:   31.565 seconds
Complete requests:      10000
Failed requests:        5266
   (Connect: 0, Receive: 0, Length: 5266, Exceptions: 0)
Total transferred:      1553977 bytes
HTML transferred:       383977 bytes
Requests per second:    316.80 [#/sec] (mean)
Time per request:       315.654 [ms] (mean)
Time per request:       3.157 [ms] (mean, across all concurrent requests)
Transfer rate:          48.08 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       1
Processing:    50  314  99.3    293    1038
Waiting:       11  305 102.5    292    1038
Total:         50  314  99.3    293    1038

Percentage of the requests served within a certain time (ms)
  50%    293
  66%    323
  75%    353
  80%    380
  90%    454
  95%    504
  98%    604
  99%    615
 100%   1038 (longest request)

As we can see, the latency becomes very high and unstable.

Step 3: Kill any already-running cmd x, then run cmd 1 and cmd 3 simultaneously (the benchmark of delay1ms together with a very heavy CPU load scheduled by our own scheduler on top of the original Go scheduler).

Current CPU load on the server side (note that the load average is near cpuWorkerMaxP, i.e. 12 in this case; you can set this parameter yourself):

[screenshot: step3-server-load]

$ ab -s10000000 -c 100 -n 60000 http://$SeverIP:8080/delay1ms
This is ApacheBench, Version 2.3 <$Revision: 1879490 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 172.31.47.63 (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Completed 10000 requests
Finished 10000 requests

Server Software:
Server Hostname:        172.31.47.63
Server Port:            8080

Document Path:          /delay1ms
Document Length:        37 bytes

Concurrency Level:      100
Time taken for tests:   0.234 seconds
Complete requests:      10000
Failed requests:        1005
   (Connect: 0, Receive: 0, Length: 1005, Exceptions: 0)
Total transferred:      1538877 bytes
HTML transferred:       368877 bytes
Requests per second:    42655.75 [#/sec] (mean)
Time per request:       2.344 [ms] (mean)
Time per request:       0.023 [ms] (mean, across all concurrent requests)
Transfer rate:          6410.35 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.2      0       1
Processing:     1    2   0.5      2       4
Waiting:        1    2   0.4      2       4
Total:          1    2   0.5      2       5

Percentage of the requests served within a certain time (ms)
  50%      2
  66%      2
  75%      2
  80%      3
  90%      3
  95%      4
  98%      4
  99%      4
 100%      5 (longest request)

Now the latency is fine again, even though the server is running many CPU-intensive tasks!

Step 4: Kill any already-running cmd x, then run cmd 1, cmd 3, and cmd 4 simultaneously (the benchmarks of delay1ms and checksumSmallTaskWithCpuWorker together with a very heavy CPU load scheduled by our own scheduler on top of the original Go scheduler).

Current CPU load on the server side (note that the load average is near cpuWorkerMaxP, i.e. 12 in this case; you can set this parameter yourself):

[screenshot: step4-server-load]

$ ab -s10000000 -c 100 -n 60000 http://$SeverIP:8080/delay1ms

This is ApacheBench, Version 2.3 <$Revision: 1879490 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 172.31.47.63 (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Completed 10000 requests
Finished 10000 requests

Server Software:
Server Hostname:        172.31.47.63
Server Port:            8080

Document Path:          /delay1ms
Document Length:        37 bytes

Concurrency Level:      100
Time taken for tests:   0.238 seconds
Complete requests:      10000
Failed requests:        1038
   (Connect: 0, Receive: 0, Length: 1038, Exceptions: 0)
Total transferred:      1538857 bytes
HTML transferred:       368857 bytes
Requests per second:    42031.11 [#/sec] (mean)
Time per request:       2.379 [ms] (mean)
Time per request:       0.024 [ms] (mean, across all concurrent requests)
Transfer rate:          6316.39 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.2      0       1
Processing:     1    2   0.5      2       5
Waiting:        1    2   0.4      1       5
Total:          1    2   0.6      2       5
ERROR: The median and mean for the waiting time are more than twice the standard
       deviation apart. These results are NOT reliable.

Percentage of the requests served within a certain time (ms)
  50%      2
  66%      2
  75%      2
  80%      3
  90%      3
  95%      4
  98%      4
  99%      4
 100%      5 (longest request)

$ ab -s10000000 -c 100 -n 10000 http://$SeverIP:8080/checksumSmallTaskWithCpuWorker

This is ApacheBench, Version 2.3 <$Revision: 1879490 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 172.31.47.63 (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Completed 10000 requests
Finished 10000 requests

Server Software:
Server Hostname:        172.31.47.63
Server Port:            8080

Document Path:          /checksumSmallTaskWithCpuWorker
Document Length:        71 bytes

Concurrency Level:      100
Time taken for tests:   0.469 seconds
Complete requests:      10000
Failed requests:        9157
   (Connect: 0, Receive: 0, Length: 9157, Exceptions: 0)
Total transferred:      1889624 bytes
HTML transferred:       719624 bytes
Requests per second:    21333.56 [#/sec] (mean)
Time per request:       4.687 [ms] (mean)
Time per request:       0.047 [ms] (mean, across all concurrent requests)
Transfer rate:          3936.76 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.3      0       2
Processing:     1    4   3.3      3      13
Waiting:        1    4   3.3      3      13
Total:          2    5   3.4      3      13

Percentage of the requests served within a certain time (ms)
  50%      3
  66%      4
  75%      6
  80%      9
  90%     11
  95%     11
  98%     12
  99%     12
 100%     13 (longest request)

The latency of both is fine :-)

This demonstration is only meant to prove that the proposal is feasible and would bring obvious benefits to applications that care about latency. Moreover, for many applications throughput is directly affected by latency, so this proposal can also improve the throughput of those applications.

Proposal

Option 1: Bring Go a better scheduler, just like the Linux kernel's CFS scheduler, with support for adaptive priority adjustment. A goroutine-level Setpriority, analogous to syscall.Setpriority, could also be supported; it would only influence how fast a goroutine's logical clock advances.

Option 2: Give users the ability to customize their own scheduler without needing to modify their existing Go code. I don't have a very detailed design for this yet and look forward to your ideas.

You might say that one could achieve this goal in the application layer, the way cpuworker does, but for most applications that already have a very large amount of Go code such a change is just too expensive, and doing it in the runtime would be more efficient. Therefore, I tend to hope that Go could provide a similar mechanism rather than expecting users to achieve this goal themselves in the application layer.

Thanks a lot for taking the time to read this proposal :-)

mdlayher commented 2 years ago

There are no API changes here so this isn't really a proposal. Retitling to reflect the intent as mentioned in Option 1 above.

ianlancetaylor commented 2 years ago

CC @aclements @mknyszek @prattmic

Note that any changes to the Go scheduler are complicated by the fact that goroutines are running on top of threads that are managed by the Linux scheduler.

prattmic commented 2 years ago

Reworking the scheduler is something we've thought about on and off for quite a while. I haven't taken a close look at your demo scheduler yet, but it looks nifty.

In addition to the issues you describe with IO-intensive tasks, overall CPU overhead of the scheduler is something that comes up frequently as problematic for applications that happen to stress corners of the scheduler. So another point to consider.

Any major change we make to the scheduler will likely need to be paired with an investment in better testing of the scheduler. The current scheduler has evolved to become quite complex, but with few explicit, deterministic tests due to the difficult nature of testing a scheduler deep in the runtime, instead relying on larger Go programs to reveal problems. This makes it difficult to have high confidence in major changes, so investing in better testing is definitely something I'd want to see (whether or not we make large changes to the scheduler, but definitely if we do).

hnes commented 2 years ago

Reworking the scheduler is something we've thought about on and off for quite a while. I haven't taken a close look at your demo scheduler yet, but it looks nifty.

Thanks a lot, @prattmic.

In addition to the issues you describe with IO-intensive tasks, overall CPU overhead of the scheduler is something that comes up frequently as problematic for applications that happen to stress corners of the scheduler. So another point to consider.

Any major change we make to the scheduler will likely need to be paired with an investment in better testing of the scheduler. The current scheduler has evolved to become quite complex, but with few explicit, deterministic tests due to the difficult nature of testing a scheduler deep in the runtime, instead relying on larger Go programs to reveal problems. This makes it difficult to have high confidence in major changes, so investing in better testing is definitely something I'd want to see (whether or not we make large changes to the scheduler, but definitely if we do).

Yes, I agree with you. I investigated the performance of a Go priority heap:

 Tested on RockyLinux 8.5, AMD 5950x

 1,000,000 max size of the heap

 push with random priority 
    ~ 30 ns/op - 50.9 ns/op
 pop
    ~ 4.2 ns/op - 4.5 ns/op

Such cost might be affordable.
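For illustration, here is a minimal sketch of the kind of min-heap those measurements refer to, built on the standard container/heap package and keyed on a task's logical clock; the gTask type and its fields are assumptions made for this example, not runtime internals.

package main

import (
    "container/heap"
    "fmt"
)

type gTask struct {
    id       int
    vruntime int64 // logical on-CPU time; lower = more behind = run sooner
}

// runQueue is a min-heap ordered by vruntime.
type runQueue []*gTask

func (q runQueue) Len() int            { return len(q) }
func (q runQueue) Less(i, j int) bool  { return q[i].vruntime < q[j].vruntime }
func (q runQueue) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *runQueue) Push(x interface{}) { *q = append(*q, x.(*gTask)) }
func (q *runQueue) Pop() interface{} {
    old := *q
    n := len(old)
    t := old[n-1]
    *q = old[:n-1]
    return t
}

func main() {
    q := &runQueue{}
    heap.Init(q)
    // Enqueue tasks with different accumulated logical clocks.
    heap.Push(q, &gTask{id: 1, vruntime: 5000})
    heap.Push(q, &gTask{id: 2, vruntime: 100}) // far behind, should run first
    heap.Push(q, &gTask{id: 3, vruntime: 2500})
    for q.Len() > 0 {
        t := heap.Pop(q).(*gTask)
        fmt.Printf("run g%d (vruntime=%d)\n", t.id, t.vruntime)
    }
}

Both push and pop are O(log n), which is consistent with the nanosecond-scale costs measured above.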

I think maybe we should decouple the specific scheduler implementation from the rest of the runtime, as the Linux kernel does:

(https://github.com/torvalds/linux/blob/e6251ab4551f51fa4cee03523e08051898c3ce82/kernel/sched/sched.h#L2123-L2186)

// definition of `sched_class` in kernel 
struct sched_class {

#ifdef CONFIG_UCLAMP_TASK
    int uclamp_enabled;
#endif

    void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
    void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
    void (*yield_task)   (struct rq *rq);
    bool (*yield_to_task)(struct rq *rq, struct task_struct *p);

    void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);

    struct task_struct *(*pick_next_task)(struct rq *rq);

    void (*put_prev_task)(struct rq *rq, struct task_struct *p);
    void (*set_next_task)(struct rq *rq, struct task_struct *p, bool first);

#ifdef CONFIG_SMP
    int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
    int  (*select_task_rq)(struct task_struct *p, int task_cpu, int flags);

    struct task_struct * (*pick_task)(struct rq *rq);

    void (*migrate_task_rq)(struct task_struct *p, int new_cpu);

    void (*task_woken)(struct rq *this_rq, struct task_struct *task);

    void (*set_cpus_allowed)(struct task_struct *p,
                 const struct cpumask *newmask,
                 u32 flags);

    void (*rq_online)(struct rq *rq);
    void (*rq_offline)(struct rq *rq);

    struct rq *(*find_lock_rq)(struct task_struct *p, struct rq *rq);
#endif

    void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
    void (*task_fork)(struct task_struct *p);
    void (*task_dead)(struct task_struct *p);

    /*
     * The switched_from() call is allowed to drop rq->lock, therefore we
     * cannot assume the switched_from/switched_to pair is serialized by
     * rq->lock. They are however serialized by p->pi_lock.
     */
    void (*switched_from)(struct rq *this_rq, struct task_struct *task);
    void (*switched_to)  (struct rq *this_rq, struct task_struct *task);
    void (*prio_changed) (struct rq *this_rq, struct task_struct *task,
                  int oldprio);

    unsigned int (*get_rr_interval)(struct rq *rq,
                    struct task_struct *task);

    void (*update_curr)(struct rq *rq);

#define TASK_SET_GROUP      0
#define TASK_MOVE_GROUP     1

#ifdef CONFIG_FAIR_GROUP_SCHED
    void (*task_change_group)(struct task_struct *p, int type);
#endif
};

This would not only be convenient for testing, but would also make it easy for users to add their own scheduler implementation if they want.
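As a rough illustration of what that decoupling could look like on the Go side, here is a hypothetical interface loosely mirroring sched_class. None of these types or hooks exist in the runtime today, and the real g/p/m structures would of course differ; this is only a sketch of the shape such an interface might take.

package sched

import "time"

// G and P stand in for the runtime's goroutine and processor structures.
type G struct{ ID int64 }
type P struct{ ID int32 }

// Scheduler is the pluggable policy: the rest of the runtime would call
// these hooks and never look inside the policy's run queues directly.
type Scheduler interface {
    // Enqueue makes g runnable on p (e.g. after creation or unblocking).
    Enqueue(p *P, g *G)
    // PickNext chooses the next goroutine to run on p, or nil if none.
    PickNext(p *P) *G
    // Tick is called periodically (e.g. by sysmon) to account CPU time
    // and decide whether the running goroutine should be preempted.
    Tick(p *P, running *G, onCPU time.Duration) (preempt bool)
    // PutPrev is called when a goroutine stops running (yield, block, preempt).
    PutPrev(p *P, g *G)
    // Steal lets an idle p take work from another p for load balancing.
    Steal(idle, victim *P) *G
}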

tommie commented 2 years ago

Thank you for working on this.

I know nothing of the Go scheduler, but a few questions came to mind when reading your proposal:

- I suspect goroutines are much shorter-lived (i.e. more fine-grained) than kernel threads. Does that affect the applicability of the Linux approach in Go?
- Would this benefit from grouping goroutines by inferred "type of work" to bootstrap the priority?
- If goroutines are indeed short-lived, does that increase the possibility of starving old goroutines?

tfaughnan commented 2 years ago

Maybe MLFQ could be of use? Goroutines of the same priority level would be run in a round robin style, perhaps not too different from the current approach. Demotions would occur after time allotments are used up. And periodic priority boosts could prevent starvation for longer-running tasks.
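For illustration, here is a minimal sketch of the MLFQ idea described above, assuming three priority levels, a fixed per-level time allotment, round-robin within a level, and a periodic boost; all names and constants are made up for the example.

package main

import "fmt"

const (
    levels    = 3
    allotment = 2 // slices a task may use at one level before demotion
)

type task struct {
    name     string
    cpuBound bool // true if the task always uses its full time slice
    used     int  // slices used at the current level
}

type mlfq struct {
    queues [levels][]*task
}

func (m *mlfq) enqueue(level int, t *task) { m.queues[level] = append(m.queues[level], t) }

// pick takes the first task from the highest non-empty level; tasks within a
// level are served round-robin because they are re-appended after running.
func (m *mlfq) pick() (int, *task) {
    for lvl := 0; lvl < levels; lvl++ {
        if len(m.queues[lvl]) > 0 {
            t := m.queues[lvl][0]
            m.queues[lvl] = m.queues[lvl][1:]
            return lvl, t
        }
    }
    return -1, nil
}

// run executes one time slice. A task that blocks before its slice is used
// up (the event-bound one here) keeps its level; a task that exhausts its
// allotment is demoted one level.
func (m *mlfq) run(lvl int, t *task) {
    if !t.cpuBound {
        m.enqueue(lvl, t)
        return
    }
    t.used++
    if t.used >= allotment && lvl < levels-1 {
        t.used = 0
        m.enqueue(lvl+1, t) // demote the CPU-hungry task
        return
    }
    m.enqueue(lvl, t)
}

// boost periodically moves everything back to the top level so that
// long-running tasks cannot starve.
func (m *mlfq) boost() {
    for lvl := 1; lvl < levels; lvl++ {
        for _, t := range m.queues[lvl] {
            t.used = 0
            m.enqueue(0, t)
        }
        m.queues[lvl] = nil
    }
}

func main() {
    m := &mlfq{}
    m.enqueue(0, &task{name: "cpu-bound", cpuBound: true})
    m.enqueue(0, &task{name: "event-bound"})
    for i := 0; i < 6; i++ {
        lvl, t := m.pick()
        fmt.Printf("tick %d: run %s at level %d\n", i, t.name, lvl)
        m.run(lvl, t)
    }
    m.boost()
    fmt.Println("boost: all tasks moved back to level 0")
}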

adyanth commented 2 years ago

I am not aware of how the interactions between the two schedulers (Go vs. the OS) are tested now, but making the two behave similarly might have drastic effects, causing them to work against each other (I can think of TCP-over-TCP meltdown). That means this would be a slow process.

Meanwhile, a plug-and-play scheduler implementation would make switching schedulers, or making large changes to one, a separate concern, with the added benefit of easier testing. Maybe the community will surprise us all with a whole bunch of scheduling ideas. My 2¢

singhpradeep commented 2 years ago

@adyanth While your concern is fair, I believe this may not be the case eventually. Kernel threads are oblivious to the Go runtime and to what kind of code a thread is running (ignoring I/O wait and CPU fairness). A new runtime scheduler is, in turn, a pure userspace construct; how it manages its own goroutines should not compete with the kernel scheduler, which operates on the other side of the system boundary.

hnes commented 2 years ago

Thank you for working on this.

It's very kind of you, @tommie :)

I know nothing of the Go scheduler, but a few questions came to mind when reading your proposal: I suspect goroutines are much shorter-lived (i.e. more fine-grained) than kernel threads. Does that affect the applicability of the Linux approach in Go?

I'm afraid that would not make a big difference. Round-robin is one scheduling strategy, and so is CFS. The key difference between them is that CFS does adaptive priority adjustment and tries to achieve completely fair scheduling. As we saw in the demo above, if we run a number of event-intensive tasks and CPU-intensive tasks together under a round-robin strategy, it will never be fair to the event-intensive tasks, because the CPU-intensive tasks consume much more CPU time.

Would this benefit from grouping goroutines by inferred "type of work" to bootstrap the priority?

Do you mean something like go fp() type-of-work-or-priority, which would set a constant priority when the goroutine is created? Yes, there would be some benefit. But that would not be very friendly to Go applications that already have a very large codebase. Furthermore, think about hybrid tasks that combine the event and CPU types, or tasks that change their type from time to time. So I think it's best to give the runtime this job; then our applications do not need to change a single line of code.

I think the API design of the kernel's scheduler-related syscalls is also worth learning from:

setpriority        // set the "literal" priority; the scheduler can then adaptively adjust the real priority based on it
sched_setscheduler // set the scheduling strategy, e.g. RR or CFS
sched_setparam     // tuning
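Purely as an illustration of what goroutine-level analogues of those calls might look like, here is a hypothetical API sketch; neither this package nor any of these functions exist in the Go runtime today.

package schedapi

// Policy selects the scheduling strategy for a goroutine, analogous to
// sched_setscheduler choosing between RR, FIFO, CFS, etc.
type Policy int

const (
    PolicyRoundRobin Policy = iota // today's behaviour
    PolicyFair                     // CFS-like weighted fair scheduling
)

// SetPriority sets the "literal" priority (nice value) of the calling
// goroutine; the runtime would still adjust the effective priority
// adaptively, as setpriority(2) does for threads.
func SetPriority(nice int) { /* hypothetical runtime hook */ }

// SetPolicy switches the calling goroutine's scheduling policy,
// analogous to sched_setscheduler(2).
func SetPolicy(p Policy) { /* hypothetical runtime hook */ }

// SetParam tunes policy-specific parameters (e.g. time-slice length),
// analogous to sched_setparam(2).
func SetParam(timeSliceMicros int) { /* hypothetical runtime hook */ }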

If goroutines are indeed short-lived, does that increase the possibility of starving old goroutines?

To be honest, I thought about this problem myself before, and my conclusion is that such a problem is unlikely to happen in a real production environment. And if it does happen, it means the system is already badly overloaded.

Please take a look at the benchmark below:

func BenchmarkIntChan(b *testing.B) {
    ch := make(chan int)
    go func() {
        for {
            <-ch
        }
    }()
    for i := 0; i < b.N; i++ {
        ch <- 1
    }
}

func BenchmarkIntChanX(b *testing.B) {
    ch := make(chan int)
    go func() {
        for {
            <-ch
        }
    }()
    for i := 0; i < b.N; i++ {
        myCh := make(chan int)
        go func() {
            ch <- 1
            myCh <- 1
        }()
        <-myCh
    }
}

Which yields:

$ # rockyLinux 8.5 & AMD 5950x
$ GOMAXPROCS=1 go test -bench=.
goos: linux
goarch: amd64
pkg: echo
cpu: AMD EPYC-Milan Processor
BenchmarkIntChan         8433139               139.5 ns/op
BenchmarkIntChanX        2393584               495.8 ns/op

Let's assume it takes about 1µs of on-CPU time to run one of those short-lived goroutines, and that we have 32 logical cores; then we could handle ~32,000,000 op/sec, which is a very heavy load. If such applications really exist, it means either that the design of the application is not very good, or that other components (kernel, network, disk, ...) will reach their bottlenecks before Go does. This benchmark also shows the excellent performance of goroutines and channels.

hnes commented 2 years ago

If goroutines are indeed short-lived, does that increase the possibility of starving old goroutines?

@tommie, and furthermore, there are strategies both in CFS and MLFQ (which were mentioned by @tfaughnan above) to prevent this from happening.

hnes commented 2 years ago

Maybe MLFQ could be of use? Goroutines of the same priority level would be run in a round robin style, perhaps not too different from the current approach. Demotions would occur after time allotments are used up. And periodic priority boosts could prevent starvation for longer-running tasks.

Very good point, @tfaughnan. My only concern is that a mature MLFQ may not be much simpler than CFS, and may even be more complex, so I think an interface like the kernel's sched_class could be added to decouple the specific scheduler implementation from the scheduler interface. That way, Go could even support multiple different schedulers.

hnes commented 2 years ago

I am not aware of how the interactions between the two schedulers (Go vs. the OS) are tested now, but making the two behave similarly might have drastic effects, causing them to work against each other (I can think of TCP-over-TCP meltdown). That means this would be a slow process.

Hi @adyanth, I agree with @singhpradeep's reply above. The Go runtime scheduler runs on top of the kernel scheduler, so they are not in a competitive relationship; furthermore, they could "work together".

tommie commented 2 years ago

I wrote a similar package to play with different scheduling algos: https://github.com/tommie/go-coopsched

Assuming the code works, and that Waitiness is a reasonable simplification of calcEIfactorAndSumbitToRunnableTaskQueue, it does seem to lower the wait overhead (i.e. the time from Sleep being done to the task running again).

ucirello commented 2 years ago

I'd like to post here my partial skepticism for this change. I believe as a goal, it might be a good thing. I am not equipped to evaluate whether adopting CFS-scheduler algorithm in the goroutine clockwork is a good idea moving forward or not.

What I would like to point out is that Go has a surprisingly small number of adjustment knobs considering the complexity of the runtime (we do see a lot of knobs tweaking compilation or parts of the standard library's behavior, though). I might be forgetting something here, but to the best of my recollection we have just GOGC, GOMAXPROCS, GORACE, and GOTRACEBACK, and these are all scalar values.

That leads to a significantly easier experience from an operations standpoint. Therefore, adopting a custom goroutine scheduler interface, or adding "just another switch" for an alternative operation mode, might cut against the grain of Go's long-standing stance of having as few knobs as possible.

In other words, CFS Scheduler good. More knobs, I am not so sure.

tommie commented 2 years ago

Agreed about Go's lack of knobs being a great feature.

Here's a self-contained cpuworker/example/demo.go. The percentile printout isn't as nice as ab, but it seems to show @hnes's point just by looking at the value ranges.

I wonder if this could be added on top of the existing runq FIFO. If priority levels are fairly coarse, the lowest-priority g's could stay in runq while the others sit in a priority queue between runq and globrunq. That way, the top of the priority queue could just be batch-taken into runq as needed (see the sketch below). Introducing that priority queue would leave the field open for experimenting with different algorithms for selecting priority, and yet would be easy to disable until the feature is finished.
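As a rough sketch of that idea, assuming made-up stand-ins for the runtime's p, g, runq, and globrunq structures, the priority tier could be drained into the FIFO in batches like this:

package main

import (
    "fmt"
    "sort"
)

type g struct {
    id   int
    prio int // lower value = more urgent
}

type p struct {
    runq     []*g // existing FIFO run queue
    prioTier []*g // proposed priority queue feeding runq
}

// refill batch-takes up to n of the highest-priority goroutines from the
// priority tier into the FIFO runq.
func (pp *p) refill(n int) {
    sort.Slice(pp.prioTier, func(i, j int) bool { return pp.prioTier[i].prio < pp.prioTier[j].prio })
    if n > len(pp.prioTier) {
        n = len(pp.prioTier)
    }
    pp.runq = append(pp.runq, pp.prioTier[:n]...)
    pp.prioTier = pp.prioTier[n:]
}

// findRunnable drains the FIFO as today, refilling it from the priority
// tier whenever it runs dry.
func (pp *p) findRunnable() *g {
    if len(pp.runq) == 0 {
        pp.refill(2)
    }
    if len(pp.runq) == 0 {
        return nil // would fall back to the global queue / work stealing here
    }
    gr := pp.runq[0]
    pp.runq = pp.runq[1:]
    return gr
}

func main() {
    pp := &p{prioTier: []*g{{id: 1, prio: 5}, {id: 2, prio: 1}, {id: 3, prio: 3}}}
    for gr := pp.findRunnable(); gr != nil; gr = pp.findRunnable() {
        fmt.Printf("run g%d\n", gr.id)
    }
}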

sgj100 commented 2 years ago

If the internal workings of the Go scheduler are to be reconsidered, then an obvious question arises: should any new scheduler take account of the heterogeneous nature of modern CPUs? Arm has had big.LITTLE for over ten years, Apple's M1 processors have Firestorm and Icestorm cores, and Intel Alder Lake CPUs have Golden Cove high-performance cores and Gracemont power-efficient cores.

I too appreciate the lack of "knobs" in the Go runtime but the asymmetric nature of current CPUs may require an extra knob or two to schedule Goroutines as efficiently as possible.

josharian commented 2 years ago

Naive question: Would this interact at all with package sync's fairness algorithms?

OmidHekayati commented 2 years ago

I think these two behaviors of the Go scheduler must be fixed in the new version (if there is a new scheduler):

Neither suggestion changes Go's behavior, and existing Go apps will work out of the box; they just change the way advanced developers write frameworks so that software can be developed in Go in a better way.

AlekSi commented 1 year ago

See also #33803