occlum / ngo

Next-Gen Occlum, a work-in-progress fork of Occlum that is optimized for the next-generation of Intel SGX (on Xeon SP processors)

Other

33 stars 17 forks source link

[RFC] Add scheduler to the Rust async runtime #18

Open ShuochengWang opened 3 years ago

ShuochengWang commented 3 years ago

The Problem

NGO uses Rust async / await to reconstruct the code and realize asynchronous. Because Rust async / await needs to be executed through Rust async runtime, NGO has implemented a basic version of Rust async runtime, namely async-rt.

With async-rt, NGO implements M : N : P in-enclave scheduling. Specifically, there are M user threads, N LibOS coroutines, and P host threads. The executor of Rust async runtime runs on P host threads (that is, P Linux threads). The executor schedules N LibOS coroutines (that is, N Rust async tasks). N LibOS coroutines contain M user threads and N - M LibOS system tasks.

However, there is no scheduler for async-rt at present. The executor of async-rt can only run tasks in its local queue. This will lead to the following problems for NGO:

Unbalanced load: some host threads are busy, while others are idle. This may result in increased program latency.
Waste of computing power: if the host thread is idle, it will continuously poll the local queue, which will waste CPU computing power.
No priority: in NGO, some tasks have lower priority, such as some LibOS background tasks, but async-rt can't distinguish priorities at present.

In addition, in async-rt, if the task is in the pending state, the corresponding waker will be set. Invoking waker.wake() will rejoin the task to the local queue. If wake() is invoked repeatedly (It is feasible at the API level), the queue will contain multiple identical tasks, which will affect performance and scheduling.

Design Goals

Add scheduler to the Rust async runtime async-rt and make async-rt a production-level Rust async runtime with no_std. The scheduler should meet the following requirements:

[ ] load balancing
[x] save computing power
[ ] support priority
[ ] avoid repetitive tasks

The priorities of these requirements decrease in turn. We will implement these requirements in order of priority.

ShuochengWang commented 3 years ago

The results of the discussion with @tatetian :

idle vcpu (the thead of the async runtime) should sleep and wakeup later. When the vcpu's run_queue is empty, the vcpu need sleep some time, eg, sleep 50ms. We can use eventfd (better) or futex in occlum to implement this feature. However, when the run_queue is empty, there may be some pending tasks on this vcpu, other vcpu need to invoke wake(), insert the task to the run_queue, and wakeup the corresbonding vcpu.
performance problem of io_uring callback. In current ngo implementation, we set a sched_callback (io_uring callback) before init async runtime, then each vcpu will invoke sched_callback (eg, poll io_uring completion queue) in each event loop. Actually, It is not necessary for all vcpu to poll io_uring queue. We can use only one vcpu, or several vcpu to poll the io_uring queue. We can deprecated sched_callback, and treat polling io_uring queue as a async task.
[RFC] Config the number vCPUs #16 . Implement this RFC
sleep function for user task. User task maybe want to sleep some time, eg, timer. Since async-rt is a no_std runtime, there is no time function in no_std env. We can implement sleep function in ngo libos.

ShuochengWang commented 3 years ago

Related Work

async-executor
- global queue + local queues + steal. All tasks will be firstly pushed to global queue. When a thread tries to get a task, it will find tasks from its local queue. If local queue is empty, the thread will try to steal half of the tasks from the global queue. If the global queue is still empty, then the thread will try to steal tasks from other thread's local queue (select a random thread and try to steal half of the tasks from the local queue. If failed, try next thread.)
crossbeam-deque
- Concurrent work-stealing deques. most commonly used in work-stealing schedulers.
- Worker + Injector + Stealer. Each thread corresponds to a worker. And each worker has a corresponding stealer, this stealer can steal tasks from its worker. Injector is a queue shared by threads. Hence, worker is the local queue, injector is the global queue. get tasks from local worker's queue => get tasks from global injector's queue => get tasks from stealer (other worker's queue)
tokio
- local queue + Inject + lifo_slot + Steal. Similar to crossbeam-deque, inject is global queue. 1. Tasks will be pushed to local queue. if local queue is full, push tasks to global queue. 2. Get tasks from the local queue first most of the time. 3. each thread has a lifo_slot, which with higher priority than local_queue. 3. If local queue is empty, steal tasks from other local queues first. if steal failed, steal from global queue.

ShuochengWang commented 3 years ago

Our Design Goals

A Rust async runtime with no_std, for LibOS, In particular, for SGX LibOS.

Take ngo as an example, ngo need rust async runtime with following features:

Idle thread sleep
Load balancing
Support priority

Idle thread sleep (Already supported)

If one thread is idle, this thread will sleep. When a new task is ready, the sleeping thread will be waked up.

Load balancing

Assign tasks to each thread in a balanced manner
Threads with fewer workload should execute some workload on other threads
Try to avoid starvation
Can not violate the cpu affinity.

Support priority

There are different kinds of tasks in LibOS, e.g. user tasks and LibOS tasks. User tasks generally have higher priority, while LibOS tasks maybe running in the background and with lower priority.

Distinguish different kinds of tasks.
Support priority in the scheduler
Shutdown background tasks correctly. (background tasks may be infinite loop pattern, we need shutdown it correctly before exits)

ShuochengWang commented 3 years ago

Design Overview

Firstly, thanks to tokio's blog https://tokio.rs/blog/2019-10-scheduler, those ideas are very usefull.

Basic Idea

local queues: each thread has a local queue, in particular, a bounded mpmc / spmc queue.
global queue: a unbounded mpmc queue. Since the local queue is bounded (fixed size), the global queue is used to store tasks when the local queue is full. The implementation of global queue in tokio is a intrusive linked list guarded by mutex.
steal: if one thread's local queue is empty, it can steal tasks from other threads' local queue and global queue.

Task

Task priority

Task priority affects how to select tasks in the thread.

default (mid) : If not specify priority, use default. User tasks usually are default priority.
high
low

Task type

Task type affects how to assign task to threads.

default : if not specify type, use default. User tasks usually are default type.
blocking : a task that will block for a time. If one thread executes a blocking task, other tasks on this thread may starve. Hence, when accept a blocking task to one thread, we will move all tasks on this thread to global queue to avoid starvation.
background : e.g. LibOS tasks. These tasks may be keep running until runtime shutdown.

CPU affinity

In our async runtime, the thead is called vcpu. Since user can bind threads to specific cpu set in OS, we allow user to bind tasks to specific vcpu set.

If one task is bound to a vcpu set, then the task can only run in those vcpus. vcpu outside the set can not steal the task.

Run queue

Each thread has three local queues. All threads share one global queue.

high priority queue
mid priority queue
low priority queue
global queue

ShuochengWang commented 3 years ago

Assign tasks to threads

When accept one task, we need to decide which thread to run the task.

In the assign algorithm, we consider following factors:

If the task is new or not: If the task is not new (a wake up task), that is, the task has been scheduled before, we will try to schedule the task to the previous thread.
cpu affinity: we need meet the cpu affinity.
task type: If the type is blocking, we need try to find a thread that has relatively few tasks, and we need to move those tasks to other threads. And we need consider the cpu affinity.
local queue lengh: according to the task's priority, we find a thread with short corresponding local queue.
wait time: when each task is assigned, we record a assign_tick. And we record a global_tick. wait_time = global_tick - queue.front().assign_tick. we find a thread with short wait time.

ShuochengWang commented 3 years ago

Select one task to run

high probability: select from high_priority queue.
medium probability: select form mid_pri queue.
low probability: select from low_pri queue.
very low probability: select from global queue.

If first try failed (the queue is empty), try other local queues.
If still failed, try to steal.

Steal

select one random thread to steal half of tasks.
do not steal tasks with invalid cpu affinity.
If steal failed, try next thread. (try k times)
If steal still failed, try steal global queue.

ShuochengWang commented 3 years ago

Optimization: Temporarily increase priority

If a mid_pri default_type task is wake up, insert it to high_pri quue temporarily to reduce delay.

Optimization: Throttle stealing (tokio's opt)

Limit the max number of concurrent threads performing steal to reduce contention.