
Design: alternative scheduling models #32

njsmith opened this issue 7 years ago

njsmith commented 7 years ago

Currently we use a simple round-robin scheduling strategy. It isn't technically FIFO, because we randomize the order of execution within each "batch" in an attempt to force people not to make too many assumptions about the scheduling strategy :-). But it's basically FIFO in terms of sophistication.

Can we do better?

There's a rich literature on fancy scheduling, e.g. the weighted-fair queuing used by the Linux kernel's CFS ("completely fair scheduler"). And there are a lot of challenges in building async applications that come down to scheduling issues (e.g., #14 and https://stackoverflow.com/questions/40601119). But AFAICT a lot of the existing scheduling literature assumes that you have a pre-emptive scheduler; there's very little out there on applying these ideas to cooperative scheduling systems. In some ways the network packet scheduling literature is more relevant to us, because packets come in different sizes and the scheduler doesn't get to change that. (OTOH, packet scheduling algorithms tend to assume you can look into the future and predict how long a packet will spend transmitting, whereas when we start a task step we don't know how long it will run before yielding.)

Does it matter though?

It's not entirely clear that "fairness" is actually the solution to the problems linked above! Fair task scheduling is in part about negotiating between somewhat adversarial users (e.g. different TCP connections fighting for their share of a link), which isn't as obvious a fit to the tasks in our system. Though OTOH those tasks may be doing work on behalf of different competing users. And scheduler algorithms only matter when a program is CPU-bound, which hopefully trio programs usually aren't. But OTOH even programs that are I/O-bound overall will become CPU-bound in bursts, and it's exactly when things are melting down under load that you'd most like to handle things gracefully. OTOOH the real solution is often going to be something like load shedding or otherwise rate-limiting incoming work; better scheduling isn't a silver bullet to fix load issues.

So, I really don't know whether this is actually a good/useful idea for trio or not. But it might be a place where we can make a substantial difference to trio programs' usability in a central, principled way, so that seems worth exploring!

Options

In principle, a WFQ scheduler is actually very simple to implement (see below). What's not so clear is whether this would actually help in practice. There are two obvious issues:

Maybe what we really need is better knobs to let users set the priority of different tasks. (WFQ makes it easy to implement relative priorities, and relatively easy to implement hierarchical scheduling policies; it's also pretty straightforward to implement strictly tiered priorities like the Linux kernel's realtime priorities.) But "here's 100 knobs to tune, try tweaking some whenever your server performance degrades" is a pretty bad UX too. There's something to be said for the predictability of FIFO.

In general, this is a deep problem that will definitely require experience and data and visualizations of real apps on top of trio.

Prior art

Ironically, the Python GIL is essentially a cooperative scheduling system (though a weird one), and probably the one with the best public analysis! Interesting documents:

The classic Mac OS used cooperative scheduling heavily. This was known as the "thread manager". I'm very curious what their scheduling strategy looked like, but I don't know how well documented it is. The "Inside Macintosh: Thread Manager" book (PDFs are floating around on the web) might have more details. [Edit: there's now some notes on this on the reading list wiki page.]

Alternatively...

Alternatively, if we decide that we don't want a fancy scheduling system, we could go the other way, and actually guarantee deterministic FIFO scheduling, on the theory that determinism is generally a nice thing if you aren't losing anything else to get it.

njsmith commented 7 years ago

Weighted fair queuing

Weighted fair queuing (WFQ) is super elegant and simple, but I found it hard to work this out from the literature (partly because our context is a bit weird). So here are my notes on how it works.

The goal is to emulate an ideal multi-tasking CPU (in the academic literature this is called the "generalized processor sharing" (GPS) model). When an ideal multi-tasking CPU has N tasks that want to run, it splits itself into N mini-CPUs that each run at 1/N times the speed of the original CPU, and then each task runs on its own mini-CPU. So if there are N runnable tasks, then over any time span of T seconds they each get (T/N) seconds of CPU time. It's very simple, but, of course, reality doesn't work this way – we only have 1 CPU and our tasks actually have to take turns. So the idea is that we keep track of how much CPU time each task got on the real CPU, versus how long it should have gotten on its mini-CPU, and try to make them match up over time on average.

So this suggests a naive implementation of WFQ: whenever a task gets added to the run queue (i.e., on an ideal CPU it would be running, but in reality it's sitting and waiting for its turn), start a timer to track how much time it should have run for. Each time the timer ticks, it assigns the task another 1/(current length of run queue) ticks' worth of CPU credit, since that's how much it should have gotten over that tick interval. (Notice that this might vary from tick to tick, because the length of the run queue varies over time.) And then whenever a task actually runs, we watch how much CPU time it uses, and decrement the accumulated credit by the appropriate amount. Now our scheduling policy is simple: at each schedule point, pick the task that has the most credit accumulated, and start running it.
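
As a concrete sketch of that naive version (the names are made up, nothing trio-specific):

```python
class Task:
    def __init__(self):
        self.credit = 0.0  # CPU time this task is "owed" but hasn't used yet

def on_tick(run_queue, tick_length):
    # Every task waiting in the run queue earns its fair share of this tick.
    if not run_queue:
        return
    for task in run_queue:
        task.credit += tick_length / len(run_queue)

def pick_next(run_queue):
    # Scheduling policy: run whoever has accumulated the most credit.
    return max(run_queue, key=lambda task: task.credit)

def account_for_run(task, cpu_time_used):
    # After the task yields, bill it for the CPU time it actually used.
    task.credit -= cpu_time_used
```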

Theoretically, this works great! That's all there is to it. BUT, the problem is that this implementation is slow, because if we have N tasks in the run queue then on every tick we have to do an annoying O(N) loop to scan through and update their credit balance. So we want to keep that behavior, but find a way to implement it that's more algorithmically efficient. One impulse might be to start looking for heuristics, but it turns out that we don't need any – there is a beautiful trick that lets us make this more efficient while remaining exactly fair!

Here's the trick: we introduce a "virtual clock" (or vclock for short). Our virtual clock ticks in virtual seconds (vseconds). The vclock always ticks at a rate of 1 vsecond per N wall seconds, where N is the number of runnable tasks – so this means that it's constantly speeding up and slowing down as tasks enter and leave the run queue. But don't worry about that – it will turn out that we don't have to keep track of the relationship between vtime and realtime in any kind of detailed way. The relationship is well-defined, but like, pretend that someone else is doing all the annoying bookkeeping work to keep track of it.

Instead of doing that, we're going to take advantage of a simple invariant. Imagine that we had an ideal "GPS" CPU and two tasks running on it, so they're each receiving 50% of the CPU time. This means that over 2 real seconds, they each receive 1 second of CPU time. It also means that over 2 real seconds, the vclock advances by 1 vsecond. In fact, this holds in general: for an ideal GPS CPU, over any time period where a task gets a total of 1 real second of CPU time, the vclock advances by 1 vsecond. This works for arbitrary numbers of tasks, and it even works if the number of tasks fluctuates over time: if a new task starts running then every task starts getting a smaller proportion of the CPU, and the vclock slows down by exactly the same amount, so it still works; if a task goes to sleep then something similar happens in reverse.

It's kind of brain-melty, but it suggests a shockingly simple algorithm, which is efficient and gives perfectly fair scheduling. Give each task a vtime counter. This counter shows the time on the vclock at which the task starts deserving more CPU time. At the start, we initialize these all to the same arbitrary value. Then:

- Whenever we need to pick a task to run, pick the runnable task whose vtime is smallest.
- Whenever a task runs for some amount of CPU time, increment its vtime by that amount.

That's it!

One way to think about this: we're simulating our ideal CPU running these tasks over virtual time, and our simulation has a single linear timeline measured by the vclock; each task's vtime counter shows the point on this shared timeline that our simulation of this task has reached.

Another, way more intuitive way to think about it: we just keep track of how long each task has been allowed to run, and pick the one where this value is smallest. However, the limitation of this simple intuition is that it doesn't help you deal with initializing the value when a new task enters the run queue.
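
Putting that together, here's a minimal sketch of the core loop (hypothetical names, wall-clock timing, no weights, and punting on the new/woken-task question discussed next):

```python
import heapq
import itertools
import time

class Task:
    def __init__(self, step):
        self.step = step   # callable that runs one cooperative step of the task
        self.vtime = 0.0   # this task's position on the shared virtual timeline

class WFQScheduler:
    def __init__(self):
        self._heap = []                    # entries: (vtime, tiebreak, task)
        self._tiebreak = itertools.count()

    def reschedule(self, task):
        heapq.heappush(self._heap, (task.vtime, next(self._tiebreak), task))

    def run_one(self):
        # Pick the runnable task with the smallest vtime...
        _, _, task = heapq.heappop(self._heap)
        start = time.perf_counter()
        task.step()
        # ...and charge it for exactly the CPU time it used.
        task.vtime += time.perf_counter() - start
        self.reschedule(task)
```

The heap keeps "pick the smallest vtime" at O(log n) per schedule point, instead of the O(n) scan that the naive version needs on every tick.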

Open question: handling new + woken tasks

For me this is the biggest open question. Notice that in the algorithm summary above, I skipped over the question of how to initialize the vtime of a newly spawned task, or what to do with the vtime of a task that's been sleeping and just woke up. The classic answer is that in both cases, you initialize the new vtime to min(task.vtime for task in runqueue). There's no attempt to heuristically reward tasks for sleeping or punish tasks that went over their time, because (a) in the GPS model they get their fair share whenever they ask for it anyway, and (b) in a pre-emptive setting they can't go over time, so there's nothing to punish for.

For new tasks, this makes sense for us too. Though there are alternatives, e.g., from a theoretical point of view it seems natural to check the parent's vtime at the moment they call spawn, and use that to initialize the child's vtime. It may not matter much either way.

For woken tasks, this isn't going to work, because it would mean that a CPU hog could block the event loop for 10 seconds, then sleep for 0.1 second and start over with a blank slate. After playing with a few options on paper, I think a plausible thing to try might be: when a task wakes up, set its vtime to min(task.vtime for task in runqueue) UNLESS this would make its vtime go backwards, in which case leave it where it is. I don't feel like I have a great intuition here though.
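
In code, that rule would be something like (sketch only, names hypothetical):

```python
def on_wake(task, run_queue):
    if run_queue:
        floor = min(t.vtime for t in run_queue)
        # Catch the waking task up to the back of the pack, but never move its
        # vtime backwards, so a CPU hog can't erase its debt by napping briefly.
        task.vtime = max(task.vtime, floor)
```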

Refinements

Hysteresis

For a pre-emptive scheduler, you want to add some hysteresis so as to avoid the situation where you have two tasks with the same vclock, you pick one, run it for 1 nanosecond, now the other task has a smaller vclock so you context switch, run that one for 1 nanosecond, now the first task has a smaller vclock again... basically the pre-emption rule is that the current task runs until either it sleeps or its vclock becomes > (min(other task's vclock) + some slop). In CFS this slop is called the "granularity" and is a tunable, but is usually set to a few milliseconds. For a cooperative scheduler, it's not clear whether this matters, since we rarely have the option of immediately re-running the same task, and there's an implicit maximum context-switch rate imposed by the granularity of individual task steps.
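
For concreteness, the pre-emption test would be something like this (a sketch, and again only meaningful for a pre-emptive scheduler):

```python
GRANULARITY = 0.004  # the "slop", in seconds; CFS defaults to a few milliseconds

def should_preempt(current, run_queue):
    others = [t for t in run_queue if t is not current]
    return bool(others) and current.vtime > min(t.vtime for t in others) + GRANULARITY
```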

Priorities / niceness values

If you want a task to receive x% of its fair share, then when incrementing its vclock, first divide by x%. (So if a task with a 50% share runs for 1 second, then its vclock increments by 2 vseconds.)
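
In terms of the sketch above, that's just a change to the charging step (task.weight is a hypothetical attribute here):

```python
# task.weight = 0.5 means "entitled to 50% of a fair share", so its vtime
# advances twice as fast for the same amount of real CPU time.
task.vtime += elapsed / task.weight
```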

Hierarchical weighted fair queuing

The idea here is that you split tasks into groups (possibly nested), and then use WFQ between and within groups. So if you have two groups you might say that they each collectively get 50% of the CPU time, and then within each group they subdivide their 50% fairly. Of course if the groups are of equal size then this is just the same as WFQ without groups, but if the groups are different sizes then it's different.

The intuition behind this is that our idea of fairness might not have the same granularity as our tasks (which are kind of an implementation detail). For example, consider a full-duplex protocol where both sides can read and write simultaneously. In trio this would generally involve making two tasks per session. And let's say we have two sessions, so four tasks total: send-to-A, recv-from-A, send-to-B, recv-from-B. And now let's say that for whatever reason, A is sending and receiving at full bandwidth, and B is sending at full bandwidth but not receiving. So we have 3 active tasks, and regular WFQ will end up allocating 66% of our server CPU to A, while B gets only 33%. Maybe it would be better if we put these tasks into an A group and a B group, so they each got 50% of the resources.

Implementing HWFQ is a bit more complicated, but not too much. I'll defer to the literature for the details though.

Actually measuring usage

For an OS kernel, it's trivial to measure how long a task has been running for, because it controls the whole system. For us it's not quite so obvious – we can ask the OS what time it is when the task starts and when it ends and then subtract them, but if our process gets scheduled away, or some other thread steals the GIL, or a GC collection runs, then whatever task happened to be running at the time will get stuck with the bill. We could potentially measure CPU time instead (Linux: clock_gettime(CLOCK_THREAD_CPUTIME_ID, ...); Mac OS: thread_info; Windows: GetThreadTimes). Potential downsides: (a) I don't think Python exposes this (though it'd be pretty trivial to get through ctypes/cffi), (b) it won't count any time spent actually blocked doing I/O, if the task does blocking I/O for some reason. I mean hopefully it won't, but... if nothing else it can incur page faults. Though possibly we shouldn't penalize for page faults anyway, idk.

I guess virtualized Linux systems have basically the same problem and they seem to do OK.

A related issue is the resolution of these timers – if a task runs for a very short period, can we measure its CPU time accurately? I suppose if it comes back as 0 then we can just reschedule it and keep the clock running :-).
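
[Edit: newer CPython versions do expose a per-thread CPU clock directly, as time.thread_time() (or time.clock_gettime(time.CLOCK_THREAD_CPUTIME_ID) on Linux), so charging a task step might look something like this sketch, hypothetical names and all:]

```python
import time

def run_and_charge(task):
    # Per-thread CPU time: time spent scheduled away, or spent in other
    # threads holding the GIL, mostly isn't billed to this task.
    start = time.thread_time()
    task.step()
    used = time.thread_time() - start
    # If the clock's resolution rounds a short step down to zero, we could
    # just reschedule the task and keep the clock running, as noted above.
    task.vtime += used
```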

References

njsmith commented 7 years ago

Noting for future reference, a sketch of how to implement ParkingLot for WFQ:

The idea is that ParkingLot should be fair when picking the next task to dequeue... but if we're using WFQ, then this is a bit tricky. If there are multiple tasks in the ParkingLot that are all potentially runnable (according to the scheduler), then it should wake whichever one parked first (FIFO rule). But if some are potentially runnable and some are not, then it doesn't matter what order the non-runnable tasks arrived in, we should wake one of the runnable ones. (Conceptually, the non-runnable tasks are visitors from the vfuture, and haven't really parked yet.)

One efficient approach would be to keep two internal queues: one sorted by vtime, and one sorted by arrival time. When we want to dequeue, first inspect the vtime queue for any tasks that have become potentially runnable, and move them into the arrival-time queue (which is sorted by the time they called park, not the time they get moved into this subqueue). Then dequeue from the arrival-time queue until it's empty; then switch to dequeuing from the vtime queue. But, this has two limitations: (a) it doesn't play well with hierarchical WFQ, (b) by greedily dequeuing at unpark time, we potentially give the wrong answer. Imagine that the arrival-time queue is empty, so we dequeue something from the vtime queue. But then immediately after this – and well before the task we just scheduled actually becomes eligible to run – a new task parks, with an even lower vtime. Ideally what we should do is "unschedule" the first task and schedule this first one instead. But this is tricky to implement...
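
A rough sketch of that two-queue structure (hypothetical names, glossing over exactly how "potentially runnable" is determined, and inheriting the limitations just described):

```python
import heapq
import itertools

class TwoQueueLot:
    def __init__(self):
        self._by_vtime = []    # heap of (vtime, seq, task): not yet "runnable"
        self._by_arrival = []  # heap of (park_time, seq, task): FIFO among runnable
        self._seq = itertools.count()

    def park(self, task, now):
        task.park_time = now
        heapq.heappush(self._by_vtime, (task.vtime, next(self._seq), task))

    def unpark_one(self, current_vtime):
        # Promote tasks the scheduler would already consider runnable, keeping
        # them ordered by when they originally parked...
        while self._by_vtime and self._by_vtime[0][0] <= current_vtime:
            _, seq, task = heapq.heappop(self._by_vtime)
            heapq.heappush(self._by_arrival, (task.park_time, seq, task))
        # ...then wake FIFO among those; otherwise fall back to smallest vtime.
        for heap in (self._by_arrival, self._by_vtime):
            if heap:
                return heapq.heappop(heap)[-1]
        return None
```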

Perhaps a better way to think about it is: when a ParkingLot has pending unparks, then we expose all of the waiting tasks to the scheduler, with their FIFO position used as a tie-breaker between tasks that have the same vruntime. Then as the scheduler picks tasks to actually run, we decrement the count of pending unparks, and when it hits zero we hide the remaining waiting tasks from the scheduler again.

Putting a big pile of tasks on/off the scheduler like this might be very difficult to do without quadratic costs, though. I think for regular WFQ this is probably avoidable (since there we know which task is going to unpark if any; the only possibility is that it might get pre-empted); for HWFQ it's... challenging.

njsmith commented 7 years ago

The echo client in the tutorial is an interesting example of a case where ideally we might want to support task priorities (and maybe even hierarchical priorities).

The tutorial makes a big deal out of how splitting sending and receiving into two separate tasks avoids deadlocks and preserves good latency, but there's still a nasty case it could potentially get into where if the receiver just isn't scheduled often enough relative to the sender, then latency could accumulate up to the limits of the network buffer. (So not an infinite amount – eventually the sendall will start blocking and giving the recv priority – but this could involve a fair amount of latency, especially if we aren't using a fancy network with AQM and BBR.)

One solution would be to explicitly declare that the receiver task always has priority over the corresponding sender task. (Hierarchical scheduling would make it possible to declare exactly this; without hierarchical scheduling, we could still say that all receiver tasks have priority over all sender tasks.)

I'm not sure how typical the echo client is of real-world scheduling problems, but it's always nice to have simple concrete examples!

njsmith commented 7 years ago

There's a little bit of discussion of better-than-FIFO scheduling policies in this thread on async-sig. This also spurred me to read a bit more, and I've discovered that what I call WFQ above might be what other people call "SFQ", as e.g. discussed in this chapter. (The usual cite for SFQ is GGV96, A hierarchical cpu scheduler for multimedia operating systems.)

There's also Earliest eligible virtual deadline first, which is what Con Kolivas cites as the inspiration for BFS/MuQSS (though I'm not sure those schedulers as implemented correspond to any well-articulated model).

I don't currently understand how and why SFQ and EEVDF differ.

matham commented 7 years ago

[@njsmith moved this comment to a new issue: #284]

pfalcon commented 6 years ago

Currently we use a simple round-robin scheduling strategy. It isn't technically FIFO, because we randomize the order of execution within each "batch" in an attempt to force people not to make too many assumptions about the scheduling strategy :-).

That's weird. Usually, determinism is a valued property in scheduling, as it offers "higher" fairness. As an example, another asyncio alternative had to go out of its way to offer deterministic scheduling, because otherwise it was "randomized" to the point of not being fair: https://github.com/micropython/micropython-lib/issues/140 ("out of its way" there means spending 33-50% more memory on the queue structure, in an implementation otherwise tightly optimized for memory usage).

njsmith commented 6 years ago

Hi @pfalcon! Can you give more details on the problems you ran into with lack of fairness in uasyncio? (Interesting project by the way, I hadn't encountered it before.) I read the linked bug report but couldn't figure out what exactly the problem was, or what degree of unfairness the prior algorithm allowed.

To make sure it's clear: trio's current scheduling algorithm is randomized, but is still careful to put a strict bound on worst-case unfairness. If you have two tasks executing 'await sleep(0)' in a loop, then strict FIFO would give you the schedule ABABABAB... Trio doesn't guarantee that, but the worst that the current scheduler would allow is ABBAABBAABB... (and in practice since it's randomized you'll get something in between those two schedules – it's easy to try this with a print in the loop and see what happens).

pfalcon commented 6 years ago

Can you give more details on the problems you ran into with lack of fairness in uasyncio?

uasyncio initially used the heapq module for its scheduling queue, with the usual trick for letting several tasks be scheduled for the same time unit: https://docs.python.org/3/library/heapq.html#priority-queue-implementation-notes (store entries as a 3-element list including the priority, an entry count, and the task). Having a 3-element entry instead of a 2-element one meant spending 50% more memory on it (with MicroPython heaps counted in kilobytes), and was one of the big motivations to rewrite it as a dedicated "timeq" module.

But then it turned out that, because the heap data structure doesn't provide a "stable sort" order, if a number of tasks are scheduled for the same time unit, some of them don't get scheduled fairly. If the other tasks also called uasyncio.sleep(0), they got rescheduled for the same time unit (a millisecond for uasyncio) and ahead of the "unlucky" task(s). As an example, if we scheduled tasks 1, 2, 3, 4, then only tasks 1, 2, 3 had a chance to run, while 4 was always dangling at the end of the queue and barely had a chance to run.

There are of course a few conditions which lead to that, namely: a) the finite precision of time scheduling; b) not truly randomized, but rather biased, "sorting". But the point still stands: deterministic scheduling is a boon, which helps to fight even unfortunate circumstances like the above. So, in uasyncio, if tasks were scheduled in the order 1, 2, 3, 4, they will run in that order (and if they each just yield to the loop (no timed wait, no I/O), they will keep running in that order).

pfalcon commented 6 years ago

(Interesting project by the way, I hadn't encountered it before.)

You can't imagine how insightful I find Trio (mostly its docs). Discovering Trio actually prompted me to compile this timeline: https://github.com/pfalcon/micropython-lib/wiki/AsyncioAlternativesTimeline . You can compare my rhetoric in https://groups.google.com/d/msg/python-tulip/JA0-FC_pliA/knMvVGxp2WsJ with what you write in the first part of https://vorpus.org/blog/some-thoughts-on-asynchronous-api-design-in-a-post-asyncawait-world/ .

njsmith commented 6 years ago

If the other tasks also called uasyncio.sleep(0), they got rescheduled for the same time unit (a millisecond for uasyncio) and ahead of the "unlucky" task(s). As an example, if we scheduled tasks 1, 2, 3, 4, then only tasks 1, 2, 3 had a chance to run, while 4 was always dangling at the end of the queue and barely had a chance to run.

Ah, right, that kind of task starvation is definitely bad. Curio also had a number of issues like this before I fixed them. Trio avoids this by dequeuing tasks in batches: first it collects up everything that's ready to run right now, then it runs all those, and then only after it's finished that batch does it go back and look for more things to run. So if tasks 1, 2, 3 reschedule themselves immediately, that's fine, they get put in the queue, but we don't look at that queue until after we've finished running task 4 and go back to create our next batch. And the randomization happens within each batch, so it can't create any kind of long-term unfairness – at the boundaries between batches, every task has (deterministically!) run exactly as many times as it would have with strict FIFO.
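
A toy version of that batching loop (not the actual implementation) looks something like:

```python
import random
from collections import deque

run_queue = deque()   # tasks that have become runnable

def run_next_batch():
    # Snapshot everything that's runnable *right now*; tasks that immediately
    # reschedule themselves land in the *next* batch, not this one.
    batch = list(run_queue)
    run_queue.clear()
    random.shuffle(batch)   # randomize only within the batch
    for task in batch:
        task.step()         # may call run_queue.append(...) to reschedule itself
```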


https://groups.google.com/d/msg/python-tulip/JA0-FC_pliA/knMvVGxp2WsJ

Well, now I feel even sillier for having missed uasyncio before! It's definitely exciting to find others exploring the space of asyncio alternatives...

pfalcon commented 6 years ago

Yeah, I also hadn't heard about Trio, nor about the async-sig mailing list; I've now subscribed to it.

Anyway, as you created a specific ticket for the MicroPython stuff, #351, it makes sense to keep the related discussion there.

njsmith commented 6 years ago

Another interesting real-world example of scheduling difficulties, from #twisted:

<Matthew[m]> 18:09:01> meejah: matrix.org's matrix homeserver (synapse, written in twisted) is very overloaded with traffic atm and we were seeing reactor starvation. frustratingly this was causing incredibly slow HTTP responses to clients, seemingly due to the network writes being starved of reactor time
<Matthew[m]> 18:09:14> which was causing clients to time out, causing them to hammer, causing a feedback death spiral.
<Matthew[m]> 18:09:49> hence we were wondering if there was some way you could hint to twisted's scheduler to not let CPU-heavy deferreds starve out "just send this network data to the client" deferreds.
<Matthew[m]> 18:10:26> (the query is 2nd hand though; i'm asking on behalf of the guy who was desperately trying to get the server back up at the time)
<runciter> 18:10:58> Matthew[m]: generally, no, as Deferreds have nothing to do with the reactor
<runciter> 18:11:40> there are different kinds of reactor starvation
<runciter> 18:12:12> for example, the callback queue can grow to the point where file descriptor events aren't serviced
<Matthew[m]> 18:13:23> i may be using the wrong terminology by blaming deferreds from contributing to the starvation then
<Matthew[m]> 18:13:42> is there a way to measure the reactor health to see why writes would be trickling?
<Matthew[m]> 18:14:19> (we were seeing 30 minutes to send 1MB of precalculated data)
<runciter> 18:14:34> under heavy load?
<runciter> 18:15:47> generally, no, twisted does not have APIs to collect metrics beyond whatever python gives you
<runciter> 18:16:14> could i reproduce this locally if i were so inclined?
<Matthew[m]> 18:28:21> yeah, under heavy load
<Matthew[m]> 18:28:25> we don’t have a test case
<runciter> 18:28:46> Matthew[m]: there's that load test script - is it hard to generate the 1MB of data? 
<Matthew[m]> 18:29:19> i will see if we can provide one in the morning, but in the end we got it back under control by firing up more worker processes and tuning caches to speed up everything else it was doing

njsmith commented 6 years ago

Just learned some more about how twisted deals with CPU-bound tasks.

The recommended method is to use twisted.internet.task.cooperate, which provides a simple albeit kinda ad hoc API for computationally bound tasks. (You write your CPU-bound task as a generator, and use yield to mark the places where you want to sleep.) Previously I thought this was just a convenience API, but there's actually a bit more to it: all tasks running under cooperate are scheduled, well, cooperatively, using an interesting algorithm. On each event loop tick, the cooperate scheduler runs once, and iterates through all cooperate tasks -- but if this is taking a while, then it stops and leaves some for the next tick. You can think of it as arranging all the cooperate tasks in a circle, and on each event loop tick it goes through as many as it can before a stopping condition is reached. Then on the next tick it picks up again where it left off.

The default stopping condition is: if it's been less than 10 ms, run another task. (Or put another way: on each tick it runs as many tasks as can fit in 10 ms, plus one.)

So basically this is a way of scheduling CPU-bound tasks as a group, and then throttling that group as a whole.
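
A toy sketch of that behavior (not twisted's actual code; the generator tasks and names here are made up):

```python
import time
from collections import deque

TICK_BUDGET = 0.010          # the default stopping condition: 10 ms
cooperate_tasks = deque()    # the "circle" of CPU-bound generator tasks

def run_cooperate_group_once():
    # Called once per event loop tick: run tasks round-robin until the budget
    # is spent, then pick up from the same spot on the next tick.
    deadline = time.monotonic() + TICK_BUDGET
    while cooperate_tasks and time.monotonic() < deadline:
        task = cooperate_tasks.popleft()
        try:
            next(task)                     # run until its next yield
            cooperate_tasks.append(task)   # back into the circle
        except StopIteration:
            pass                           # this one is finished
```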

I suspect that a smart scheduler could do some of this automatically, on the assumption that CPU-bound tasks will use more CPU than IO-bound tasks and thus get automatically throttled for fairness. Some way to explicitly mark tasks as lower priority might also be useful. It's also interesting to think about what the twisted approach actually ends up doing, in this conceptual framework – it's not quite like either a Linux nice value (= the weight in WFQ) or a Linux realtime priority (= lower priority tasks don't run at all while a higher priority task can run).

njsmith commented 5 years ago

This article on "scheduling capabilities" also looks neat! I haven't read it yet, and it's probably more about static "realtime"-ish priorities than the WFQ-type stuff discussed above, but I want to at least capture the link: https://dl.acm.org/citation.cfm?doid=3190508.3190539

njsmith commented 4 years ago

Another likely example of strict-checkpoint-fairness causing problems: https://gitter.im/python-trio/general?at=5f526634dfaaed4ef52ef17f

Best-guess analysis: https://gitter.im/python-trio/general?at=5f536ec9a5788a3c29d5f248

[Edit: turns out that this was probably a false alarm, though the scenario described there is still plausible, even if it's not what was going on here.]