jlelli / linux

SCHED_DEADLINE. An implementation of the popular Earliest Deadline First (EDF) scheduling algorithm for the Linux kernel. Fork tracking upstream. Intended for further development and issue tracking.
https://en.wikipedia.org/wiki/SCHED_DEADLINE

Implement hierarchical scheduling for SCHED_DEADLINE (H-CBS) #4

Open jlelli opened 6 years ago

jlelli commented 6 years ago

Implement hierarchical RT scheduling by nesting the SCHED_RT fixed-priority scheduler within SCHED_DEADLINE reservations, i.e. allow groups of tasks to be scheduled inside a SCHED_DEADLINE reservation, with tasks within each group chosen according to their RT priorities.

+-- SCHED_DL scheduler
   +-- SCHED_DL task 1 <rt=..., dl=..., period=...>
   +-- SCHED_DL task 2 <rt=..., dl=..., period=...>
   +-- ...
   +-- SCHED_RT group 1 <rt=..., period=...>
   |  +-- T1 <rtprio=...>
   |  +-- T2 <rtprio=...>
   +-- SCHED_RT group 2 <rt=..., period=...>
   |  +-- T3 <rtprio=...>
   |  +-- T4 <rtprio=...>
   ...
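
For reference, a rough sketch (hypothetical names, not the actual patchset) of the data-structure shape this implies: each task group owns one SCHED_DEADLINE server per CPU, and that server, once selected by EDF, runs the group's per-CPU RT runqueue by fixed priority.

struct hcbs_task_group {			/* illustration only */
	struct sched_dl_entity	**dl_se;	/* one DL server per CPU: <runtime, deadline, period> */
	struct rt_rq		**rt_rq;	/* per-CPU FIFO/RR runqueues served by those DL servers */
	u64			dl_runtime;	/* group reservation runtime (ns) */
	u64			dl_period;	/* group reservation period (ns) */
};
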
jlelli commented 6 years ago

First RFC posted on LKML: https://lwn.net/Articles/718645/

jlelli commented 6 years ago

Rebased on tip/master by @lucabe72 https://github.com/lucabe72/LinuxPatches/tree/Hierarchical_CBS-patches

jlelli commented 6 years ago

Skimming through the rebased patches I see the following problems/considerations:

- DEADLINE servers have stricter affinity requirements (w.r.t. current RT_GROUP_SCHED), how to deal with current users expectations?
- RT_RUNTIME_SHARE goes away, so again this might be a problem with today's users
- root level scheduling is different (EDF vs. Fixed Prio), existing users might see changes of behaviour
- RT throttling works at root level as well (even when groups are not used), what about DEADLINE?

lucabe72 commented 6 years ago

Hi, just trying to reply to check if github issues are usable to keep track of the discussion:

DEADLINE servers have stricter affinity requirements (w.r.t. current RT_GROUP_SCHED), how to deal with current users expectations?

You mean that the SCHED_DEADLINE tasks' affinity should be set to the whole root domain, right? The issue here is that we create a dl server per CPU/core, so if the served FIFO or RR tasks have stricter affinity we risk wasting some CPU bandwidth. In theory, we could try to create a dl server only for the CPUs/cores in the cgroup/taskset, but I am not sure how to handle the admission control...
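
A minimal sketch of that admission-control concern, assuming (as above) one DL server per CPU of the root domain; the helper name and fixed-point convention are illustrative, not from the patchset:

static bool hcbs_admit_server(u64 used_bw,	/* DL bandwidth already admitted on this CPU, <<20 fixed point */
			      u64 max_bw,	/* cap, e.g. 95% of a CPU, <<20 fixed point */
			      u64 runtime_ns, u64 period_ns)
{
	u64 new_bw = (runtime_ns << 20) / period_ns;	/* bandwidth = runtime / period */

	/* Since the group owns a server on every CPU, this check has to pass
	 * on each CPU of the root domain; creating servers only on the CPUs
	 * of a cpuset would require rethinking which CPUs to test. */
	return used_bw + new_bw <= max_bw;
}
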

RT_RUNTIME_SHARE goes away, so again this might be a problem with today's users

Is RT_RUNTIME_SHARE really used in practice? In any case, we have a different mechanism to get a similar behaviour: when the runtime on a CPU/core is exhausted, instead of "migrating runtime" from other cores we migrate the served tasks to other cores whose current runtime is > 0 (if their affinities allow the migration). I am not sure about current users' expectations here, but I believe the "migrate when runtime=0" behaviour can satisfy them
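
In (hypothetical) code, the "migrate when runtime=0" idea could look roughly like this; none of these helpers exist under these names, they only illustrate the behaviour described above:

static void hcbs_budget_exhausted(struct hcbs_task_group *tg, int this_cpu)
{
	struct task_struct *p;

	/* hypothetical iterator over RT tasks queued on this CPU's group runqueue */
	for_each_group_rt_task(p, tg, this_cpu) {
		int cpu = hcbs_find_cpu_with_runtime(tg, p);	/* must honour p's affinity mask */

		if (cpu >= 0)
			hcbs_push_task_to(p, cpu);		/* move the task, not the runtime */
		/* otherwise p waits for the local server's replenishment */
	}
}
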

root level scheduling is different (EDF vs. Fixed Prio), existing users might see changes of behaviour

Not sure what we can do here... Yes, we change from FP to EDF, but this is the whole point of the patch :)

RT throttling works at root level as well (even when groups are not used), what about DEADLINE?

Yes, this is an issue... We should implement throttling for the root group too (not sure how difficult this will be, though). I'll try to have a look in the next months (first, I want to clean up the patchset)

lucabe72 commented 6 years ago

As an additional point, the obvious TODO item I want to address before the others is a patchset cleanup:

jlelli commented 6 years ago

DEADLINE servers have stricter affinity requirements (w.r.t. current RT_GROUP_SCHED), how to deal with current users expectations?

You mean that the SCHED_DEADLINE tasks' affinity should be set to the whole root domain, right? The issue here is that we create a dl server per CPU/core, so if the served FIFO or RR tasks have stricter affinity we risk wasting some CPU bandwidth. In theory, we could try to create a dl server only for the CPUs/cores in the cgroup/taskset, but I am not sure how to handle the admission control...

Right. I fear that current RT_GROUP_SCHED users are used to freely managing their tasks' affinities, while we will be forcing them to adhere to stricter rules (even if we find a way to relax the "whole root domain" requirement). I'm not sure this is feasible at all. :-/

jlelli commented 6 years ago

Is RT_RUNTIME_SHARE really used in practice?

Not sure. We should ask users.. or remove it and see who complains. :-)

In any case, we have a different mechanism to get a similar behaviour: ...

Mmm, right. This might actually work, even though we will be increasing migrations and maybe introducing latencies by doing so?

lucabe72 commented 6 years ago

Since we are talking about issues... I just got this:

[37664.249222] WARNING: CPU: 2 PID: 3289 at /home/luca/Src/Kernel/tip/source/kernel/sched/deadline.c:326 task_non_contending+0x297/0x3e0
[37664.249224] Modules linked in: veth xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack xt_tcpudp iptable_filter bridge ipmi_ssif stp llc intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp shpchp ipmi_si coretemp ipmi_devintf ipmi_msghandler mei_me intel_cstate lpc_ich dcdbas acpi_power_meter intel_rapl_perf mei mac_hid kvm_intel kvm irqbypass ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear mgag200 i2c_algo_bit ttm drm_kms_helper crct10dif_pclmul crc32_pclmul ghash_clmulni_intel
[37664.249285]  syscopyarea pcbc sysfillrect aesni_intel sysimgblt aes_x86_64 fb_sys_fops crypto_simd glue_helper drm mxm_wmi megaraid_sas cryptd tg3 ahci libahci wmi
[37664.249301] CPU: 2 PID: 3289 Comm: node Not tainted 4.16.0-rc1-HCBS+ #3
[37664.249303] Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.5.5 08/16/2017
[37664.249306] RIP: 0010:task_non_contending+0x297/0x3e0
[37664.249307] RSP: 0018:ffffbcdac8513cd8 EFLAGS: 00010002
[37664.249310] RAX: 0000000000000001 RBX: ffff9726dfdd5a00 RCX: 0000000000000000
[37664.249311] RDX: 0000000000527dcc RSI: 0000000000000047 RDI: ffff9726dfdd5a98
[37664.249312] RBP: ffff9726e2901800 R08: 0000000000000000 R09: ffff97271f417800
[37664.249314] R10: 000022416562e81a R11: 0000000000000000 R12: ffff9726dfdd5a98
[37664.249315] R13: ffff97271f862a00 R14: ffff97271f8632d0 R15: ffff9726e2c9af00
[37664.249317] FS:  00007f5874629740(0000) GS:ffff97271f840000(0000) knlGS:0000000000000000
[37664.249319] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[37664.249320] CR2: 0000000000803118 CR3: 0000000856d9a002 CR4: 00000000003606e0
[37664.249322] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[37664.249323] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[37664.249324] Call Trace:
[37664.249334]  dequeue_task_rt+0x1f2/0x300
[37664.249341]  __schedule+0xf6/0x850
[37664.249348]  ? ep_item_poll.isra.10+0x34/0x90
[37664.249351]  schedule+0x28/0x80
[37664.249356]  schedule_hrtimeout_range_clock+0x177/0x190
[37664.249360]  ? ep_scan_ready_list.constprop.17+0x208/0x210
[37664.249363]  ep_poll+0x2a3/0x3b0
[37664.249369]  ? wake_up_q+0x70/0x70
[37664.249373]  SyS_epoll_pwait+0x193/0x210
[37664.249379]  do_syscall_64+0x68/0x120
[37664.249383]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[37664.249386] RIP: 0033:0x7f5873bdf080
[37664.249387] RSP: 002b:00007ffc049788a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000119
[37664.249390] RAX: ffffffffffffffda RBX: 00007ffc04978900 RCX: 00007f5873bdf080
[37664.249391] RDX: 000000000000000a RSI: 00007ffc049788f0 RDI: 0000000000000004
[37664.249392] RBP: 0000000000000005 R08: 0000000000000000 R09: 0000000000000008
[37664.249393] R10: ffffffffffffffff R11: 0000000000000246 R12: 00007ffc04978900
[37664.249394] R13: 0000000000000000 R14: 0000000000000003 R15: 0000000000000000
[37664.249396] Code: 89 df 41 ff d4 48 85 ed 74 84 8b 85 8c 00 00 00 85 c0 0f 89 4f fe ff ff 48 8b 45 10 48 83 c0 80 0f 85 68 ff ff ff e9 3c fe ff ff <0f> 0b e9 ab fd ff ff 0f 0b e9 ae fd ff ff 80 3d d8 90 54 01 00
[37664.249442] ---[ end trace 50bd31591d19efc4 ]---

I'll look at it in the next days

jlelli commented 6 years ago

Is this (https://github.com/lucabe72/LinuxPatches/blob/89c4c6e25eee0a0c37dba8f1bf6acf50d0e9aa67/0009-Allow-deeper-hierarchies-of-RT-cgroups.patch#L6) the same as saying "only leaf groups can contain RT tasks"? How does it work today for RT groups?

jlelli commented 6 years ago

How does it work today for RT groups?

Huh.. the user is free to do as they please, so tasks can just get starved.. nice! :/

jlelli commented 6 years ago

Started a Wiki page to keep track of design choices: https://github.com/jlelli/linux/wiki/Hierarchical-CBS-design

lucabe72 commented 6 years ago

On 15 February 2018 at 15:16, Juri Lelli notifications@github.com wrote:

DEADLINE servers have stricter affinity requirements (w.r.t. current RT_GROUP_SCHED), how to deal with current users expectations?

You mean that the SCHED_DEADLINE tasks' affinity should be set to the whole root domain, right? The issue here is that we create a dl server per CPU/core, so if the served FIFO or RR tasks have stricter affinity we risk wasting some CPU bandwidth. In theory, we could try to create a dl server only for the CPUs/cores in the cgroup/taskset, but I am not sure how to handle the admission control...

Right. I fear that current RT_GROUP_SCHED users are used to freely managing their tasks' affinities, while we will be forcing them to adhere to stricter rules (even if we find a way to relax the "whole root domain" requirement). I'm not sure this is feasible at all. :-/

Notice that the dl entities will be created on every CPU/core, but the FIFO and RR tasks in the group can have generic affinities without issues. The RT tasks affinities will be correctly respected (I think :), so the only issue is a bandwidth waste... But we will not impose restrictions on the RT_GROUP_SCHED users... Or am I missing something?

jlelli commented 6 years ago

On 15/02/18 14:11, Luca Abeni wrote:

On 15 February 2018 at 15:16, Juri Lelli notifications@github.com wrote:

DEADLINE servers have stricter affinity requirements (w.r.t. current RT_GROUP_SCHED), how to deal with current users expectations?

You mean that the SCHED_DEADLINE tasks' affinity should be set to the whole root domain, right? The issue here is that we create a dl server per CPU/core, so if the served FIFO or RR tasks have stricter affinity we risk wasting some CPU bandwidth. In theory, we could try to create a dl server only for the CPUs/cores in the cgroup/taskset, but I am not sure how to handle the admission control...

Right. I fear that current RT_GROUP_SCHED users are used to freely managing their tasks' affinities, while we will be forcing them to adhere to stricter rules (even if we find a way to relax the "whole root domain" requirement). I'm not sure this is feasible at all. :-/

Notice that the dl entities will be created on every CPU/core, but the FIFO and RR tasks in the group can have generic affinities without issues. The RT tasks affinities will be correctly respected (I think :), so the

Ah, right. I guess the only remaining problem (to see if it's really such a big problem) is that pinned tasks will see differences w.r.t. today's RT_RUNTIME_SHARE, as they won't be able to migrate and consume all the bandwidth available to their group. As said, maybe not a big deal.

only issue is a bandwidth waste... But we will not impose restrictions on

Bandwidth can now be reclaimed with GRUB. So, not much waste in the busy case (with other DEADLINE servers active).

lucabe72 commented 6 years ago

Right. I fear that current RT_GROUP_SCHED users are used to freely managing their tasks' affinities, while we will be forcing them to adhere to stricter rules (even if we find a way to relax the "whole root domain" requirement). I'm not sure this is feasible at all. :-/

Notice that the dl entities will be created on every CPU/core, but the FIFO and RR tasks in the group can have generic affinities without issues. The RT tasks affinities will be correctly respected (I think :), so the

Ah, right. I guess the only remaining problem (to see if it's really such a big problem) is that pinned tasks will see differences w.r.t. today's RT_RUNTIME_SHARE, as they won't be able to migrate and consume all the bandwidth available to their group. As said, maybe not a big deal.

Uhm... At this point, I need to understand what the current expected behaviour for RT_GROUP_SCHED (and RT_RUNTIME_SHARE) is... If a task is pinned to a CPU, why insert it in a group that also has runtime on other CPUs? And is it really expected to consume runtime from other CPUs? I suspect this would allow the task to starve its local CPU?

Anyway, maybe it is better to discuss this directly...

only issue is a bandwidth waste... But we will not impose restrictions on

Bandwidth can now be reclaimed with GRUB. So, not much waste in the busy case (with other DEADLINE servers active).

Ah, good! I was forgetting about it :) Now, the question is: should we enable GRUB/RECLAIMING for cgroups? (I think the answer is "yes") Should we enable it unconditionally? (I do not know) If not, how can a user ask to enable reclaiming?

jlelli commented 6 years ago

On 16/02/18 00:18, Luca Abeni wrote:

Bandwidth can now be reclaimed with GRUB. So, not much waste in the busy case (with other DEADLINE servers active).

Ah, good! I was forgetting about it :) Now, the question is: should we enable GRUB/RECLAIMING for cgroups? (I think the answer is "yes") Should we enable it unconditionally? (I do not know) If not, how can a user ask to enable reclaiming?

Is it not the other way around? Entities that want to reclaim leftover bandwidth have to opt in by setting SCHED_FLAG_RECLAIM, so a group's leftover can already be reclaimed if a user wants to (nothing to change for the group). Maybe the question is: should we enable reclaiming of groups' leftover bandwidth by default for normal entities?
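
For reference, this is how a plain SCHED_DEADLINE entity opts in to reclaiming today (userspace sketch; struct sched_attr is declared by hand because glibc provides no wrapper, and the runtime/deadline/period values are arbitrary):

#define _GNU_SOURCE
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE		6
#endif
#ifndef SCHED_FLAG_RECLAIM
#define SCHED_FLAG_RECLAIM	0x02
#endif

struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
};

int main(void)
{
	struct sched_attr attr = {
		.size		= sizeof(attr),
		.sched_policy	= SCHED_DEADLINE,
		.sched_flags	= SCHED_FLAG_RECLAIM,	/* opt in to GRUB reclaiming */
		.sched_runtime	= 10 * 1000 * 1000,	/* 10 ms */
		.sched_deadline	= 30 * 1000 * 1000,	/* 30 ms */
		.sched_period	= 30 * 1000 * 1000,	/* 30 ms */
	};

	/* pid 0 == calling thread */
	return syscall(SYS_sched_setattr, 0, &attr, 0);
}
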

lucabe72 commented 6 years ago

[37664.249222] WARNING: CPU: 2 PID: 3289 at /home/luca/Src/Kernel/tip/source/kernel/sched/deadline.c:326 task_non_contending+0x297/0x3e0 [...] Call Trace: [37664.249334] dequeue_task_rt+0x1f2/0x300 [...]

I looked at this, and it is caused by patch 0006: https://github.com/lucabe72/LinuxPatches/blob/Hierarchical_CBS-patches/0006-Some-additional-fixed-to-be-squashed.patch

Looks like the following fixes the bug:

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 5443d84706a8..e70009134295 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1109,7 +1109,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
        dl_se->dl_throttled = 0;
        if (rt_rq->rt_nr_running) {
            enqueue_dl_entity(dl_se, dl_se, ENQUEUE_REPLENISH);
-           task_contending(dl_se, 0);
+           //task_contending(dl_se, 0);

            resched_curr(rq);
 #ifdef CONFIG_SMP
@@ -1118,6 +1118,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 #endif
        } else {
            replenish_dl_entity(dl_se, dl_se);
+           task_non_contending(dl_se);
        }

        raw_spin_unlock(&rq->lock);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 473d9659efaa..8970a23eda1b 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -374,7 +374,7 @@ static void update_curr_rt(struct rq *rq)
        /* A group exhausts the budget. */
        if (dl_runtime_exceeded(dl_se)) {
            dequeue_dl_entity(dl_se);
-           task_non_contending(dl_se);
+           //task_non_contending(dl_se);

            if (likely(start_dl_timer(dl_se)))
                dl_se->dl_throttled = 1;

I'll test this patch a little bit more, to check if it breaks anything else... And then I'll integrate it in patch 0006