smarkusg opened this issue 2 years ago
Hi @smarkusg @hamadmarri,
If my understanding of the patch is correct, what it seems to do is:

1. Add a `latency_nice` value for each task that can be set from userspace. This value is also propagated to the children of that task on fork.
2. If a CPU has a task with `latency_nice == -20`, prevent that CPU from entering the IDLE state.

I remember those still-open issues in CacULE with regard to IDLE/wakeup under NO_HZ settings. Since tt-scheduler can detect the type of tasks, we probably do not need userspace to manually set the `latency_nice` value for realtime/interactive tasks. Could we try something like:

1. For `REALTIME` or `INTERACTIVE` tasks, set `latency_sensitive = 1`. This `latency_sensitive` flag replaces the `latency_nice` value from that patch: for the use case in that patch, the [-20, 19] scale is never used, since the code only checks whether `latency_nice` is -20. The flag is also propagated to the children of the task on fork.
2. If a CPU has a task with `latency_sensitive = 1`, prevent that CPU from entering the IDLE state.

@hamadmarri, what do you think?
Hi @smarkusg
Thank you so much for your experiments. I am reading Parth Shah's patch, and I will check whether we can integrate it with TT as @raykzhao showed.
Regarding the benchmarks in (https://openbenchmarking.org/result/2111136-IB-SCHEDULER42&export=pdf), I see TT performing poorly in those tests. Are there any differences in kernel configs or patches?
Thank you so much for the proposal.
@smarkusg @raykzhao
I am having difficulty finding the full patch from https://lkml.org/lkml/2020/5/7/575. If anyone can help, please send me a link that contains the whole patch, or teach me how the lkml.org patch navigation works :/
Sorry about that
@raykzhao @smarkusg
Please check this patch latsens.patch.zip
I am running it right now. It currently behaves much like hz_periodic: even though I have nohz_full set, I get very similar tick counts:
cat /proc/interrupts | grep -i local
LOC: 761632 700412 701693 695537 Local timer interrupts
And the fan is crying at 1666 Hz.
Please let me know if you see any performance gain in your tests; I will run some tests soon.
Thank you
Note that realtime tasks can be assigned to all CPUs from boot onward, so there needs to be a good way to decay nr_lat_sensitive for realtime tasks that have long been asleep. For now it just behaves like periodic ticks.
R2:
Every 19 ms, nr_lat_sens gets decremented by 1. This at least relaxes the ticks for idle CPUs:
cat /proc/interrupts | grep -i local
LOC: 656412 121611 99728 85649 Local timer interrupts
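For anyone curious how R2 might implement this, here is a minimal sketch under my own assumptions (the actual latsens-r2.patch may differ); it reuses lat_sensitive_dec() from the earlier sketch:

```c
/*
 * Sketch of the R2 decay idea: roughly every 19 ms, drop one reference
 * from the per-CPU counter, so idle CPUs eventually stop ticking for
 * tasks that went to sleep long ago. Names and placement are illustrative.
 */
#include <linux/jiffies.h>
#include <linux/percpu.h>

#define LAT_SENS_DECAY_MS	19

static DEFINE_PER_CPU(unsigned long, lat_sens_next_decay);

/* Called from the scheduler tick on @cpu. */
static void lat_sensitive_decay(int cpu)
{
	unsigned long *next = per_cpu_ptr(&lat_sens_next_decay, cpu);

	if (time_after(jiffies, *next)) {
		*next = jiffies + msecs_to_jiffies(LAT_SENS_DECAY_MS);
		lat_sensitive_dec(cpu);	/* from the sketch above */
	}
}
```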
Here is the original latency_nice patch:
From d6fa9e1bc40d6c563ad56718ae1813fec361c943 Mon Sep 17 00:00:00 2001
From: "P. Jung" <ptr1337@cachyos.org>
Date: Sun, 21 Nov 2021 11:15:01 +0000
Subject: [PATCH] latency-test
Signed-off-by: P. Jung <ptr1337@cachyos.org>
---
include/linux/sched.h | 1 +
include/uapi/linux/sched.h | 4 +++-
include/uapi/linux/sched/types.h | 19 +++++++++++++++++++
init/init_task.c | 1 +
kernel/sched/core.c | 26 ++++++++++++++++++++++++++
kernel/sched/debug.c | 1 +
kernel/sched/sched.h | 18 ++++++++++++++++++
tools/include/uapi/linux/sched.h | 4 +++-
8 files changed, 72 insertions(+), 2 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index c1a927ddec64..2acfec4589e2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -774,6 +774,7 @@ struct task_struct {
int static_prio;
int normal_prio;
unsigned int rt_priority;
+ int latency_nice;
const struct sched_class *sched_class;
struct sched_entity se;
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 3bac0a8ceab2..b2e932c25be6 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -132,6 +132,7 @@ struct clone_args {
#define SCHED_FLAG_KEEP_PARAMS 0x10
#define SCHED_FLAG_UTIL_CLAMP_MIN 0x20
#define SCHED_FLAG_UTIL_CLAMP_MAX 0x40
+#define SCHED_FLAG_LATENCY_NICE 0x80
#define SCHED_FLAG_KEEP_ALL (SCHED_FLAG_KEEP_POLICY | \
SCHED_FLAG_KEEP_PARAMS)
@@ -143,6 +144,7 @@ struct clone_args {
SCHED_FLAG_RECLAIM | \
SCHED_FLAG_DL_OVERRUN | \
SCHED_FLAG_KEEP_ALL | \
- SCHED_FLAG_UTIL_CLAMP)
+ SCHED_FLAG_UTIL_CLAMP | \
+ SCHED_FLAG_LATENCY_NICE)
#endif /* _UAPI_LINUX_SCHED_H */
diff --git a/include/uapi/linux/sched/types.h b/include/uapi/linux/sched/types.h
index f2c4589d4dbf..0aa4e3b6ed59 100644
--- a/include/uapi/linux/sched/types.h
+++ b/include/uapi/linux/sched/types.h
@@ -10,6 +10,7 @@ struct sched_param {
#define SCHED_ATTR_SIZE_VER0 48 /* sizeof first published struct */
#define SCHED_ATTR_SIZE_VER1 56 /* add: util_{min,max} */
+#define SCHED_ATTR_SIZE_VER2 60 /* add: latency_nice */
/*
* Extended scheduling parameters data structure.
@@ -98,6 +99,22 @@ struct sched_param {
* scheduled on a CPU with no more capacity than the specified value.
*
* A task utilization boundary can be reset by setting the attribute to -1.
+ *
+ * Latency Tolerance Attributes
+ * ===========================
+ *
+ * A subset of sched_attr attributes allows to specify the relative latency
+ * requirements of a task with respect to the other tasks running/queued in the
+ * system.
+ *
+ * @ sched_latency_nice task's latency_nice value
+ *
+ * The latency_nice of a task can have any value in a range of
+ * [LATENCY_NICE_MIN..LATENCY_NICE_MAX].
+ *
+ * A task with latency_nice with the value of LATENCY_NICE_MIN can be
+ * taken for a task with lower latency requirements as opposed to the task with
+ * higher latency_nice.
*/
struct sched_attr {
__u32 size;
@@ -120,6 +137,8 @@ struct sched_attr {
__u32 sched_util_min;
__u32 sched_util_max;
+ /* latency requirement hints */
+ __s32 sched_latency_nice;
};
#endif /* _UAPI_LINUX_SCHED_TYPES_H */
diff --git a/init/init_task.c b/init/init_task.c
index 2d024066e27b..048d3a932e81 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -78,6 +78,7 @@ struct task_struct init_task
.prio = MAX_PRIO - 20,
.static_prio = MAX_PRIO - 20,
.normal_prio = MAX_PRIO - 20,
+ .latency_nice = 0,
.policy = SCHED_NORMAL,
.cpus_ptr = &init_task.cpus_mask,
.user_cpus_ptr = NULL,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index aea60eae21a7..fe7d49c12176 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4341,6 +4341,9 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
*/
p->prio = current->normal_prio;
+ /* Propagate the parent's latency requirements to the child as well */
+ p->latency_nice = current->latency_nice;
+
uclamp_fork(p);
/*
@@ -4357,6 +4360,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
p->prio = p->normal_prio = p->static_prio;
set_load_weight(p, false);
+ p->latency_nice = DEFAULT_LATENCY_NICE;
/*
* We don't need the reset flag anymore after the fork. It has
* fulfilled its duty:
@@ -7191,6 +7195,9 @@ static void __setscheduler_params(struct task_struct *p,
p->rt_priority = attr->sched_priority;
p->normal_prio = normal_prio(p);
set_load_weight(p, true);
+
+ if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE)
+ p->latency_nice = attr->sched_latency_nice;
}
/*
@@ -7317,6 +7324,17 @@ static int __sched_setscheduler(struct task_struct *p,
return retval;
}
+ if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE) {
+ if (attr->sched_latency_nice > MAX_LATENCY_NICE)
+ return -EINVAL;
+ if (attr->sched_latency_nice < MIN_LATENCY_NICE)
+ return -EINVAL;
+ /* Use the same security checks as NICE */
+ if (attr->sched_latency_nice < p->latency_nice &&
+ !capable(CAP_SYS_NICE))
+ return -EPERM;
+ }
+
if (pi)
cpuset_read_lock();
@@ -7351,6 +7369,9 @@ static int __sched_setscheduler(struct task_struct *p,
goto change;
if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)
goto change;
+ if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE &&
+ attr->sched_latency_nice != p->latency_nice)
+ goto change;
p->sched_reset_on_fork = reset_on_fork;
retval = 0;
@@ -7649,6 +7670,9 @@ static int sched_copy_attr(struct sched_attr __user *uattr, struct sched_attr *a
size < SCHED_ATTR_SIZE_VER1)
return -EINVAL;
+ if ((attr->sched_flags & SCHED_FLAG_LATENCY_NICE) &&
+ size < SCHED_ATTR_SIZE_VER2)
+ return -EINVAL;
/*
* XXX: Do we want to be lenient like existing syscalls; or do we want
* to be strict and return an error on out-of-bounds values?
@@ -7886,6 +7910,8 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
get_params(p, &kattr);
kattr.sched_flags &= SCHED_FLAG_ALL;
+ kattr.sched_latency_nice = p->latency_nice;
+
#ifdef CONFIG_UCLAMP_TASK
/*
* This could race with another potential updater, but this is fine
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 17a653b67006..b11a32f21164 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1038,6 +1038,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
#endif
P(policy);
P(prio);
+ P(latency_nice);
if (task_has_dl_policy(p)) {
P(dl.runtime);
P(dl.deadline);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3d3e5793e117..ea478879e67d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -106,6 +106,24 @@ extern void call_trace_sched_update_nr_running(struct rq *rq, int count);
*/
#define NS_TO_JIFFIES(TIME) ((unsigned long)(TIME) / (NSEC_PER_SEC / HZ))
+/*
+ * Latency nice is meant to provide scheduler hints about the relative
+ * latency requirements of a task with respect to other tasks.
+ * Thus a task with latency_nice == 19 can be hinted as the task with no
+ * latency requirements, in contrast to the task with latency_nice == -20
+ * which should be given priority in terms of lower latency.
+ */
+#define MAX_LATENCY_NICE 19
+#define MIN_LATENCY_NICE -20
+
+#define LATENCY_NICE_WIDTH \
+ (MAX_LATENCY_NICE - MIN_LATENCY_NICE + 1)
+
+/*
+ * Default tasks should be treated as a task with latency_nice = 0.
+ */
+#define DEFAULT_LATENCY_NICE 0
+
/*
* Increase resolution of nice-level calculations for 64-bit architectures.
* The extra resolution improves shares distribution and load balancing of
diff --git a/tools/include/uapi/linux/sched.h b/tools/include/uapi/linux/sched.h
index 3bac0a8ceab2..ecc4884bfe4b 100644
--- a/tools/include/uapi/linux/sched.h
+++ b/tools/include/uapi/linux/sched.h
@@ -132,6 +132,7 @@ struct clone_args {
#define SCHED_FLAG_KEEP_PARAMS 0x10
#define SCHED_FLAG_UTIL_CLAMP_MIN 0x20
#define SCHED_FLAG_UTIL_CLAMP_MAX 0x40
+#define SCHED_FLAG_LATENCY_NICE 0x80
#define SCHED_FLAG_KEEP_ALL (SCHED_FLAG_KEEP_POLICY | \
SCHED_FLAG_KEEP_PARAMS)
@@ -143,6 +144,7 @@ struct clone_args {
SCHED_FLAG_RECLAIM | \
SCHED_FLAG_DL_OVERRUN | \
SCHED_FLAG_KEEP_ALL | \
- SCHED_FLAG_UTIL_CLAMP)
+ SCHED_FLAG_UTIL_CLAMP | \
+ SCHED_FLAG_LATENCY_NICE)
#endif /* _UAPI_LINUX_SCHED_H */
--
2.34.0
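For testing the patch above from userspace, a small program along these lines should work (a sketch, assuming the patched kernel: an unpatched kernel rejects SCHED_FLAG_LATENCY_NICE with EINVAL, and lowering latency_nice below its current value requires CAP_SYS_NICE per the patch's permission check, so run it as root):

```c
/*
 * Set latency_nice = -20 on the calling task via sched_setattr(2).
 * glibc has no wrapper, so the raw syscall is used. The struct mirrors
 * the patched uapi layout up to SCHED_ATTR_SIZE_VER2 (60 bytes).
 */
#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SCHED_FLAG_LATENCY_NICE	0x80
#define SCHED_ATTR_SIZE_VER2	60

struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	/* SCHED_DEADLINE fields */
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	/* utilization clamps (VER1) */
	uint32_t sched_util_min;
	uint32_t sched_util_max;
	/* latency hint (VER2, added by the patch) */
	int32_t  sched_latency_nice;
};

int main(void)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = SCHED_ATTR_SIZE_VER2;
	attr.sched_flags = SCHED_FLAG_LATENCY_NICE;
	attr.sched_latency_nice = -20;	/* lowering needs CAP_SYS_NICE */

	if (syscall(SYS_sched_setattr, 0, &attr, 0) != 0) {
		perror("sched_setattr");
		return 1;
	}
	printf("latency_nice set to -20 for pid %d\n", (int)getpid());
	return 0;
}
```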
Hi @hamadmarri
The benchmark for TT marked as git20211 for kernel 5.15 came from the Xanmod repository. The configuration differences from the current Xanmod release were:
edge -> tt
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_TABLE_SORT=y
CONFIG_THREAD_INFO_IN_TASK=y
+CONFIG_TT_SCHED=y
+CONFIG_TT_ACCOUNTING_STATS=y
#
# General setup
@@ -122,7 +124,6 @@ CONFIG_USERMODE_DRIVER=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
-CONFIG_SCHED_CORE=y
#
# CPU/Task time and stats accounting
@@ -167,8 +168,7 @@ CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
#
# Scheduler features
#
-CONFIG_UCLAMP_TASK=y
-CONFIG_UCLAMP_BUCKETS_COUNT=5
+# CONFIG_UCLAMP_TASK is not set
# end of Scheduler features
CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
@@ -184,11 +184,6 @@ CONFIG_MEMCG_SWAP=y
CONFIG_MEMCG_KMEM=y
CONFIG_BLK_CGROUP=y
CONFIG_CGROUP_WRITEBACK=y
-CONFIG_CGROUP_SCHED=y
-CONFIG_FAIR_GROUP_SCHED=y
-CONFIG_CFS_BANDWIDTH=y
-# CONFIG_RT_GROUP_SCHED is not set
-CONFIG_UCLAMP_TASK_GROUP=y
CONFIG_CGROUP_PIDS=y
CONFIG_CGROUP_RDMA=y
CONFIG_CGROUP_FREEZER=y
@@ -210,8 +205,6 @@ CONFIG_USER_NS=y
CONFIG_PID_NS=y
CONFIG_NET_NS=y
CONFIG_CHECKPOINT_RESTORE=y
-CONFIG_SCHED_AUTOGROUP=y
-CONFIG_SCHED_AUTOGROUP_DEFAULT_ENABLED=y
# CONFIG_SYSFS_DEPRECATED is not set
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
@@ -513,9 +506,9 @@ CONFIG_EFI_MIXED=y
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
-CONFIG_HZ_500=y
-# CONFIG_HZ_1000 is not set
-CONFIG_HZ=500
+# CONFIG_HZ_500 is not set
+CONFIG_HZ_1000=y
+CONFIG_HZ=1000
CONFIG_SCHED_HRTICK=y
CONFIG_KEXEC=y
CONFIG_KEXEC_FILE=y
The kernel marked as 5.13.19*tt was compiled like the rest of my 5.13 kernels: the same Clang LTO=full configuration with "CONFIG_HZ_803=y" and simple settings, as for the Baby-CPU-Scheduler.
When I find a moment, I will reinstall the current TT 5.15 release for Xanmod from the site and compare it in a simple benchmark.
I will also check the nr_lat_sensitive patch and let you know.
Thanks again for your work.
Hi @everyone,
Any testing updates/findings about latsens-r2.patch?
Thank you
Good morning. First of all, I want to apologize for my English.
I also want to thank you for your hard work on Linux schedulers, and especially for the Baby-CPU-Scheduler, thanks to which I am constantly learning various algorithms. The COVID-19 pandemic pushed me into this when I had to get my 10-year-old son's laptop working for remote schooling. I want to mention at the outset that I am not a programmer.
Back to the point: I am currently testing my stock 5.13 kernel with your Cachy Scheduler v5.9-Idle plus some additional patches, namely Parth Shah's latency_nice patch series:
https://lkml.org/lkml/2020/5/7/575
Unfortunately I don't have benchmarks, but I implemented it alongside an MLFQ for classifying latency_nice tasks. After a few modifications, the kernel classifies only user processes with no children, leaving system processes alone. To me, the usability for normal system operation is very promising. If you are able and have time to experiment, take a look at Parth Shah's patch series. Maybe the idea will be useful for the development of TT?
Currently I'm not going to switch to a kernel > 5.13; I don't know why, but for me it behaves strangely on the desktop (a subjective feeling).
I did a general test of my kernels with your solutions backported to 5.13, but as you know, what's good for desktops doesn't always equal raw performance. Mine is cachyb2. I know a kernel compiled with Clang 13 and LTO is always faster than one built with GCC, even with GCC LTO, but the overall picture shows that 5.14 and 5.15 perform weirdly poorly after the SCHED_CORE changes.
https://openbenchmarking.org/result/2111136-IB-SCHEDULER42&export=pdf
Same kernel settings as for the Baby-CPU-Scheduler: CONFIG_HZ_803 and https://github.com/hamadmarri/cacule-cpu-scheduler/blob/master/scripts/apply_suggested_configs.sh
I can share a binary version of my kernel, but I'm ashamed of my code and don't want to publish it at the moment.
Thank you for your interest