nginx / unit

NGINX Unit - universal web app server - a lightweight and versatile open source server that simplifies the application stack by natively executing application code across eight different programming language runtimes.
https://unit.nginx.org
Apache License 2.0

How to manually set the number of router threads #1042

Closed: pbromb closed this issue 1 month ago

pbromb commented 9 months ago

Hello, I am running an application using NGINX Unit in a Kubernetes environment. The unit: router process automatically runs with a number of threads equal to the number of CPU cores on the physical server.

src/nxt_router.c

if (rtcf->threads == 0) {
    rtcf->threads = nxt_ncpu;
}

In the code I also found:

src/nxt_router.c

static nxt_conf_map_t  nxt_router_conf[] = {
    {
        nxt_string("listeners_threads"),
        NXT_CONF_MAP_INT32,
        offsetof(nxt_router_conf_t, threads),
    },
};

but I was not able to set this from the configuration file.

Is it possible to set this value?

ac000 commented 9 months ago

On Sat, 30 Dec 2023 06:46:06 -0800 Paweł B @.***> wrote:

> Is it possible to set this value?

Looks like it isn't currently hooked up to the configuration system...

juliantaylor commented 1 month ago

There needs to be a way to configure the number of threads, in particular because Unit's spinlocks call sched_yield and never sleep, which gives extremely bad performance when the CPU is overcommitted.

Running Unit on a host with 64 cores (including hyperthreads) under a CPU limit of 2 will, under load, cause the system to spend 90% or more of its time rescheduling threads that are waiting for the lock, while the thread actually holding the lock has only a very small chance of running and releasing it.

A sample profile of an application under this load looks like this:

-   92.81%     0.00%  unitd    [unknown]    [k] 0x0000000000000004
   - 0x4
      - 91.42% 0x7f844e03f527
         - 55.42% entry_SYSCALL_64_after_hwframe
            - do_syscall_64
               - 39.63% __x64_sys_sched_yield
                  - 34.50% schedule
                     - 34.11% __schedule
                        - 23.19% pick_next_task_fair

Pretty much all of the CPU usage ends up in the userspace part of pthread_yield and in the kernel scheduler.

In general, using spinlocks from userspace is very dangerous: userspace cannot portably determine whether the lock holder is currently running and go to sleep if it is not. A preempted lock holder can therefore be kept from releasing the lock by the threads spinning on it, and you also cannot give the lock holder priority in the rescheduling triggered by sched_yield.

The discrepancy between the CPU available to the process group and the number of threads that can contend for the locks, which is common on Kubernetes/cfs_quota-based systems, amplifies these problems massively, and having no option to reduce the thread count is a serious problem.

An environment variable that allows configuring the thread count would be a good solution.
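For illustration, a minimal sketch of the idea; the variable name NXT_ROUTER_THREADS and the helper below are hypothetical, not an existing Unit feature:

/* Hypothetical sketch: NXT_ROUTER_THREADS is an assumed name, not an
 * existing Unit option.  The idea is simply to let an environment
 * variable override the "one router thread per detected CPU" default. */
#include <stdlib.h>

static int
router_threads_default(int ncpu)
{
    const char  *s;
    int          n;

    s = getenv("NXT_ROUTER_THREADS");      /* assumed name */
    if (s != NULL) {
        n = atoi(s);
        if (n > 0) {
            return n;
        }
    }

    return ncpu;                           /* current behaviour: nxt_ncpu */
}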

ac000 commented 1 month ago

I hear you regarding the use of spinlocks.

We had an internal discussion a while back about the use of sched_yield(2), which, at the very least, seems like it should only be used in conjunction with real-time scheduling.

It would be interesting to see what effect this patch has, whether it's good, bad or indifferent (yes, it may likely fubar things)...

diff --git ./src/nxt_spinlock.c ./src/nxt_spinlock.c
index 940be724..78636fc2 100644
--- ./src/nxt_spinlock.c
+++ ./src/nxt_spinlock.c
@@ -82,8 +82,6 @@ nxt_thread_spin_lock(nxt_thread_spinlock_t *lock)
                 goto again;
             }
         }
-
-        nxt_thread_yield();
     }
 }

Then there is this code:

/* It should be adjusted with the "spinlock_count" directive. */
static nxt_uint_t  nxt_spinlock_count = 1000;

void
nxt_thread_spin_init(nxt_uint_t ncpu, nxt_uint_t count)
{
    switch (ncpu) {

    case 0:
        /* Explicit spinlock count. */
        nxt_spinlock_count = count;
        break;

    case 1:
        /* Spinning is useless on UP. */
        nxt_spinlock_count = 0;
        break;

    default:
        /*
         * SMP.
         *
         * TODO: The count should be 10 on a virtualized system
         * since virtualized CPUs may share the same physical CPU.
         */
        nxt_spinlock_count = 1000;
        break;
    }
}

in src/nxt_spinlock.c

(I'm not sure in what cases ncpu would be 0)

This function is called once from Unit with the arguments (nr_cpus, 0).

It may be worth trying to lower (I guess) nxt_spinlock_count (in the default case of the switch statement)...

juliantaylor commented 1 month ago

Decreasing nxt_spinlock_count would likely have a detrimental effect, as it would cause sched_yield to be called more often, which triggers more kernel rescheduling; that is very costly, especially on larger systems.

The yielding approach is somewhat outdated. Its idea is that the spinning thread gives up its timeslice and then, hopefully, the thread holding the lock runs and releases it. This works out when the machine is not very loaded and yielding this one waiter has a high chance of scheduling the holder. But it is terrible when there are multiple spinning waiters and one holder, as is the case here: with dozens or hundreds of waiters and one holder, unless there are also dozens of free CPU cores (or allowed quota, as in the case of e.g. Kubernetes pods), the chance that the holder actually gets scheduled is very low (roughly 1 / number of spinning waiters).

What spinlocks need to do is go to sleep when the holder is not currently running, so that the spinning waiters do not prevent the lock holder from running and freeing the lock, but that is not trivial to do from userspace.

There was a solution to this proposed for Linux using restartable sequences not very long ago, and I think it was even merged; see https://lwn.net/Articles/944895/. It would, however, require a very recent kernel and is of course not portable.
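For reference (this is not the rseq-based scheme from the LWN article above, and not Unit code), the conventional middle ground on Linux is a lock that spins briefly and then sleeps in the kernel on a futex, so waiters stop burning CPU when the holder is preempted for a long time. A simplified, Linux-only sketch along the lines of the well-known futex mutex:

/* Spin-then-sleep lock sketch using futex(2); Linux-only, simplified.
 * Lock states: 0 = free, 1 = locked, 2 = locked with possible waiters. */
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SPIN_LIMIT  100

static long
futex_op(atomic_int *addr, int op, int val)
{
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

void
lock_acquire(atomic_int *lock)
{
    for (int i = 0; i < SPIN_LIMIT; i++) {
        int  expected = 0;

        if (atomic_compare_exchange_weak(lock, &expected, 1)) {
            return;                             /* acquired while spinning */
        }
    }

    /* Contended: mark the lock as "has waiters" and sleep until woken. */
    while (atomic_exchange(lock, 2) != 0) {
        futex_op(lock, FUTEX_WAIT_PRIVATE, 2);
    }
}

void
lock_release(atomic_int *lock)
{
    if (atomic_exchange(lock, 0) == 2) {
        futex_op(lock, FUTEX_WAKE_PRIVATE, 1);  /* wake one sleeping waiter */
    }
}

Whether something like this would help Unit is an open question; it trades spinning for syscalls and wakeups and still cannot tell whether the holder is actually running.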

ac000 commented 1 month ago

I tested our spinlock implementation in an (albeit simple) test program where a number of threads try to increment a counter.

By decreasing nxt_spinlock_count to 10 I see a reduction in runtime of ~50-75%, so it may be worth a shot...
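For reference, a rough stand-in for that kind of test: it uses a plain POSIX pthread spinlock rather than Unit's nxt_thread_spin_lock, so it only illustrates the shape of the benchmark, not Unit's exact behaviour. Build with cc -O2 -pthread and time it with different thread counts and CPU limits.

/* N threads increment a shared counter under a spinlock. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS    8
#define ITERATIONS  1000000

static pthread_spinlock_t  lock;
static long                counter;

static void *
worker(void *arg)
{
    (void) arg;

    for (int i = 0; i < ITERATIONS; i++) {
        pthread_spin_lock(&lock);
        counter++;
        pthread_spin_unlock(&lock);
    }

    return NULL;
}

int
main(void)
{
    pthread_t  tids[NTHREADS];

    pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);

    for (int i = 0; i < NTHREADS; i++) {
        pthread_create(&tids[i], NULL, worker, NULL);
    }

    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(tids[i], NULL);
    }

    printf("counter = %ld\n", counter);   /* expect NTHREADS * ITERATIONS */

    return 0;
}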

ac000 commented 1 month ago

Just to clarify:

> Running Unit on a host with 64 cores (including hyperthreads) under a CPU limit of 2 will, under load, cause the system to [...]

What is meant by 'cpu limit of 2'?

How many router threads were there?

Is this bare metal or a VM?

Was this with an application or just static file serving?

Is the above profile for a particular lock or an amalgamation of all the spin-locks?

juliantaylor commented 1 month ago

A containerized workload. The host has 64-256 cores (including hyperthreads), the application has access to two of them (via cgroups CFS quotas), and there are as many router threads as the host has cores, since this is not configurable.

The profile is just a snapshot of what was running at the time; it could be any of the locks.

ac000 commented 1 month ago

Ah, OK, so the system may have 64 'cpus', but you're limiting Unit to only use two of those, and _SC_NPROCESSORS_ONLN is not affected by things like cpusets and sched_setaffinity(2).

On Linux at least (need to check other systems) we could use sched_getaffinity(2) instead to get the number of 'cpus' we are allowed to run on.
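For reference, counting the CPUs in the calling process's affinity mask on Linux looks roughly like this:

/* Count the CPUs this process is allowed to run on (Linux, glibc). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int
main(void)
{
    cpu_set_t  set;

    CPU_ZERO(&set);

    if (sched_getaffinity(0, sizeof(set), &set) == -1) {
        perror("sched_getaffinity");
        return 1;
    }

    printf("allowed cpus: %d\n", CPU_COUNT(&set));

    return 0;
}

Unlike sysconf(_SC_NPROCESSORS_ONLN), this respects cpusets, taskset and Docker's --cpuset-cpus=, but it still knows nothing about CFS bandwidth quotas.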

juliantaylor commented 1 month ago

Affinity is only one of the options; more commonly, CPU bandwidth is limited using cgroups instead of pinning to cores.

There are two versions. cgroup v2 uses /sys/fs/cgroup/cpu.max, which contains the timeslice the process may run and the period the timeslice applies to; the available CPU is the timeslice divided by the period (the result can be fractional). See cpu.max in https://docs.kernel.org/admin-guide/cgroup-v2.html.

The older variant, cgroup v1, uses two pseudofiles, cfs_quota_us and cfs_period_us, in the cgroup filesystem.
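A minimal sketch of that computation for cgroup v2, assuming the controller file sits directly at /sys/fs/cgroup/cpu.max (a real implementation would first resolve the process's own cgroup path via /proc/self/cgroup):

/* Read the cgroup v2 CPU limit as a fraction (quota / period).
 * Returns a negative value when no limit is set ("max") or the file
 * cannot be read. */
#include <stdio.h>

double
cgroup_v2_cpu_limit(void)
{
    FILE    *f;
    char     quota[32];
    long     q, period;
    double   limit = -1.0;

    f = fopen("/sys/fs/cgroup/cpu.max", "r");
    if (f == NULL) {
        return -1.0;
    }

    /* Format: "<quota|max> <period>", e.g. "200000 100000" means 2 CPUs. */
    if (fscanf(f, "%31s %ld", quota, &period) == 2
        && sscanf(quota, "%ld", &q) == 1
        && period > 0)
    {
        limit = (double) q / period;       /* can be fractional, e.g. 0.5 */
    }

    fclose(f);
    return limit;
}

Rounding that fraction up would give a reasonable upper bound for the number of router threads on quota-limited systems.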

ac000 commented 1 month ago

Yes, but trying to determine the number of threads to run based on things like CPU quota is a somewhat different problem from simply matching the number of threads to the number of available CPUs.

Maybe at some point it makes sense to make it configurable, but at the very least we should try to get it basically right by default...

juliantaylor commented 1 month ago

It's the same problem: in particular, with your choice of lock and implementation you must never start more threads than can actually run, and the method by which your application is limited does not matter. Bandwidth limiting via cgroups is extremely common; supporting it will give you the most coverage on Linux systems.

Making it configurable is preferable though, as it allows users to override the bad defaults when the environment is not properly detected; it's also about a three-line patch if you use an environment variable.

ac000 commented 1 month ago

> It's the same problem: in particular, with your choice of lock and implementation you must never start more threads than [...]

Yes, this is a whole bigger discussion.

> [...] can actually run, and the method by which your application is limited does not matter. Bandwidth limiting via cgroups is extremely common; supporting it will give you the most coverage on Linux systems.

I have a patch to make use of sched_getaffinity(2) on Linux. This will help if Unit is being run under cpusets (see cpuset(7)) or started via something like taskset(1).

It will also help with things like Docker where the --cpuset-cpus= option has been used.

That's a simple easy win.

Determining the number of threads to create based on CPU quotas is a slightly different problem that will take longer to develop and test (if it's something we even wish to do...).

> Making it configurable is preferable though, as it allows users to override the bad defaults when the environment is not properly detected; it's also about a three-line patch if you use an environment variable.

Hooking it up to the config system is relatively simple and we may do that, but we should first try to get things right by default.

I'll bring the question of making it configurable up with the wider team...

ac000 commented 1 month ago

Here's a patch to enable configuring the number of router threads...

diff --git ./src/nxt_conf_validation.c ./src/nxt_conf_validation.c
index 04091745..86199f84 100644
--- ./src/nxt_conf_validation.c
+++ ./src/nxt_conf_validation.c
@@ -240,6 +240,7 @@ static nxt_int_t nxt_conf_vldt_js_module_element(nxt_conf_validation_t *vldt,

 static nxt_conf_vldt_object_t  nxt_conf_vldt_setting_members[];
 static nxt_conf_vldt_object_t  nxt_conf_vldt_http_members[];
+static nxt_conf_vldt_object_t  nxt_conf_vldt_router_members[];
 static nxt_conf_vldt_object_t  nxt_conf_vldt_websocket_members[];
 static nxt_conf_vldt_object_t  nxt_conf_vldt_static_members[];
 static nxt_conf_vldt_object_t  nxt_conf_vldt_forwarded_members[];
@@ -309,6 +310,11 @@ static nxt_conf_vldt_object_t  nxt_conf_vldt_setting_members[] = {
         .type       = NXT_CONF_VLDT_OBJECT,
         .validator  = nxt_conf_vldt_object,
         .u.members  = nxt_conf_vldt_http_members,
+    }, {
+        .name       = nxt_string("router"),
+        .type       = NXT_CONF_VLDT_OBJECT,
+        .validator  = nxt_conf_vldt_object,
+        .u.members  = nxt_conf_vldt_router_members,
 #if (NXT_HAVE_NJS)
     }, {
         .name       = nxt_string("js_module"),
@@ -377,6 +383,16 @@ static nxt_conf_vldt_object_t  nxt_conf_vldt_http_members[] = {
 };

+static nxt_conf_vldt_object_t  nxt_conf_vldt_router_members[] = {
+    {
+        .name       = nxt_string("threads"),
+        .type       = NXT_CONF_VLDT_INTEGER,
+    },
+
+    NXT_CONF_VLDT_END
+};
+
+
 static nxt_conf_vldt_object_t  nxt_conf_vldt_websocket_members[] = {
     {
         .name       = nxt_string("read_timeout"),
diff --git ./src/nxt_router.c ./src/nxt_router.c
index 43209451..2676402b 100644
--- ./src/nxt_router.c
+++ ./src/nxt_router.c
@@ -1412,7 +1412,7 @@ nxt_router_conf_send(nxt_task_t *task, nxt_router_temp_conf_t *tmcf,

 static nxt_conf_map_t  nxt_router_conf[] = {
     {
-        nxt_string("listeners_threads"),
+        nxt_string("threads"),
         NXT_CONF_MAP_INT32,
         offsetof(nxt_router_conf_t, threads),
     },
@@ -1630,7 +1630,7 @@ nxt_router_conf_create(nxt_task_t *task, nxt_router_temp_conf_t *tmcf,
     nxt_conf_value_t            *js_module;
 #endif
     nxt_conf_value_t            *root, *conf, *http, *value, *websocket;
-    nxt_conf_value_t            *applications, *application;
+    nxt_conf_value_t            *applications, *application, *router_conf;
     nxt_conf_value_t            *listeners, *listener;
     nxt_socket_conf_t           *skcf;
     nxt_router_conf_t           *rtcf;
@@ -1640,6 +1640,7 @@ nxt_router_conf_create(nxt_task_t *task, nxt_router_temp_conf_t *tmcf,
     nxt_router_app_conf_t       apcf;
     nxt_router_listener_conf_t  lscf;

+    static const nxt_str_t  router_path = nxt_string("/settings/router");
     static const nxt_str_t  http_path = nxt_string("/settings/http");
     static const nxt_str_t  applications_path = nxt_string("/applications");
     static const nxt_str_t  listeners_path = nxt_string("/listeners");
@@ -1673,11 +1674,14 @@ nxt_router_conf_create(nxt_task_t *task, nxt_router_temp_conf_t *tmcf,
     rtcf = tmcf->router_conf;
     mp = rtcf->mem_pool;

-    ret = nxt_conf_map_object(mp, root, nxt_router_conf,
-                              nxt_nitems(nxt_router_conf), rtcf);
-    if (ret != NXT_OK) {
-        nxt_alert(task, "root map error");
-        return NXT_ERROR;
+    router_conf = nxt_conf_get_path(root, &router_path);
+    if (router_conf != NULL) {
+        ret = nxt_conf_map_object(mp, router_conf, nxt_router_conf,
+                                  nxt_nitems(nxt_router_conf), rtcf);
+        if (ret != NXT_OK) {
+            nxt_alert(task, "router_conf map error");
+            return NXT_ERROR;
+        }
     }

     if (rtcf->threads == 0) {

Configure it like this:

{
    "listeners": {},

    "settings": {
        "router": {
            "threads": 2
        }
    }
}

Hopefully something like this can get in the next release of Unit...