libuv / libuv

Cross-platform asynchronous I/O
https://libuv.org/
MIT License

Bind named using 100% cpu with libuv 1.46.0 #4089

Closed KIC-8462852 closed 1 year ago

KIC-8462852 commented 1 year ago

Version: 1.46.0 Platform: Linux 5.10.186 #1 SMP Sat Jul 8 18:25:56 WEST 2023 i686 GNU/Linux

Hi, I had BIND 9.18.16 compiled with libuv 1.44.2 working without any issues. After compiling and installing libuv 1.46.0, the named process was continuously using 100% CPU. Reverting to version 1.44.2 restored normal behavior. I reproduced this twice with the same results, so the problem doesn't seem to be with BIND itself. Thx.

bnoordhuis commented 1 year ago

Can you check why/where it's using so much CPU time? perf top should give a clue.

FWIW, the 5.x kernels have had several io_uring resource consumption bugs so it's quite possible a kernel upgrade makes the problem go away.

edit: you can test by setting UV_USE_IO_URING=0 in the environment
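
For anyone embedding libuv directly rather than going through named or node, here is a minimal C sketch of the same opt-out (an illustration, assuming the variable only needs to be present in the process environment before libuv first reaches for io_uring):

```c
#include <stdlib.h>
#include <uv.h>

int main(void) {
  /* Same effect as launching the process with UV_USE_IO_URING=0 from the
   * shell; assumes the variable is in place before libuv first consults it. */
  setenv("UV_USE_IO_URING", "0", 1 /* overwrite */);

  uv_loop_t loop;
  uv_loop_init(&loop);

  /* ... queue fs/network requests as usual ... */

  uv_run(&loop, UV_RUN_DEFAULT);
  uv_loop_close(&loop);
  return 0;
}
```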

KIC-8462852 commented 1 year ago

I think you are right about the kernel. With kernel 5.10.170 I can reproduce the problem, but not with 5.10.186... there are several io_uring patches in between. By the way, here is the perf top output when CPU is at 100%:

    72.76%  [kernel]       [k] io_sq_thread
    11.61%  [kernel]       [k] _cond_resched
     8.76%  [kernel]       [k] rcu_all_qs
     5.82%  [kernel]       [k] __x86_return_thunk

Thx.

bnoordhuis commented 1 year ago

Thanks for testing. I've opened #4093 to blacklist pre-5.10.186 kernels.

bienzaaron commented 8 months ago

I'm seeing what I believe is the same issue during load testing of a Node.js-based service on v20.11.0 (so libuv 1.47.0, I believe), but on a slightly newer kernel version (5.10.201).

Running the service with UV_USE_IO_URING=0 drops CPU usage in our load tests back to where we expect it (see the before/after screenshots attached in the original issue).

Wondering if the version check here needs to be tweaked: https://github.com/libuv/libuv/blob/3b6a1a14caeeeaf5510f2939a8e28ed9ba0ad968/src/unix/linux.c#L477-L495
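
For context, here is a rough sketch of the kind of kernel-version gate that check implements (hypothetical code, not the actual linux.c source): derive a comparable integer from uname(3) and refuse io_uring on a known-bad range, like the pre-5.10.186 blacklist mentioned earlier.

```c
#include <stdio.h>
#include <sys/utsname.h>

/* Hypothetical version gate, in the spirit of the linked linux.c check.
 * The release string is collapsed into one integer
 * (major*65536 + minor*256 + patch) so ranges can be compared. */
unsigned int kernel_version(void) {
  struct utsname u;
  unsigned int major, minor, patch;

  if (uname(&u) == -1)
    return 0;

  /* "5.10.201-191.748.amzn2.x86_64" parses as 5, 10, 201. */
  if (sscanf(u.release, "%u.%u.%u", &major, &minor, &patch) != 3)
    return 0;

  return major * 65536 + minor * 256 + patch;
}

int io_uring_allowed(void) {
  unsigned int v = kernel_version();

  /* Example policy matching the earlier blacklist: refuse io_uring on
   * 5.10.x kernels older than 5.10.186. */
  if (v >= 5 * 65536 + 10 * 256 && v < 5 * 65536 + 10 * 256 + 186)
    return 0;

  return v != 0;
}

int main(void) {
  printf("io_uring allowed: %d\n", io_uring_allowed());
  return 0;
}
```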

This service is running on AWS Fargate, so I can't easily upgrade kernel versions or anything like that - planning on reporting the same issue to AWS.

uname -r:

5.10.201-191.748.amzn2.x86_64

I can try grabbing a CPU profile if desired.

bienzaaron commented 8 months ago

Changelogs mentioning io_uring between 5.10.186 and 5.10.201:

- https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.10.188
- https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.10.190
- https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.10.195

Changelogs after 5.10.201 mentioning io_uring:

- https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.10.202
- https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.10.203
- https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.10.204

I can test these to narrow down which release may have re-introduced the issue.

bnoordhuis commented 8 months ago

At a quick glance none of those 5.10.2xx releases contain relevant bug fixes. I guess we'd have to blacklist all 5.10.x kernels if we can't distinguish between good and bad kernels.

@santigimeno WDYT?

santigimeno commented 7 months ago

@bienzaaron how easy is it to reproduce the issue? Do you have some code you can share? If possible I'd love to take a look before blacklisting more kernels.

bienzaaron commented 7 months ago

The following reproduces it for me on the latest available Amazon Linux 2 kernel (5.10.205-195.804.amzn2.x86_64):

    const fs = require('node:fs');

    function append() {
      fs.appendFile('log.txt', 'hello'.repeat(5000), () => {});
      setTimeout(append, 10);
    }
    append();

Running with UV_USE_IO_URING=0 gives minimal CPU usage, as reported by top:

top - 16:42:57 up 44 min,  1 user,  load average: 1.66, 0.98, 0.67
Tasks: 184 total,   1 running, 130 sleeping,   0 stopped,   1 zombie
%Cpu(s):  2.7 us,  3.8 sy,  0.2 ni, 93.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.2 st
KiB Mem :  3787712 total,   265912 free,   746704 used,  2775096 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  2754936 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 1252 root      20   0  133668  74004  73428 S   3.3  2.0   0:52.11 systemd-journal
22614 ec2-user  20   0 1034648  48448  38832 S   2.3  1.3   0:00.26 node
 3863 root      20   0  810308  51436  49104 S   1.7  1.4   0:20.58 rsyslogd
 3893 root      20   0  830692  16624   6228 S   1.0  0.4   0:18.41 nxlog

and perf top:

   6.75%  perf                      [.] __symbols__insert
   3.37%  perf                      [.] rb_next
   2.75%  libc-2.26.so              [.] __close
   2.15%  [kernel]                  [k] syscall_enter_from_user_mode
   2.00%  [kernel]                  [k] finish_task_switch
   1.74%  [kernel]                  [k] audit_filter_syscall.constprop.0.isra.0
   1.60%  [kernel]                  [k] cshook_systemcalltable_pre_compat_sys_ioctl
   1.48%  [kernel]                  [k] do_user_addr_fault
   1.48%  libc-2.26.so              [.] _int_malloc

Without the env var set (I am running this containerized with CPU throttled to half a core, which is why it tops out around 50%), CPU usage jumps significantly in both top and perf top:

top - 16:43:30 up 44 min,  1 user,  load average: 1.42, 0.98, 0.68
Tasks: 182 total,   1 running, 133 sleeping,   0 stopped,   1 zombie
%Cpu(s):  1.3 us, 25.0 sy,  0.0 ni, 73.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  3787712 total,   260188 free,   751752 used,  2775772 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  2749948 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
22848 ec2-user  20   0 1034928  49656  38900 S  50.2  1.3   0:07.04 node
   12 root      20   0       0      0      0 I   0.3  0.0   0:00.89 rcu_sched
 1252 root      20   0  141860  77856  77280 S   0.3  2.1   0:52.56 systemd-journal

and perf top:

   31.86%  [kernel]                  [k] io_sq_thread
   30.72%  [kernel]                  [k] __io_sq_thread
   7.92%  [kernel]                  [k] _cond_resched
   4.15%  [kernel]                  [k] rcu_all_qs
   0.84%  [kernel]                  [k] finish_task_switch
   0.73%  [kernel]                  [k] _raw_spin_unlock_irqrestore
   0.73%  perf                      [.] __symbols__insert
   0.58%  [kernel]                  [k] do_user_addr_fault
   0.49%  [kernel]                  [k] audit_filter_syscall.constprop.0.isra.0
   0.39%  [kernel]                  [k] syscall_enter_from_user_mode
   0.39%  perf                      [.] rb_next
   0.29%  libc-2.26.so              [.] _int_malloc
Server information

Running on an AWS c5.large instance with the latest Amazon Linux 2. The node process is running inside a docker container limited to .5 cpus and 1024M memory via `docker run` CLI flags. I can't really share the docker image/build because my company makes us use a private base docker image, but it's a pretty basic Ubuntu 22.04 setup with Node 20 installed.

kernel version:

```
5.10.205-195.804.amzn2.x86_64
```

cpu info:

```
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                2
On-line CPU(s) list:   0,1
Thread(s) per core:    2
Core(s) per socket:    1
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
Stepping:              7
CPU MHz:               3611.363
BogoMIPS:              5999.98
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              36608K
NUMA node0 CPU(s):     0,1
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
```

memory info:

```
dmidecode -t memory
# dmidecode 3.2
Getting SMBIOS data from sysfs.
SMBIOS 2.7 present.

Handle 0x0008, DMI type 16, 23 bytes
Physical Memory Array
        Location: System Board Or Motherboard
        Use: System Memory
        Error Correction Type: Unknown
        Maximum Capacity: 4 GB
        Error Information Handle: Not Provided
        Number Of Devices: 1

Handle 0x0009, DMI type 17, 34 bytes
Memory Device
        Array Handle: 0x0008
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 4096 MB
        Form Factor: DIMM
        Set: None
        Locator: Not Specified
        Bank Locator: Not Specified
        Type: DDR4
        Type Detail: Static Column Pseudo-static Synchronous Window DRAM
        Speed: 2933 MT/s
        Manufacturer: Not Specified
        Serial Number: Not Specified
        Asset Tag: Not Specified
        Part Number: Not Specified
        Rank: Unknown
        Configured Memory Speed: Unknown
```

I hope this is enough info to work with. Sorry I can't share more of the container setup details.

santigimeno commented 6 months ago

So I've been looking at this a bit, and I'm not so sure it's an io_uring bug; afaict, it doesn't depend on the kernel version either. My interpretation is that the posted code sends requests to the SQPOLL ring continuously, so the SQ kernel thread is never idle, which causes it to use 100% of the CPU it is attached to. This is not bad per se, but it can be very problematic when few CPUs are available.
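
For reference, SQPOLL is requested through io_uring_params: the kernel SQ thread busy-polls for new submissions and only goes to sleep after sq_thread_idle milliseconds without any, which is why a continuous stream of requests keeps it pinned. A minimal liburing sketch with illustrative values (not libuv's actual setup code):

```c
#include <liburing.h>
#include <stdio.h>
#include <string.h>

int main(void) {
  struct io_uring ring;
  struct io_uring_params params = { 0 };
  int ret;

  /* Ask the kernel for a submission-queue polling thread. The thread spins
   * looking for new sqes and only goes to sleep after sq_thread_idle ms with
   * nothing submitted, so a steady stream of requests keeps it at 100% of
   * whichever CPU it runs on. */
  params.flags = IORING_SETUP_SQPOLL;
  params.sq_thread_idle = 10;  /* milliseconds; illustrative value */

  ret = io_uring_queue_init_params(8, &ring, &params);
  if (ret < 0) {
    fprintf(stderr, "io_uring_queue_init_params: %s\n", strerror(-ret));
    return 1;
  }

  /* ... queue and submit sqes as in the full test program further below ... */

  io_uring_queue_exit(&ring);
  return 0;
}
```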

To demonstrate the issue I ran the following code, which is very similar to the previous one, in a docker container with CPUs limited to 0.5, 1, 2, 4 and 8:

    const fs = require('node:fs');

    let times = 0;
    const MAX = 1e5;

    function write() {
      ++times;
      fs.writeFile('log.txt', 'hello'.repeat(5000), () => { if (times < MAX) write(); });
    }
    write();

The results are quite telling:

[Charts attached in the original issue: results with io_uring enabled vs. disabled at each CPU limit.]

As can be seen, with 0.5 and 1 CPUs the results are bad with io_uring enabled, as the SQPOLL thread is probably using most of them. As the available CPUs increase, performance improves as well.

So it seems that under heavy I/O, using SQPOLL increases CPU usage, which can be problematic in environments with limited resources.

As an aside, I've been investigating using io_uring with and without SQPOLL to see how it behaves depending on the number of CPUs.

I created the following program using liburing, which does something similar to the JS code posted above (though it writes more data):

liburing_test.c

```c
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>
#include <liburing.h>

#define FILENAME "log.txt"
#define BUFSIZE 1024
#define MAX_WRITES 10000000

int main() {
  struct io_uring ring;
  struct io_uring_params params = { 0 };
  struct io_uring_cqe* cqe;
  struct io_uring_sqe* sqe;

  // Uncomment the following lines to use SQPOLL.
  // params.sq_thread_idle = 10;
  // params.flags = IORING_SETUP_SQPOLL;

  if (io_uring_queue_init_params(8, &ring, &params) < 0) {
    perror("io_uring_queue_init_params");
    return 1;
  }

  // Create a submission queue entry for open operation
  sqe = io_uring_get_sqe(&ring);
  int open_flags = O_CREAT | O_RDWR;
  int open_mode = 0666;
  io_uring_prep_openat(sqe, AT_FDCWD, FILENAME, open_flags, open_mode);

  // Submit the open request
  if (io_uring_submit(&ring) < 0) {
    perror("io_uring_submit open");
    return 1;
  }

  // Wait for the completion of the open request
  if (io_uring_wait_cqe(&ring, &cqe) < 0) {
    perror("io_uring_wait_cqe open");
    return 1;
  }

  // Process the completion
  if (cqe->res < 0) {
    fprintf(stderr, "open res: %d\n", cqe->res);
    return 1;
  }

  // Get the file descriptor from the completion result
  int file_fd = cqe->res;

  // File is now open, write data using another io_uring operation
  char buffer[BUFSIZE];
  memset(buffer, 'a', BUFSIZE);

  // Create an epoll instance
  int epoll_fd = epoll_create1(0);
  if (epoll_fd == -1) {
    perror("epoll_create1");
    return 1;
  }

  // Add the ring fd to the epoll instance
  struct epoll_event event;
  event.events = EPOLLIN;
  event.data.fd = ring.ring_fd;
  if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, ring.ring_fd, &event) == -1) {
    perror("epoll_ctl");
    return 1;
  }

  // Create a submission queue entry for write operation
  sqe = io_uring_get_sqe(&ring);
  io_uring_prep_write(sqe, file_fd, buffer, BUFSIZE, 0);

  // Submit the write request
  if (io_uring_submit(&ring) < 0) {
    perror("io_uring_submit write");
    return 1;
  }

  // Event loop
  int i = 0;
  struct epoll_event events[1024];
  while (true && i < MAX_WRITES) {
    // Wait for the completion of the write request
    int n = epoll_pwait(epoll_fd, events, 1024, -1, NULL);
    if (n == -1) {
      if (errno == EINTR) {
        continue;
      }
      perror("epoll_pwait");
      return 1;
    }

    // Process the completion
    struct io_uring_cqe *cqe;
    int ret = io_uring_peek_cqe(&ring, &cqe);
    if (ret == 0) {
      if (cqe->res < 0) {
        perror("io_uring write");
        return 1;
      }
      // Clean up
      io_uring_cqe_seen(&ring, cqe);
      i++;

      // Create a submission queue entry for write operation
      sqe = io_uring_get_sqe(&ring);
      io_uring_prep_write(sqe, file_fd, buffer, BUFSIZE, 0);

      // Submit the write request
      if (io_uring_submit(&ring) < 0) {
        perror("io_uring_submit write");
        return 1;
      }
    }
  }

  // Exit the queue after all writes are done
  io_uring_queue_exit(&ring);
  close(file_fd);

  return 0;
}
```

These are the results I've observed with 0.5, 1, 2, 4 and 8 CPUs: [charts attached in the original issue comparing the SQPOLL and non-SQPOLL runs].

It looks like, for this test case, not using SQPOLL behaves better than using it when CPU resources are low; with more CPUs the results are more balanced. Maybe it's worth investigating whether not using SQPOLL is a viable option for libuv, as it might perform better in more constrained environments. @bnoordhuis thoughts?

bnoordhuis commented 6 months ago

"Normal" is io_uring but without sqpoll? My hunch is that the performance advantage vis-a-vis a properly tuned thread pool is likely a wash in that case.

Your observation that sqpoll doesn't work well with low cpu counts looks legit. Threads fighting for cpu time probably also happens with higher cpu counts when you create an event loop per core.

Suggestions on a way forward? I'm undecided.

santigimeno commented 6 months ago

"Normal" is io_uring but without sqpoll?

Yes

> My hunch is that the performance advantage vis-a-vis a properly tuned thread pool is likely a wash in that case.

Agreed. Adding to that, I locally modified libuv to not use SQPOLL while not batching sqe's, and the results are clear: with little concurrency io_uring is much better, but as soon as concurrency increases, as expected, it's not even close; using the threadpool is much better. See the results below:

fs_stat benchmark non-sqpoll iouring

```
$ build/uv_run_benchmarks_a fs_stat
ok 1 - fs_stat
# 1,000,000 stats (sync): 0.92s (1,087,063/s)
# 100,000 stats (1 concurrent): 0.51s (197,501/s)
# 100,000 stats (2 concurrent): 0.28s (363,485/s)
# 100,000 stats (3 concurrent): 0.17s (573,670/s)
# 100,000 stats (4 concurrent): 0.15s (647,641/s)
# 100,000 stats (5 concurrent): 0.16s (623,050/s)
# 100,000 stats (6 concurrent): 0.16s (641,828/s)
# 100,000 stats (7 concurrent): 0.15s (648,534/s)
# 100,000 stats (8 concurrent): 0.15s (673,435/s)
# 100,000 stats (9 concurrent): 0.15s (663,728/s)
# 100,000 stats (10 concurrent): 0.15s (650,181/s)
# 100,000 stats (11 concurrent): 0.16s (635,476/s)
# 100,000 stats (12 concurrent): 0.16s (618,628/s)
# 100,000 stats (13 concurrent): 0.16s (610,862/s)
# 100,000 stats (14 concurrent): 0.16s (644,689/s)
# 100,000 stats (15 concurrent): 0.15s (670,873/s)
# 100,000 stats (16 concurrent): 0.16s (638,868/s)
# 100,000 stats (17 concurrent): 0.18s (551,331/s)
# 100,000 stats (18 concurrent): 0.17s (575,773/s)
# 100,000 stats (19 concurrent): 0.15s (656,323/s)
# 100,000 stats (20 concurrent): 0.15s (683,871/s)
# 100,000 stats (21 concurrent): 0.15s (677,880/s)
# 100,000 stats (22 concurrent): 0.17s (594,371/s)
# 100,000 stats (23 concurrent): 0.18s (547,173/s)
# 100,000 stats (24 concurrent): 0.18s (559,215/s)
# 100,000 stats (25 concurrent): 0.18s (548,608/s)
# 100,000 stats (26 concurrent): 0.18s (570,232/s)
# 100,000 stats (27 concurrent): 0.18s (553,067/s)
# 100,000 stats (28 concurrent): 0.14s (692,054/s)
# 100,000 stats (29 concurrent): 0.18s (545,693/s)
# 100,000 stats (30 concurrent): 0.18s (552,516/s)
# 100,000 stats (31 concurrent): 0.19s (533,669/s)
# 100,000 stats (32 concurrent): 0.18s (550,647/s)
```

fs_stat benchmark threadpool

```
$ UV_USE_IO_URING=0 build/uv_run_benchmarks_a fs_stat
ok 1 - fs_stat
# 1,000,000 stats (sync): 0.92s (1,085,412/s)
# 100,000 stats (1 concurrent): 0.88s (113,333/s)
# 100,000 stats (2 concurrent): 0.47s (214,738/s)
# 100,000 stats (3 concurrent): 0.38s (262,202/s)
# 100,000 stats (4 concurrent): 0.26s (390,541/s)
# 100,000 stats (5 concurrent): 0.23s (428,278/s)
# 100,000 stats (6 concurrent): 0.20s (497,530/s)
# 100,000 stats (7 concurrent): 0.18s (554,535/s)
# 100,000 stats (8 concurrent): 0.17s (581,954/s)
# 100,000 stats (9 concurrent): 0.16s (619,875/s)
# 100,000 stats (10 concurrent): 0.15s (646,310/s)
# 100,000 stats (11 concurrent): 0.15s (675,292/s)
# 100,000 stats (12 concurrent): 0.15s (679,115/s)
# 100,000 stats (13 concurrent): 0.14s (695,454/s)
# 100,000 stats (14 concurrent): 0.14s (737,272/s)
# 100,000 stats (15 concurrent): 0.13s (761,384/s)
# 100,000 stats (16 concurrent): 0.15s (646,959/s)
# 100,000 stats (17 concurrent): 0.16s (633,597/s)
# 100,000 stats (18 concurrent): 0.12s (839,909/s)
# 100,000 stats (19 concurrent): 0.11s (880,327/s)
# 100,000 stats (20 concurrent): 0.12s (842,439/s)
# 100,000 stats (21 concurrent): 0.12s (848,435/s)
# 100,000 stats (22 concurrent): 0.11s (873,406/s)
# 100,000 stats (23 concurrent): 0.11s (888,450/s)
# 100,000 stats (24 concurrent): 0.11s (894,715/s)
# 100,000 stats (25 concurrent): 0.11s (896,347/s)
# 100,000 stats (26 concurrent): 0.11s (878,707/s)
# 100,000 stats (27 concurrent): 0.10s (964,384/s)
# 100,000 stats (28 concurrent): 0.11s (885,058/s)
# 100,000 stats (29 concurrent): 0.11s (915,147/s)
# 100,000 stats (30 concurrent): 0.11s (932,465/s)
# 100,000 stats (31 concurrent): 0.10s (970,667/s)
# 100,000 stats (32 concurrent): 0.09s (1,080,835/s)
```

> Your observation that sqpoll doesn't work well with low cpu counts looks legit. Threads fighting for cpu time probably also happens with higher cpu counts when you create an event loop per core.
>
> Suggestions on a way forward? I'm undecided.

Because of the CPU usage issue, I don't think we should default to using SQPOLL, especially considering that CPU-constrained containerized environments are a very typical way to deploy apps: it's no surprise so many reports have come from those kinds of cases. OTOH, using a non-SQPOLL ring without batching sqe's doesn't seem to give us any advantage, as outlined before. Interestingly, we do batch in the ctl ring. So, if batching is out of the question due to latency issues, I don't see much advantage in having io_uring at all (maybe we could keep io_uring with SQPOLL as an opt-in for the cases where it may make sense).

Crazy idea: maybe we could run the io_uring ring with sqe batching in a different thread, thus avoiding the latency issues?

bnoordhuis commented 6 months ago

Ouch, those numbers are even worse than I expected. Okay, so sqpoll-less is off the table.

I'm fine in principle with changing io_uring to opt-in but I expect node users to start filing performance regressions if we do.

> Interestingly we do batch in the ctl ring

Yes. For epoll_ctl that doesn't make an observable difference (save for fewer system calls) as long as mutations are visible to the kernel before we call epoll_pwait.

File operations on the other hand should start as soon as possible. Every microsecond delay is additional latency.
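
To make that trade-off concrete, here is a rough liburing sketch (illustrative helpers, not libuv's actual code): the batched variant saves system calls, but nothing queued reaches the kernel until the final io_uring_submit call, and that delay is the extra latency being described.

```c
#include <liburing.h>

/* Per-operation submit: each write is visible to the kernel immediately, at
 * the cost of one io_uring_submit() call (a system call) per operation. */
void submit_each(struct io_uring *ring, int fd,
                 const char *buf, unsigned int len) {
  struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
  if (sqe == NULL)
    return;  /* submission queue full; a real caller would flush and retry */
  io_uring_prep_write(sqe, fd, buf, len, 0);
  io_uring_submit(ring);
}

/* Batched submit: queue several sqes, then submit once. Fewer system calls,
 * but the first queued write does not reach the kernel until the final
 * io_uring_submit() - that delay is the added latency. */
void submit_batch(struct io_uring *ring, int fd,
                  const char *buf, unsigned int len, int n) {
  for (int i = 0; i < n; i++) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    if (sqe == NULL)
      break;  /* queue full: flush what has been prepared so far */
    io_uring_prep_write(sqe, fd, buf, len, 0);
  }
  io_uring_submit(ring);
}
```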