Can you check why/where it's using so much CPU time? perf top should give a clue.
FWIW, the 5.x kernels have had several io_uring resource consumption bugs so it's quite possible a kernel upgrade makes the problem go away.
edit: you can test by setting UV_USE_IO_URING=0 in the environment
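For example, assuming the service is normally launched as node server.js (server.js being only a placeholder for your real entry point), starting it as UV_USE_IO_URING=0 node server.js is enough to rule io_uring in or out for a run.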
I think you are right about the kernel. With kernel 5.10.170 I can reproduce the problem, but not with 5.10.186... there are several io_uring patches in between. By the way, here is the perf top output when the CPU is at 100%:
72.76% [kernel] [k] io_sq_thread
11.61% [kernel] [k] _cond_resched
8.76% [kernel] [k] rcu_all_qs
5.82% [kernel] [k] __x86_return_thunk
Thx.
Thanks for testing. I've opened #4093 to blacklist pre-5.10.186 kernels.
Seeing what I believe is the same issue during load testing in a node.js-based service with v20.11.0 (so libuv 1.47.0 I believe), but on a slightly newer kernel version (5.10.201).
Running the service with UV_USE_IO_URING=0 drops CPU in our load tests back to where we expect it (see before and after):
Wondering if the version here needs to be tweaked: https://github.com/libuv/libuv/blob/3b6a1a14caeeeaf5510f2939a8e28ed9ba0ad968/src/unix/linux.c#L477-L495
This service is running on AWS Fargate, so I can't easily upgrade kernel versions or anything like that - planning on reporting the same issue to AWS.
uname -r: 5.10.201-191.748.amzn2.x86_64
I can try grabbing a CPU profile if desired.
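Regarding the version question above, here is a toy sketch, not libuv's actual code, of how such a runtime gate can parse the running kernel's release string (the same value uname -r prints) and compare it against a cutoff; the 5.10.186 cutoff below is only the boundary mentioned earlier in this thread, used here as a hypothetical example:

```c
/* Toy sketch only; not libuv's implementation. Parses the kernel release
 * string and compares it against a hypothetical minimum version. */
#include <stdio.h>
#include <sys/utsname.h>

static int kernel_version(void) {
  struct utsname u;
  int major = 0, minor = 0, patch = 0;

  if (uname(&u) != 0)
    return 0;

  /* e.g. "5.10.201-191.748.amzn2.x86_64" -> 5, 10, 201 */
  sscanf(u.release, "%d.%d.%d", &major, &minor, &patch);
  return major * 1000000 + minor * 1000 + patch;
}

int main(void) {
  int cutoff = 5 * 1000000 + 10 * 1000 + 186;  /* hypothetical cutoff: 5.10.186 */
  printf("io_uring %s on this kernel\n",
         kernel_version() >= cutoff ? "would be allowed" : "would be skipped");
  return 0;
}
```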
Changelogs mentioning io_uring between 5.10.186 and 5.10.201:
https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.10.188
https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.10.190
https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.10.195
Changelogs after 5.10.201 mentioning io_uring:
https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.10.202
https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.10.203
https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.10.204
I can try testing these to narrow down the version that may have re-introduced the issue.
At a quick glance none of those 5.10.2xx releases contain relevant bug fixes. I guess we'd have to blacklist all 5.10.x kernels if we can't distinguish between good and bad kernels.
@santigimeno WDYT?
@bienzaaron how easy is it to reproduce the issue? Do you have some code you can share? If possible I'd love to take a look before blacklisting more kernels.
The following reproduces it for me on the latest available Amazon Linux 2 kernel (5.10.205-195.804.amzn2.x86_64):
const fs = require('node:fs');

function append() {
  fs.appendFile('log.txt', 'hello'.repeat(5000), () => {});
  setTimeout(append, 10);
}

append();
Running with UV_USE_IO_URING=0 gives minimal CPU usage as reported by top:
top - 16:42:57 up 44 min, 1 user, load average: 1.66, 0.98, 0.67
Tasks: 184 total, 1 running, 130 sleeping, 0 stopped, 1 zombie
%Cpu(s): 2.7 us, 3.8 sy, 0.2 ni, 93.2 id, 0.0 wa, 0.0 hi, 0.0 si, 0.2 st
KiB Mem : 3787712 total, 265912 free, 746704 used, 2775096 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 2754936 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1252 root 20 0 133668 74004 73428 S 3.3 2.0 0:52.11 systemd-journal
22614 ec2-user 20 0 1034648 48448 38832 S 2.3 1.3 0:00.26 node
3863 root 20 0 810308 51436 49104 S 1.7 1.4 0:20.58 rsyslogd
3893 root 20 0 830692 16624 6228 S 1.0 0.4 0:18.41 nxlog
and perf top:
6.75% perf [.] __symbols__insert
3.37% perf [.] rb_next
2.75% libc-2.26.so [.] __close
2.15% [kernel] [k] syscall_enter_from_user_mode
2.00% [kernel] [k] finish_task_switch
1.74% [kernel] [k] audit_filter_syscall.constprop.0.isra.0
1.60% [kernel] [k] cshook_systemcalltable_pre_compat_sys_ioctl
1.48% [kernel] [k] do_user_addr_fault
1.48% libc-2.26.so [.] _int_malloc
Without the env var set, CPU usage jumps significantly in both top and perf top (I am running this containerized with CPU throttled to half a core, which is why it tops out around 50%):
top - 16:43:30 up 44 min, 1 user, load average: 1.42, 0.98, 0.68
Tasks: 182 total, 1 running, 133 sleeping, 0 stopped, 1 zombie
%Cpu(s): 1.3 us, 25.0 sy, 0.0 ni, 73.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 3787712 total, 260188 free, 751752 used, 2775772 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 2749948 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
22848 ec2-user 20 0 1034928 49656 38900 S 50.2 1.3 0:07.04 node
12 root 20 0 0 0 0 I 0.3 0.0 0:00.89 rcu_sched
1252 root 20 0 141860 77856 77280 S 0.3 2.1 0:52.56 systemd-journal
and perf top:
31.86% [kernel] [k] io_sq_thread
30.72% [kernel] [k] __io_sq_thread
7.92% [kernel] [k] _cond_resched
4.15% [kernel] [k] rcu_all_qs
0.84% [kernel] [k] finish_task_switch
0.73% [kernel] [k] _raw_spin_unlock_irqrestore
0.73% perf [.] __symbols__insert
0.58% [kernel] [k] do_user_addr_fault
0.49% [kernel] [k] audit_filter_syscall.constprop.0.isra.0
0.39% [kernel] [k] syscall_enter_from_user_mode
0.39% perf [.] rb_next
0.29% libc-2.26.so [.] _int_malloc
I hope this is enough info to work with. Sorry I can't share more of the container setup details.
So I've been looking at this a bit and I'm not so sure it's an io_uring issue, and, afaict, it isn't dependent on the kernel version either. My interpretation is that the posted code sends requests to the SQPOLL ring continuously, so the SQ kernel thread is never idle and uses 100% of the CPU it is attached to. That's not bad per se, but it can be very problematic when only a few CPUs are available.
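To make that concrete, here is a minimal sketch using liburing (an illustration, not libuv's code; it assumes a reasonably recent kernel, since older ones require elevated privileges and registered files for SQPOLL, and linking with -luring). The relevant knob is sq_thread_idle: the kernel-side io_sq_thread busy-polls the submission queue and only parks itself after that many milliseconds without new SQEs, so submitting a request every few milliseconds keeps it spinning on one CPU:

```c
/* Illustration only (not libuv's code): an SQPOLL ring fed a request every
 * 10 ms, the same cadence as the JS repro above. Because the gap between
 * submissions is shorter than sq_thread_idle, the kernel's io_sq_thread
 * never gets to sleep and keeps one CPU busy. Build with: gcc demo.c -luring */
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
  struct io_uring ring;
  struct io_uring_params p;

  memset(&p, 0, sizeof(p));
  p.flags = IORING_SETUP_SQPOLL;  /* kernel-side submission thread */
  p.sq_thread_idle = 100;         /* ms of inactivity before that thread sleeps */

  int ret = io_uring_queue_init_params(64, &ring, &p);
  if (ret < 0) {
    fprintf(stderr, "queue_init: %s\n", strerror(-ret));
    return 1;
  }

  int fd = open("log.txt", O_WRONLY | O_CREAT | O_APPEND, 0644);
  static char buf[25000];         /* about what 'hello'.repeat(5000) produces */
  memset(buf, 'x', sizeof(buf));

  for (int i = 0; i < 500; i++) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    struct io_uring_cqe *cqe;

    io_uring_prep_write(sqe, fd, buf, sizeof(buf), 0);
    io_uring_submit(&ring);       /* with SQPOLL this mostly just wakes the SQ thread */

    /* Drain completions so the CQ ring never overflows. */
    while (io_uring_peek_cqe(&ring, &cqe) == 0)
      io_uring_cqe_seen(&ring, cqe);

    usleep(10 * 1000);            /* 10 ms, i.e. less than sq_thread_idle */
  }

  io_uring_queue_exit(&ring);
  close(fd);
  return 0;
}
```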
To demonstrate the issue I've run the following code, which is very similar to the previous one, in a docker container with CPUs limited to 0.5, 1, 2, 4 and 8:
const fs = require('node:fs');

let times = 0;
const MAX = 1e5;

function write() {
  ++times;
  fs.writeFile('log.txt', 'hello'.repeat(5000), () => { if (times < MAX) write(); });
}

write();
The results are quite telling:
As can be seen, with 0.5 and 1 CPUs the results are bad with io_uring enabled, as the sqpoll thread is probably using most of them. As the available CPUs increase, performance also improves.
So it seems that with heavy I/O, using sqpoll increases CPU usage, which can be problematic in environments with limited resources.
As an aside, I've been investigating using io_uring with and without SQPOLL to see how it behaves depending on the number of CPUs. I created a program using liburing that does something similar to the JS code posted above (though writing more data).
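That program isn't reproduced in this thread, so purely as an illustration of the shape of such a test (and explicitly not the code that produced the numbers discussed below), a liburing harness along these lines, toggling IORING_SETUP_SQPOLL with a command-line flag and issuing one write at a time like the JS snippet does, is roughly what is being compared:

```c
/* Illustration only; NOT the program that produced the results discussed
 * below. Writes the same buffer repeatedly, one request in flight at a time,
 * with SQPOLL selected by passing "sqpoll" as the first argument.
 * Build with: gcc harness.c -luring, then time both variants under a CPU limit. */
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define WRITES 100000
#define CHUNK  65536

int main(int argc, char **argv) {
  struct io_uring ring;
  struct io_uring_params p;
  static char buf[CHUNK];

  memset(&p, 0, sizeof(p));
  if (argc > 1 && strcmp(argv[1], "sqpoll") == 0) {
    p.flags = IORING_SETUP_SQPOLL;   /* kernel submission thread */
    p.sq_thread_idle = 100;          /* ms before it goes idle */
  }

  int ret = io_uring_queue_init_params(64, &ring, &p);
  if (ret < 0) {
    fprintf(stderr, "queue_init: %s\n", strerror(-ret));
    return 1;
  }

  int fd = open("log.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
  memset(buf, 'x', sizeof(buf));

  for (int i = 0; i < WRITES; i++) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    struct io_uring_cqe *cqe;

    io_uring_prep_write(sqe, fd, buf, sizeof(buf), 0);  /* overwrite in place */
    io_uring_submit(&ring);
    io_uring_wait_cqe(&ring, &cqe);   /* one write outstanding at a time */
    io_uring_cqe_seen(&ring, cqe);
  }

  io_uring_queue_exit(&ring);
  close(fd);
  return 0;
}
```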
These are the results I've observed with 0.5, 1, 2, 4 and 8 CPUs.
It looks like that for this test case, with low CPU resources, not using SQPOLL behaves better than using it; with more CPUs the results are more balanced. Maybe it's worth investigating whether not using SQPOLL is a viable option for libuv, as it might perform better in more constrained environments. @bnoordhuis thoughts?
"Normal" is io_uring but without sqpoll? My hunch is that the performance advantage vis-a-vis a properly tuned thread pool is likely a wash in that case.
Your observation that sqpoll doesn't work well with low cpu counts looks legit. Threads fighting for cpu time probably also happens with higher cpu counts when you create an event loop per core.
Suggestions on a way forward? I'm undecided.
"Normal" is io_uring but without sqpoll?
Yes
My hunch is that the performance advantage vis-a-vis a properly tuned thread pool is likely a wash in that case.
Agreed. Adding more to that, I locally modified libuv to not use sqpoll while not batching SQEs, and the results are clear: with little concurrency io_uring is much better, but as soon as concurrency increases it's not even close, as expected; using the threadpool is much better. See the results below:
Your observation that sqpoll doesn't work well with low cpu counts looks legit. Threads fighting for cpu time probably also happens with higher cpu counts when you create an event loop per core.
Suggestions on a way forward? I'm undecided.
Because of the CPU usage issue, I don't think we should default to using SQPOLL, even more so taking into account that these CPU-constrained containerized environments are a very typical way to deploy apps: it's no surprise so many reports have come from those kinds of cases. OTOH, using a non-sqpoll ring without batching SQEs doesn't seem to give us any advantage, as outlined before. Interestingly, we do batch in the ctl ring. So, if batching is out of the question due to latency issues, I don't see much advantage in having io_uring at all (maybe we could keep io_uring with SQPOLL as an opt-in for the cases in which it may make sense).
Crazy idea: maybe we could run the iou ring with SQE batching in a different thread, thus avoiding the latency issues?
Ouch, those numbers are even worse than I expected. Okay, so sqpoll-less is off the table.
I'm fine in principle with changing io_uring to opt-in but I expect node users to start filing performance regressions if we do.
Interestingly we do batch in the ctl ring
Yes. For epoll_ctl that doesn't make an observable difference (save for fewer system calls) as long as mutations are visible to the kernel before we call epoll_pwait.
File operations, on the other hand, should start as soon as possible; every microsecond of delay is additional latency.
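To illustrate the distinction with a rough liburing sketch (again, not libuv's code; the epoll_ctl-via-io_uring request stands in for the kind of work batched on the ctl ring): the control-plane request is queued and flushed together with the wait in a single syscall, while the file write is submitted as soon as it is prepared:

```c
/* Rough sketch (liburing, not libuv's actual code) of the two submission
 * patterns discussed above. The epoll_ctl-style request only has to reach the
 * kernel before we wait, so it can sit in the SQ and be flushed together with
 * the wait in one syscall; the file write is submitted immediately because
 * every microsecond it sits unsubmitted is added latency. */
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

int main(void) {
  struct io_uring ring;
  struct io_uring_sqe *sqe;
  struct io_uring_cqe *cqe;
  const char msg[] = "hello";

  if (io_uring_queue_init(8, &ring, 0) < 0)
    return 1;

  int fd = open("log.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
  int pipefd[2];
  pipe(pipefd);
  int epfd = epoll_create1(0);

  /* Data plane: submit the write right away; one syscall per request. */
  sqe = io_uring_get_sqe(&ring);
  io_uring_prep_write(sqe, fd, msg, sizeof(msg) - 1, 0);
  io_uring_submit(&ring);

  /* Control plane: queue the epoll_ctl and let it ride along with the wait;
   * io_uring_submit_and_wait() flushes the SQ and blocks in one syscall. */
  struct epoll_event ev = { .events = EPOLLIN, .data.fd = pipefd[0] };
  sqe = io_uring_get_sqe(&ring);
  io_uring_prep_epoll_ctl(sqe, epfd, pipefd[0], EPOLL_CTL_ADD, &ev);
  io_uring_submit_and_wait(&ring, 2);

  for (int i = 0; i < 2; i++) {
    io_uring_wait_cqe(&ring, &cqe);
    printf("cqe res=%d\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);
  }

  io_uring_queue_exit(&ring);
  return 0;
}
```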
Version: 1.46.0
Platform: Linux 5.10.186 #1 SMP Sat Jul 8 18:25:56 WEST 2023 i686 GNU/Linux
Hi, I had BIND 9.18.16 compiled with libuv 1.44.2 working without any issues. After compiling and installing libuv 1.46.0, the named process was continuously using 100% CPU. I reverted to version 1.44.2, restoring normality. I reproduced it twice with the same results, so it doesn't seem like the problem is with BIND itself. Thx.