netty / netty-incubator-transport-io_uring


High io_wait CPU usage #247

Open Rolly992 opened 2 months ago

Rolly992 commented 2 months ago

When switching the application to io_uring as the transport, we observed a huge increase in CPU io_wait time, going from 0.01% to ~95% under the same load. Overall CPU usage, however, stays low, at around 4% per core, as you can see in the charts below. The app was running on a relatively fresh OS and kernel version.

Is such an increase in io_wait expected? What is the best way to diagnose it further? Recording a JFR didn't give much; the Socket IO events are missing.

As for the side effects of switching to io_uring, the app started to see many more timeouts, for both incoming and outgoing HTTP requests. For example, timeouts are observed both when serving regular requests and when serving the metrics endpoint, even though the latter is supposed to be fast.

IO_uring enabled:

[chart: netty-io_uring]

IO_uring disabled:

[chart: netty-std]

JRE vendor/version: java-21-amazon-corretto
Exact version: 21.0.3+9-LTS
IO_uring version: 0.0.25.Final, x86-64
Netty version: 4.1.107-109
OS version: Ubuntu 22.04.4 LTS, on AWS EC2
Kernel version: 6.5.0-1018-aws
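For context, the switch itself is just a swap of the event loop group and channel class. Below is a minimal sketch of what such a swap typically looks like, assuming a standard ServerBootstrap; the group size and the rest of the setup are illustrative, not the actual application code:

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.ServerChannel;
import io.netty.channel.epoll.EpollEventLoopGroup;
import io.netty.channel.epoll.EpollServerSocketChannel;
import io.netty.incubator.channel.uring.IOUring;
import io.netty.incubator.channel.uring.IOUringEventLoopGroup;
import io.netty.incubator.channel.uring.IOUringServerSocketChannel;

public final class TransportSwitch {
    public static void main(String[] args) {
        EventLoopGroup group;
        Class<? extends ServerChannel> channelType;

        // Use the io_uring native transport when available, otherwise fall back to epoll.
        if (IOUring.isAvailable()) {
            group = new IOUringEventLoopGroup(1);
            channelType = IOUringServerSocketChannel.class;
        } else {
            group = new EpollEventLoopGroup(1);
            channelType = EpollServerSocketChannel.class;
        }

        ServerBootstrap bootstrap = new ServerBootstrap()
                .group(group)
                .channel(channelType);
        // ... the child handler / pipeline setup is identical for both transports
    }
}
```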

franz1981 commented 2 months ago

Probably async-profiler can help there, but first, quick check:

This change https://github.com/netty/netty-incubator-transport-io_uring/commit/b047ceca178996fec707b0820807e11edfbd8792#diff-e660183916ec399e605aac22a0864bd4b02ee986c85f5d96256872b5cf7f6783R49 should have exactly the opposite effect of what you're seeing, and it should already be included in 0.0.25.Final (x86-64), so this surprises me a bit...

Rolly992 commented 2 months ago

I've also double-checked, and it turned out that I had been using io_uring v0.0.24. I switched to 0.0.25 and it didn't improve the io_wait times.

Surprisingly, I see the same picture even when the service is completely idle, with no traffic routed to it and no background tasks running. Even in this case the io_wait time is above 90%. This applies to both 0.0.24 and 0.0.25.

franz1981 commented 2 months ago

OK, so I found the reason. NIO and epoll can use zero-copy file transfer while io_uring cannot (search for the same issue in this repo).
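For reference, the zero-copy path in question is Netty's FileRegion write (sendfile under NIO/epoll). Here is a minimal sketch of the two ways a file is commonly written to a channel; whether the affected application serves files at all, and the handler wiring shown here, are assumptions rather than something confirmed in this issue:

```java
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.DefaultFileRegion;
import io.netty.handler.stream.ChunkedNioFile;

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

final class FileWriteSketch {

    // Zero-copy path: DefaultFileRegion lets the transport use sendfile(2),
    // so the file bytes never pass through user space (NIO and epoll support this).
    static void writeZeroCopy(ChannelHandlerContext ctx, File file) {
        ctx.writeAndFlush(new DefaultFileRegion(file, 0, file.length()));
    }

    // Copying path: ChunkedNioFile reads the file into buffers and writes them
    // to the socket, i.e. the data is copied through user space. This is the kind
    // of fallback a transport without FileRegion support ends up with.
    // Requires a ChunkedWriteHandler in the pipeline.
    static void writeChunked(ChannelHandlerContext ctx, File file) throws IOException {
        RandomAccessFile raf = new RandomAccessFile(file, "r");
        ctx.writeAndFlush(new ChunkedNioFile(raf.getChannel()));
    }
}
```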

franz1981 commented 2 months ago

> I see the same picture even when the service is completely idle

I would profile with async-profiler or something like `perf top -a`.
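A sketch of how the two profiles could be captured programmatically with async-profiler's Java API; the durations, sampling interval, and output file names are illustrative, and attaching the profiler from the command line works just as well:

```java
import one.profiler.AsyncProfiler;

public final class ProfileEventLoop {
    public static void main(String[] args) throws Exception {
        AsyncProfiler profiler = AsyncProfiler.getInstance();

        // Wall-clock mode: shows where threads spend time, including blocking/waiting,
        // which is what the io_wait numbers ultimately reflect.
        profiler.execute("start,event=wall,interval=5ms,file=wall.html");
        Thread.sleep(60_000);
        profiler.execute("stop,file=wall.html");

        // CPU mode: on-CPU samples only, useful to compare the EpollEventLoop.run
        // and IOUringEventLoop.run towers.
        profiler.execute("start,event=cpu,file=cpu.html");
        Thread.sleep(60_000);
        profiler.execute("stop,file=cpu.html");
    }
}
```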

Rolly992 commented 2 months ago

I've profiled with async-profiler in both CPU and wall-clock modes. The wall-clock profiles look almost identical, while the CPU profiles show a bit more difference.

The difference in the wall-clock profiles is, as expected, what the compute threads are busy with (or waiting in): the corresponding transport implementations.

IO_uring:
[flame graph: uring-wall2]
Epoll:
[flame graph: epoll-wall]

I couldn't find any meaningful difference in the CPU profiles, though one flame graph "tower" is rooted in EpollEventLoop.run and the other in IOUringEventLoop.run.

Is there any particular information that would help investigate this better? Could it be that, for the call io_uring is waiting in, io_wait is an expected state and not an issue?

franz1981 commented 2 months ago

Yep, it seems that both are waiting, meaning there is not enough network activity to perform further I/O.

Rolly992 commented 6 days ago

Thanks for all of the suggestions and answers!

> OK, so I found the reason. NIO and epoll can use zero-copy file transfer while io_uring cannot (search for the same issue in this repo).

And for a load that epoll can normally handle, does the absence of zero-copy file transfer explain the timeouts, in both the service and even the HTTP metrics endpoints? Or could it explain them, at least theoretically? Does it mean that io_uring's performance gains are not realized when working with files (reading/writing)?