renweihang opened 10 months ago
It seems to be related to sdbusplus; the problem occurs whenever sdbusplus is used. The following is the debugging code. Using sdbusplus:
Not using sdbusplus:
PS: The OpenBMC commit I am using is 6fddef299932b1270a799e78566e25daa911f742
So, I opened a new issue at https://github.com/openbmc/sdbusplus/issues/92. Hope to get your help, thanks a lot!
What platform was this tested on?
meta-g220a
And I used a new commit; the issue is still there: OpenBMC commit: 1f0056e138d1eb872784fc20c21e1e340d64a74c (Fri Dec 15 17:20:20 2023 -0600), dbus-sensors commit: https://github.com/openbmc/dbus-sensors/commit/28b88233a598ff64c073e2aaf5d178da17e31b91
First off, this file shouldn't be in the meta layer. Issues with this file would've been caught earlier by CI if it had been put in the right place.
I see a large number of very "expensive" to read sensors. Considering this is an ast2500, it seems very likely that the io load you're seeing is real, and a result of too much IO being done on that platform with that configuration. I also see a number of config stanzas that are just unsupported by upstream (like pmem). How certain are you that you tested this on an upstream build?
To triage, I would start by removing the various config types until you find the one that's causing the most contention, then look at what you can do to increase the performance of those sensor types. It's very likely that you just need to optimize your platform's read rates to account for the bandwidth of your i2c lanes, especially for pmbus devices, which are non-trivial to read.
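A minimal sketch of that triage loop (the unit names and the 30-second settle time are examples, not the platform's actual configuration), sampling the cumulative iowait counter from /proc/stat after stopping each sensor daemon:

```shell
#!/bin/sh
# Hypothetical triage loop: stop each dbus-sensors daemon in turn and
# sample cumulative iowait ticks (field 6 of the "cpu" line in /proc/stat).
# The unit names below are examples; adjust for your platform.
for svc in xyz.openbmc_project.hwmontempsensor.service \
           xyz.openbmc_project.fansensor.service; do
    systemctl stop "$svc"
    sleep 30
    iowait=$(awk '/^cpu /{print $6}' /proc/stat)
    echo "$svc stopped: cumulative iowait ticks = $iowait"
done
```

Whichever stop produces the largest drop in the iowait growth rate points at the config type worth optimizing first.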
Note that a high iowait percentage is not a bug in itself. It's likely that in the past this platform was just blocking in userspace, and sensors were scanning slower than specified in the config file. When we moved to uring, that same contention shows up as iowait instead of silently happening in userspace. This doesn't mean the actual sensor scan rates are any worse than before; in fact, they're likely better because of uring, but it does make this problem more apparent.
Good luck with your debug. Let us know what your findings are, and if we can transfer this bug to be g220 specific.
Thanks a lot, it is indeed related to io_uring.
So, as you said, this is not a bug, just a characteristic of io_uring? Do we need to pay attention to this issue anymore? If left unattended, will the low CPU idle affect the normal operation of other processes?
iowait drops when reverting this kernel commit, "io_uring: Use io_schedule* in cqring ": https://github.com/openbmc/linux/commit/f32dfc802e8733028088edf54499d5669cb0ef69
The linked patch is a change to accounting more than anything else. I don't think it's particularly concerning?
https://lore.kernel.org/lkml/538065ee-4130-6a00-dcc8-f69fbc7d7ba0@kernel.dk/
After normal BMC startup, check the CPU usage:
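As an aside, one way to quantify the CPU usage being compared here is to compute iowait as a percentage of total CPU time between two /proc/stat samples (a sketch; the 5-second window is arbitrary, and the "cpu" line fields used are user, nice, system, idle, iowait):

```shell
#!/bin/sh
# Sketch: iowait as a percentage of total CPU time over a 5-second window.
# /proc/stat "cpu" line fields: user nice system idle iowait ...
read -r _ u1 n1 s1 id1 io1 _ < /proc/stat
sleep 5
read -r _ u2 n2 s2 id2 io2 _ < /proc/stat
total=$(( (u2 + n2 + s2 + id2 + io2) - (u1 + n1 + s1 + id1 + io1) ))
echo "iowait%: $(( 100 * (io2 - io1) / total ))"
```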
Then stop all the sensor services using the following commands:
systemctl stop xyz.openbmc_project.hwmontempsensor.service
systemctl stop xyz.openbmc_project.fansensor.service
systemctl stop xyz.openbmc_project........service
......
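The per-service commands above can also be collapsed into one loop (a sketch; the glob pattern is an assumption and should be checked against the units actually loaded on the platform):

```shell
#!/bin/sh
# Stop every loaded dbus-sensors unit matching the pattern in one pass.
# The glob is an example; verify what it matches first with:
#   systemctl list-units 'xyz.openbmc_project.*sensor*.service'
for svc in $(systemctl list-units --no-legend \
             'xyz.openbmc_project.*sensor*.service' | awk '{print $1}'); do
    systemctl stop "$svc"
done
```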
Check the CPU usage again:
Even if I start just one sensor hwmon service (xyz.openbmc_project.hwmontempsensor.service), with no sensors configured, the issue is still there.
The following shows the situation before and after stopping the hwmon service: