openbmc / dbus-sensors

D-Bus configurable sensor scanning applications
Apache License 2.0
iowait consuming too much cpu resources #30

Open renweihang opened 8 months ago

renweihang commented 8 months ago

After a normal BMC startup, I checked the CPU usage: [image]

Then I stopped all the sensor services with the following commands:
systemctl stop xyz.openbmc_project.hwmontempsensor.service
systemctl stop xyz.openbmc_project.fansensor.service
systemctl stop xyz.openbmc_project........service
......

I checked the CPU usage again: [image]

Even if I start only one hwmon sensor service (xyz.openbmc_project.hwmontempsensor.service), without any sensors, the issue is still present: [image]

The following shows the situation before and after stopping the hwmon service: [image]

renweihang commented 8 months ago

It seems to be related to sdbusplus: the problem occurs whenever sdbusplus is used. The following is the debugging code.

Using sdbusplus: [image]

Not using sdbusplus: [image]

PS: The OpenBMC commit I am using is 6fddef299932b1270a799e78566e25daa911f742

So I opened a new issue at https://github.com/openbmc/sdbusplus/issues/92. I hope to get your help. Thanks a lot!

edtanous commented 8 months ago

What platform was this tested on?

renweihang commented 8 months ago

meta-g220a

I also tried newer commits, and the issue is still there:
OpenBMC commit: 1f0056e138d1eb872784fc20c21e1e340d64a74c (Fri Dec 15 17:20:20 2023 -0600)
dbus-sensors commit: https://github.com/openbmc/dbus-sensors/commit/28b88233a598ff64c073e2aaf5d178da17e31b91

edtanous commented 8 months ago

Looking at https://github.com/openbmc/meta-bytedance/blob/master/meta-g220a/recipes-phosphor/configuration/entity-manager/g220a_baseboard.json

First off, this file shouldn't be in the meta layer. Issues with this file would've been caught earlier by CI if it had been put in the right place.

I see a large number of sensors that are very "expensive" to read. Considering this is an AST2500, it seems very likely that the I/O load you're seeing is real, and a result of too much I/O being done on that platform with that configuration. I also see a number of config stanzas that are just unsupported by upstream (like pmem). How certain are you that you tested this on an upstream build?

To triage, I would start by removing the various config types one at a time until you find the one that's causing the most contention, then look at what you can do to increase the performance of those sensor types. It's very likely that you just need to tune your platform's read rates to account for the bandwidth of your I2C lanes, especially for PMBus devices, which are non-trivial to read.
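As a rough illustration of the bandwidth point above, here is a back-of-the-envelope sketch that estimates what fraction of a poll interval an I2C bus spends just clocking sensor reads. The function name and every number in the usage note are hypothetical, not taken from the g220a configuration. The only fixed fact used is that each byte on I2C costs 9 clock cycles (8 data bits plus the ACK bit). The result is a lower bound: it ignores start/stop conditions, clock stretching, driver overhead, and device conversion time.

```python
def i2c_bus_utilization(reads_per_poll, bytes_per_read, bus_hz, poll_interval_s):
    """Estimate the fraction of each poll interval the bus is busy.

    reads_per_poll  -- number of register reads issued per polling cycle
    bytes_per_read  -- bytes on the wire per read, including address and
                       command bytes (an assumption the caller supplies)
    bus_hz          -- I2C clock frequency in Hz
    poll_interval_s -- polling period in seconds
    """
    if bus_hz <= 0 or poll_interval_s <= 0:
        raise ValueError("bus_hz and poll_interval_s must be positive")
    clocks_per_byte = 9  # 8 data bits + 1 ACK/NACK bit
    clocks_per_poll = reads_per_poll * bytes_per_read * clocks_per_byte
    busy_seconds = clocks_per_poll / bus_hz
    return busy_seconds / poll_interval_s
```

For example, with a hypothetical 40 reads of 5 bytes each on a 100 kHz bus, `i2c_bus_utilization(40, 5, 100_000, 1.0)` gives the busy fraction for a 1 s poll interval; shortening the interval to 0.1 s multiplies that fraction by ten. A value approaching 1.0 means the configured read rate cannot be met on that bus.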

Note that a high iowait percentage is not a bug in itself. It is likely that in the past this platform was just blocking in userspace, and sensors were scanning slower than specified in the config file. Now that we have moved to io_uring, that same contention shows up as iowait instead of silently happening in userspace. This doesn't mean that the actual sensor scan rates are any worse than they were before. In fact, they're likely better because of io_uring, but it does make this problem more apparent.
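To make the accounting concrete, here is a minimal sketch (not part of dbus-sensors; the function names are hypothetical) that computes the iowait share from two samples of the aggregate `cpu` line in `/proc/stat`. In that line iowait is a separate counter from user and system time: it counts time the CPU sat idle while I/O was outstanding, not time spent executing code.

```python
import time


def parse_cpu_line(stat_text):
    """Return the counters from the aggregate 'cpu' line of /proc/stat text."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before_text, after_text):
    """Percentage of elapsed CPU time accounted as iowait between two samples."""
    before = parse_cpu_line(before_text)
    after = parse_cpu_line(after_text)
    deltas = [a - b for a, b in zip(after, before)]
    # Field order: user, nice, system, idle, iowait, irq, softirq, steal.
    # Guest time is already included in user/nice, so only sum the first 8.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


def sample_iowait(interval_s=1.0, path="/proc/stat"):
    """Sample /proc/stat twice, interval_s apart, and return the iowait %."""
    with open(path) as stat_file:
        before = stat_file.read()
    time.sleep(interval_s)
    with open(path) as stat_file:
        after = stat_file.read()
    return iowait_percent(before, after)
```

Calling `sample_iowait()` on the BMC with and without the sensor services running, alongside the user and system percentages, shows whether the CPU is actually busy or merely idle with I/O pending.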

Good luck with your debug. Let us know what your findings are, and if we can transfer this bug to be g220 specific.

renweihang commented 8 months ago

Thanks a lot, it is indeed related to io_uring: [image]

So, as you said, this is not a bug, just a characteristic of io_uring? Do we need to pay any further attention to this issue? If left unattended, will the low CPU idle percentage affect the normal operation of other processes?

y11627 commented 5 months ago

iowait drops when this kernel commit is reverted: "io_uring: Use io_schedule* in cqring" https://github.com/openbmc/linux/commit/f32dfc802e8733028088edf54499d5669cb0ef69

amboar commented 5 months ago

The linked patch is a change to accounting more than anything else. I don't think it's particularly concerning?

https://lore.kernel.org/lkml/538065ee-4130-6a00-dcc8-f69fbc7d7ba0@kernel.dk/