home-assistant / core

Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0

Processor use shows 0% and 100% limits after updating to 2023.11 #103298

Closed Mariusthvdb closed 6 months ago

Mariusthvdb commented 9 months ago

The problem

Since release 2023.11, processor use is suddenly showing 0% and 100% limits. I am still trying to figure out whether this happens only on restarts or also during runtime.

[Screenshot: 2023-11-03 12:01:53]

On my system this never happened before; the card below always showed the limits as actual percentages.

If anything, this is really annoying when trying to show a card like this:

[Screenshot: 2023-11-03 12:20:59]

It renders the card useless, really.

It seems to be a bug, because how can usage ever be 0%? The 100% is also unexplained, even on restart.

What version of Home Assistant Core has the issue?

2023.11

What was the last working version of Home Assistant Core?

2023.10

What type of installation are you running?

Home Assistant OS

Integration causing the issue

system monitor

Link to integration documentation on our website

https://www.home-assistant.io/integrations/systemmonitor/

Diagnostics information

No response

Example YAML snippet

No response

Anything in the logs that might be useful for us?

No response

Additional information

https://community.home-assistant.io/t/2023-11-to-do-add-release-title/634647/168?u=mariusthvdb
https://community.home-assistant.io/t/2023-11-to-do-add-release-title/634647/90

home-assistant[bot] commented 9 months ago

systemmonitor documentation systemmonitor source

EDelsman commented 9 months ago

I see the same on an RPi 4, mostly in the period after a restart. I notice no discernible ill effects or performance gain; I'm just curious, as 0 indeed seems unlikely, especially during a reboot. The left side of the graph is before the update to 2023.11, the right side is after. [Image: IMG_0622]

Anto79-ops commented 9 months ago

Interesting, I reported this in beta chat in discord, as I'm definitely seeing the same.

Mariusthvdb commented 9 months ago

what was the dev response on that?

stalakerob commented 9 months ago

Just tried 2023.11.1. Still the same issue.

zSprawl commented 9 months ago

I’m seeing the same.

I had a random battery sensor show 100.0000000001% too, so it makes me wonder if we got a rounding error or something. Just a guess though.

N3rdix commented 9 months ago

Maybe this is related to some changes in psutil or its buffer in general? I couldn't see a direct trigger, but couldn't the behaviour be improved according to the psutil docs (they recommend calling cpu_percent at least a second time to get accurate results)?

Code:

    elif type_ == "processor_use":
        state = round(psutil.cpu_percent(interval=None))
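
One way to read the psutil advice mentioned above: the first `cpu_percent(interval=None)` call has no earlier sample to compare against, so its 0.0 is a placeholder rather than a measurement. Below is a minimal sketch, assuming a hypothetical wrapper `CpuPercentReader` (not the integration's actual code) that discards the first sample and reports `None` until a real reading exists. The sampler is passed in so the snippet runs without psutil installed; the sample values are made up.

```python
from typing import Callable, Optional


class CpuPercentReader:
    """Hypothetical wrapper that hides the meaningless first sample."""

    def __init__(self, sample: Callable[[], float]) -> None:
        self._sample = sample
        self._primed = False

    def read(self) -> Optional[int]:
        value = self._sample()
        if not self._primed:
            # The first call only establishes a baseline; discard its result.
            self._primed = True
            return None
        return round(value)


# Stand-in sampler for illustration; the values are invented.
fake_samples = iter([0.0, 7.4, 12.6])
reader = CpuPercentReader(lambda: next(fake_samples))
readings = [reader.read() for _ in range(3)]
print(readings)  # [None, 7, 13]
```

With psutil available, the sampler would presumably be `lambda: psutil.cpu_percent(interval=None)`.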

issue-triage-workflows[bot] commented 6 months ago

There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates. Please make sure to update to the latest Home Assistant version and check if that solves the issue. Let us know if that works for you by adding a comment 👍 This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.

Mariusthvdb commented 6 months ago

still seeing this

gjohansson-ST commented 6 months ago

@Mariusthvdb still seeing exactly what and on which version?

gjohansson-ST commented 6 months ago

The faulty 0 return should already be fixed, but I'm not sure why returning 100 would be a fault.

Mariusthvdb commented 6 months ago

I meant that I am still seeing the exact same behavior as in the issue's opening post.

Regular 0% (even during runtime, so that cannot be correct). The 100% also seems very unlikely and was never returned before, unless it was truly the case, in a looping automation or so.

But not on each and every startup?

[Screenshot: 2024-02-14 21:48:57]

Min and max...

But check this, it shows exactly what the issue is:

[Screenshot: 2024-02-14 21:53:35]

By the way, I am running HA OS 2024.2.1.

EDelsman commented 6 months ago

2024.2.1. Two reboots below; after the second one I took care not to do anything special with HA. The reboot was just for the sake of the graph. The period with weird peaks seems to last way longer than the reboot. [Image: IMG_0657]

stalakerob commented 6 months ago

I'm also seeing CPU use dropping to 0 regularly. Running HA on Proxmox.

[Image: cpu_load]

gjohansson-ST commented 6 months ago

The drop to 0 has been fixed, but I think that fix may not arrive until the next patch release. However, there is nothing really indicating a fault that would raise it to 100, so I'm not sure what to do about that.

Mariusthvdb commented 6 months ago

Could you please explain why we see this from the mentioned update on, as it was never experienced before?

I am asking because, if you believe we had these 100% readings before, I have to state that was not the case. It is a remarkable change, breaking backend templates, automations, etc.

gjohansson-ST commented 6 months ago

Regarding the 0 output: it came up in another issue, and I looked briefly at the past code but mainly at the psutil documentation, which clearly says an output of 0 is faulty and should be ignored. So that was a bug that has already been solved (I didn't really look into why this was missed during the implementation of the coordinator).

As for the 100 output, I don't know, as so far I have seen nothing that would result in this, and I don't think ignoring the 100 is the way to go either.

The frequency of the sensor is every 15s (I believe), could you try to change this to 20/30 or something to see what behavior we get?

EDelsman commented 6 months ago

Not hindered by in-depth knowledge of how this works, so I may be totally wrong. But if we have unexpected lows and unexpected highs in a time-sampled measurement, then my first thought is that the 0s are points where measurements are missed, and that those are then counted later on at the unexpected highs. In that case, ignoring the 0s and keeping the highs would raise the average percentage, thus misrepresenting the situation.
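
The averaging concern above can be checked with a little arithmetic. The numbers below are invented for illustration, assuming a steady 8% load where one sample is missed (reported as 0) and its share shows up in the next sample (16).

```python
from statistics import mean

# Invented samples: the true load is a steady 8%, but one reading is
# missed (0) and its share is counted in the following reading (16).
samples = [8, 0, 16, 8]

with_zeros = mean(samples)                            # zeros kept
without_zeros = mean([s for s in samples if s != 0])  # zeros dropped, highs kept

print(with_zeros)               # 8
print(round(without_zeros, 2))  # 10.67
```

Under these assumptions, dropping the zeros while keeping the compensating high overstates the average by about a third.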

This combined with the fact the problem wasn't there before, and that it now only occurs in the hour after a reboot, makes me wonder how that can be. I understand that a reboot causes high cpu usage, but for an hour when there's no actual change to the system?

If 0s are expected from the measurement, then why do the 0s only happen in that first hour? Also, from what I've read, 0 is expected only for the first measurement because there aren't any samples yet. But it's not something you'd ignore on a regular basis. I can imagine that CPU measurement is hard when the system is very busy, but it kind of defeats the purpose of CPU measurement if that is the case. And after an hour the 0s are gone, together with the highs.

Mariusthvdb commented 6 months ago

not yet investigated in depth, but latest dev 2024.3.0.dev20240216 makes the processor use entity go unknown for the moments we saw 0% before.

Seems hardly an improvement tbh... no error in log

[Screenshots: 2024-02-16 12:38:12 and 2024-02-16 12:38:18]

gjohansson-ST commented 6 months ago

unknown is not unavailable

Mariusthvdb commented 6 months ago

sorry for that typo. edited that above

hope that is not all of the response though... You do see the issue we're facing here, not sure what else to add now

other than my instance is running for an hour now, and the unknown is still reported

[Screenshot: 2024-02-16 13:14:41]

garry0garry commented 6 months ago

"This combined with the fact the problem wasn't there before, and that it now only occurs in the hour after a reboot, ..."

I think the problem can be checked using this script:

-- List the moments at which sensor.processor_use reported 'unknown'
SELECT strftime('%Y-%m-%d %H:%M:%f', states.last_updated_ts, 'unixepoch', 'localtime') AS "Time",
       states.state AS "Value",
       states_meta.entity_id AS "Entity"
FROM states
JOIN states_meta ON states.metadata_id = states_meta.metadata_id
WHERE states_meta.entity_id = 'sensor.processor_use'
  AND strftime('%Y-%m-%d %H:%M:%f', states.last_updated_ts, 'unixepoch', 'localtime')
      BETWEEN '2024-02-19 12:30:00' AND '2024-02-19 13:30:00'
  AND states.state = 'unknown';

Where:
'2024-02-19 12:30:00' - Home Assistant start time
'2024-02-19 13:30:00' - Home Assistant start time + 60 min
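
The same check could be scripted. This is a minimal sketch using Python's `sqlite3` module, assuming the table and column names from the query above; the function name `count_unknown_states` is hypothetical. It compares the raw `last_updated_ts` values against the time window instead of formatting every row, and it only counts the matching rows.

```python
import sqlite3
from datetime import datetime


def count_unknown_states(
    conn: sqlite3.Connection, entity_id: str, start: datetime, end: datetime
) -> int:
    """Count 'unknown' states for one entity between two local times."""
    query = """
        SELECT COUNT(*)
        FROM states
        JOIN states_meta ON states.metadata_id = states_meta.metadata_id
        WHERE states_meta.entity_id = ?
          AND states.last_updated_ts BETWEEN ? AND ?
          AND states.state = 'unknown'
    """
    # Naive datetimes are interpreted as local time by timestamp().
    params = (entity_id, start.timestamp(), end.timestamp())
    return conn.execute(query, params).fetchone()[0]
```

It would be called with a connection to the recorder database (opened with `sqlite3.connect(...)` on a copy of the file, to be safe), the entity `sensor.processor_use`, and the start time and start time + 60 min as `datetime` objects.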

garry0garry commented 6 months ago

Or this automation that writes to the log:

- alias: CPU load
  trigger:
    platform: time_pattern
    minutes: /1
  action:
    - service: system_log.write
      data:
        message: >-
          CPU load {{ states('sensor.processor_use') }}%.
        level: warning

gjohansson-ST commented 6 months ago

Hi. First off I acknowledge the problem and I have the same thing on my prod (but not on dev). We're not going to roll back but obviously a solution needs to be found somehow.

So it's work in progress to get to the root cause of the issue and get a permanent fix in order to resolve it.

No need for more posts about reproducing the issue unless there are constructive proposals on how to fix it.

Thanks

gjohansson-ST commented 6 months ago

So, long story short: psutil became thread-aware, and since the CPU percent reading (among others) isn't implemented to run only in the main thread, it sometimes goes to 0, which is a false value. That is why the state is set to unknown.
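
To illustrate the explanation above, here is a toy model, not psutil's actual implementation: a sampler that keeps its "previous reading" state per thread, so the first call made from any new thread has nothing to compare against and returns 0.0. `PerThreadSampler` and its fixed 8.0 reading are hypothetical. Routing every call through one dedicated worker thread is one possible way to avoid the repeated zeros; the thread does not say whether the linked PR takes this approach.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class PerThreadSampler:
    """Toy stand-in that tracks 'called before' state separately per thread."""

    def __init__(self) -> None:
        self._local = threading.local()

    def cpu_percent(self) -> float:
        first_call = not getattr(self._local, "primed", False)
        self._local.primed = True
        # A thread's first call has no baseline, so it yields 0.0.
        return 0.0 if first_call else 8.0


def sample_in_new_thread(sampler: PerThreadSampler) -> float:
    """Take one sample from a brand-new thread."""
    result = []
    thread = threading.Thread(target=lambda: result.append(sampler.cpu_percent()))
    thread.start()
    thread.join()
    return result[0]


sampler = PerThreadSampler()

# Every call from a fresh thread is a "first call" and reports 0.0.
scattered = [sample_in_new_thread(sampler) for _ in range(3)]

# One dedicated worker thread: only its very first call lacks a baseline.
with ThreadPoolExecutor(max_workers=1) as pool:
    dedicated = [pool.submit(sampler.cpu_percent).result() for _ in range(3)]

print(scattered)  # [0.0, 0.0, 0.0]
print(dedicated)  # [0.0, 8.0, 8.0]
```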

A fix is coming (PR has been linked) which I hope can be managed and implemented shortly.

garry0garry commented 6 months ago

Do I understand correctly that the fix was not included in the 2024.2.3 release?

2024-02-23 23:41:13.532 unknown sensor.processor_use
2024-02-23 23:41:58.532 5   sensor.processor_use
2024-02-23 23:42:13.532 unknown sensor.processor_use
2024-02-23 23:42:28.532 14  sensor.processor_use
2024-02-23 23:42:43.533 unknown sensor.processor_use
2024-02-23 23:43:28.532 5   sensor.processor_use
2024-02-23 23:43:43.532 unknown sensor.processor_use
2024-02-23 23:45:58.531 7   sensor.processor_use
2024-02-23 23:46:13.533 unknown sensor.processor_use
2024-02-23 23:46:28.535 8   sensor.processor_use
2024-02-23 23:46:43.532 unknown sensor.processor_use
2024-02-23 23:47:28.531 7   sensor.processor_use
2024-02-23 23:47:43.532 unknown sensor.processor_use
2024-02-23 23:48:28.532 8   sensor.processor_use
2024-02-23 23:48:43.531 unknown sensor.processor_use
2024-02-23 23:49:28.532 8   sensor.processor_use
2024-02-23 23:49:58.533 unknown sensor.processor_use
2024-02-23 23:50:28.530 8   sensor.processor_use
2024-02-23 23:50:43.532 unknown sensor.processor_use
2024-02-23 23:51:28.533 8   sensor.processor_use
2024-02-23 23:51:58.531 unknown sensor.processor_use
2024-02-23 23:52:13.531 8   sensor.processor_use
2024-02-23 23:52:28.532 unknown sensor.processor_use
2024-02-23 23:52:43.533 8   sensor.processor_use
2024-02-23 23:52:58.530 unknown sensor.processor_use
2024-02-23 23:53:13.533 8   sensor.processor_use
2024-02-23 23:53:28.531 unknown sensor.processor_use
2024-02-23 23:53:43.531 7   sensor.processor_use
2024-02-23 23:53:58.532 8   sensor.processor_use
2024-02-23 23:54:13.532 unknown sensor.processor_use
2024-02-23 23:54:28.531 7   sensor.processor_use
2024-02-23 23:54:43.531 unknown sensor.processor_use
2024-02-23 23:54:58.530 8   sensor.processor_use
2024-02-23 23:55:13.533 unknown sensor.processor_use
2024-02-23 23:55:28.531 8   sensor.processor_use
2024-02-23 23:55:43.531 unknown sensor.processor_use
2024-02-23 23:56:13.531 9   sensor.processor_use
2024-02-23 23:56:28.530 8   sensor.processor_use

gjohansson-ST commented 6 months ago

As 2024.2.3 was released yesterday and this was fixed about an hour ago, then yes, it wasn't included. As the beta starts on Wednesday, I'm not sure whether there will be another patch before 2024.3 is released, so we'll see.

erkr commented 6 months ago

Thanks for fixing, now just wait for 24.3

Mariusthvdb commented 6 months ago

@gjohansson-ST, I do believe the issues are gone! The 0% had already been taken out but was replaced with unknown. The latest PR is in dev, which I just installed. Have a look:

[Screenshot: 2024-02-24 20:31:58]

the 100% peaks at startup are gone now too.

(regular processor % also went down significantly, but that has to do with improvements elsewhere...)

nice.

akicker commented 5 months ago

Getting worse in 2024.2.4, mostly unknown! [Image]

gjohansson-ST commented 5 months ago

The fix isn't coming until 2024.3 so at this point either turn the sensor off or ignore it. Nothing to do until next release to have this fixed.

Mariusthvdb commented 5 months ago

proof!

[Screenshot: 2024-02-27 16:23:52]