home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0
69.03k stars 28.28k forks source link

statistics on 2024.5.0 causing increased CPU usage and high temps #116650

Closed myromeo closed 1 week ago

myromeo commented 2 weeks ago

The problem

Updated to 2024.5.0 and noted an increase in CPU usage (around 5% higher) resulting in CPU temps increasing by 30%+. Rolled back to 2024.4.4 and usage and temps immediately dropped.

What version of Home Assistant Core has the issue?

core-2024.5.0

What was the last working version of Home Assistant Core?

core-2024.4.4

What type of installation are you running?

Home Assistant OS

Integration causing the issue

No response

Link to integration documentation on our website

No response

Diagnostics information

No response

Example YAML snippet

No response

Anything in the logs that might be useful for us?

Can’t see any errors and everything is running.

Additional information

X64, intel i5-4590T on Lenovo m93p tiny.

bdraco commented 2 weeks ago

Please post a callgrind file from the profiler

https://my.home-assistant.io/redirect/config_flow_start/?domain=profiler https://my.home-assistant.io/redirect/developer_call_service/?service=profiler.start

nijel commented 2 weeks ago

I observe the same symptoms, callgrind is here:

callgrind.out.1714720380542446.txt

If I read it correctly, it seems to be caused by statistics platform sensors in my case.

home-assistant[bot] commented 2 weeks ago

Hey there @thomdietrich, mind taking a look at this issue as it has been labeled with an integration (statistics) you are listed as a code owner for? Thanks!

Code owner commands Code owners of `statistics` can trigger bot actions by commenting: - `@home-assistant close` Closes the issue. - `@home-assistant rename Awesome new title` Renames the issue. - `@home-assistant reopen` Reopen the issue. - `@home-assistant unassign statistics` Removes the current integration label and assignees on the issue, add the integration domain after the command. - `@home-assistant add-label needs-more-information` Add a label (needs-more-information, problem in dependency, problem in custom component) to the issue. - `@home-assistant remove-label needs-more-information` Remove a label (needs-more-information, problem in dependency, problem in custom component) on the issue.

(message by CodeOwnersMention)


statistics documentation statistics source (message by IssueLinks)

bdraco commented 2 weeks ago

If I read it correctly, it seems to be caused by statistics platform sensors in my case.

Spot on 72 million calls to total_seconds

stats

Screenshot 2024-05-03 at 2 26 47 AM
bdraco commented 2 weeks ago

https://github.com/home-assistant/core/pull/110438 will probably fix it, but I never did get time to get back to iterate on that PR

nijel commented 2 weeks ago

Any clue why this appeared with the 2024.5? The statistics code doesn't seem to be changed recently.

bdraco commented 2 weeks ago

Its been a problem for a while, and seems to come up every time the timings change a bit, see https://github.com/home-assistant/core/issues/110433

I'm not exactly sure what triggers the issue as I've never been able to fully replicate it to the degree you have in your profile.

bdraco commented 2 weeks ago

_initialize_from_database could be refactored to only look at the last state to set the attributes instead of doing the calculations inside of _add_state_to_queue for every previous state and throwing it away.

That would probably help as well

nijel commented 2 weeks ago

In my case, it seems to be caused by 30 minute average_linear statistics for a frequently updated sensor. I should probably look for a better way to achieve the same.

PS: After changing a bit calculation formula for the originating sensor, I'm down to 12M calls of total_seconds and my system no longer runs fans on full power. Still, this is the most CPU intensive operation my Home Assistant does.

ayavilevich commented 2 weeks ago

I am having an issue with statistics since upgrade to 2024.5.0 too.

Here is the CPU graph: image

It starts normal after a restart but then after a few hours gradually becomes CPU over loaded.

I ran the profiler. High CPU places are:

statistics.py:154(_sum) ~:0(<method 'as_integer_ratio' of 'float' objects>) statistics.py:287(_exact_ratio)

image

I use statistics a lot but didn't have an issue before. Going to rollback now.

bdraco commented 2 weeks ago

I think the performance problems could be solved, by breaking out the code into a new object, and then breaking that out into a package, it could also be Cythonized to do all the calculations in native code.

Another alternative would be to rewrite the expensive operations in numpy

myromeo commented 2 weeks ago

callgrind.out.1714751198628950.txt

As requested callgrind file. How do I make sense of this file as I have no idea what I am looking at!

ayavilevich commented 2 weeks ago

callgrind.out.1714751198628950.txt

As requested callgrind file. How do I make sense of this file as I have no idea what I am looking at!

See https://www.home-assistant.io/integrations/profiler

For example, you can use SnakeViz to render it and see what functions consume most of the CPU.

Correction, SnakeViz is for the .cprof. There are other software for the callgrind.

ThomDietrich commented 2 weeks ago

Hey guys, hey @bdraco, this sounds horrible. Any idea what caused that issue? My first guess would be that this might be related to the introduction of state_reported events, however the statistics component doesn't use that one yet. I'm thinking there might have been another change that lead to this.

bdraco commented 2 weeks ago

@ayavilevich

callgrind.out.1714751198628950.txt

As requested callgrind file. How do I make sense of this file as I have no idea what I am looking at!

Screenshot 2024-05-03 at 3 39 21 PM

Looks like this case of many sqlite queries.

Can you turn on info (maybe even debug) logging for sqlalchemy as well?

myromeo commented 2 weeks ago

how do I do that?

bdraco commented 2 weeks ago
service: logger.set_level
data:
  sqlalchemy: info
Screenshot 2024-05-03 at 3 45 42 PM
bdraco commented 2 weeks ago

Hey guys, hey @bdraco, this sounds horrible. Any idea what caused that issue? My first guess would be that this might be related to the introduction of state_reported events, however the statistics component doesn't use that one yet. I'm thinking there might have been another change that lead to this.

I'm not really sure whats going on at this point as it seems all the profiles have a lot of database access going on, but AFAICT _initialize_from_database is only ever called once per entity.

bdraco commented 2 weeks ago

The only thing I can think is the database access is unrelated and because of the busy system otherwise as _initialize_from_database doesn't appear in the profile

bdraco commented 2 weeks ago

Maybe there is some type of race in _next_to_purge_timestamp and its scheduling the next update too soon so its updating over and over again

https://github.com/home-assistant/core/blob/81a269f4f54c039a82f460cad526ff644b65b242/homeassistant/components/statistics/sensor.py#L548

bdraco commented 2 weeks ago
service: logger.set_level
data:
  sqlalchemy: info
  homeassistant.components.statistics: debug

It would be good to get debug logs for statistics as well so we can tell if its the scheduling update firing too often

myromeo commented 2 weeks ago

@bdraco I've set that, what next?

ayavilevich commented 2 weeks ago

I am not sure if that is relevant, but since you mentioned database activity, I would like to add that I am using MariaDB as the database for the recorder. Is that common to others who are having the issue?

bdraco commented 2 weeks ago

@bdraco I've set that, what next?

Please post the log (remove anything sensitive)

dannytsang commented 2 weeks ago

home-assistant.log

bdraco commented 2 weeks ago

Maybe I'm reading this wrong and the time zones aren't aligned, but it looks like it trying to schedule updates in the past which than execute right away

2024-05-03 23:37:20.850 DEBUG (MainThread) [homeassistant.components.statistics.sensor] sensor.home_electricity_power_standard_deviation_over_an_hour: scheduling update at 2024-05-03 23:36:44.809911+00:00

bdraco commented 2 weeks ago

It looks like its scheduling an update to make sure old states are purged but it doesn't actually purge the state so the next update gets scheduled in the past because its expected it had purged the state, and than it tries to do it again and again

bdraco commented 2 weeks ago

I've not had any luck replicating the problem.

Can you post your config for statistics?

bdraco commented 2 weeks ago

https://github.com/home-assistant/core/pull/116743 should eliminate the chance of a race unless there is a bug in the time calculation.

Can be tested with

cd /config
curl -o- -sSL https://gist.githubusercontent.com/bdraco/43f8043cb04b9838383fd71353e99b18/raw/core_integration_pr | bash /dev/stdin -d statistics -p 116743
to4ko commented 2 weeks ago

116743 should eliminate the chance of a race unless there is a bug in the time calculation.

Can be tested with

cd /config
curl -o- -sSL https://gist.githubusercontent.com/bdraco/43f8043cb04b9838383fd71353e99b18/raw/core_integration_pr | bash /dev/stdin -d statistics -p 116743

applied the fix above...its kinda better, but still way not even close to pre-2024.5.x versions

fix applied this morning (red line) image

this is how i was before May - massive cpu heating started this Wednesday May 1. image

dannytsang commented 2 weeks ago

I've not had any luck replicating the problem.

Can you post your config for statistics?

Here are all mine. I use packages so they are not all in a single file (if that makes any difference):

sensor:
  - platform: statistics
    name: "Home Electricity Power Average Over An Hour"
    unique_id: 97f8c377-acd2-4d92-a2f2-6bbf5ca0b110
    entity_id: sensor.growatt_sph_load_power
    state_characteristic: mean
    max_age:
      hours: 1
  - platform: statistics
    name: "Home Electricity Power Average Over A Day"
    unique_id: 74b261eb-6feb-473d-87b4-8eca961be99d
    entity_id: sensor.growatt_sph_load_power
    state_characteristic: mean
    max_age:
      hours: 24
  - platform: statistics
    name: "Home Electricity Power Average Over A Week"
    unique_id: 47f78be2-bea5-499c-926b-50b48f92f202
    entity_id: sensor.growatt_sph_load_power
    state_characteristic: mean
    max_age:
      days: 7
  - platform: statistics
    name: "Home Electricity Power Average Over A Month"
    unique_id: c0b7e702-5468-4cb1-8221-8067cf44c3e1
    entity_id: sensor.growatt_sph_load_power
    state_characteristic: mean
    max_age:
      days: 30
  - platform: statistics
    name: "Home Electricity Power Median Over An Hour"
    unique_id: b9256d7e-504c-43e2-ac7d-f77ed5b69aaa
    entity_id: sensor.growatt_sph_load_power
    state_characteristic: median
    max_age:
      hours: 1
  - platform: statistics
    name: "Home Electricity Power Median Over A Day"
    unique_id: 82ff2fd7-2694-4171-b0ec-3cafc52205cf
    entity_id: sensor.growatt_sph_load_power
    state_characteristic: median
    max_age:
      hours: 24
  - platform: statistics
    name: "Home Electricity Power Median Over A Week"
    unique_id: fe221135-0a10-414b-8954-137ea83aa73f
    entity_id: sensor.growatt_sph_load_power
    state_characteristic: median
    max_age:
      days: 7
  - platform: statistics
    name: "Home Electricity Power Median Over A Month"
    unique_id: 5891e92f-7bbc-4c51-941e-aa005fba93f9
    entity_id: sensor.growatt_sph_load_power
    state_characteristic: median
    max_age:
      days: 30
  - platform: statistics
    name: "Home Electricity Power Standard Deviation Over An Hour"
    unique_id: 3b115d2e-f765-4e47-b53d-03ff61ca6ada
    entity_id: sensor.growatt_sph_load_power
    state_characteristic: standard_deviation
    max_age:
      hours: 1
  - platform: statistics
    name: "Home Electricity Power Standard Deviation Over A Day"
    unique_id: 3a0d219c-2dc3-4ef5-8964-23e739768499
    entity_id: sensor.growatt_sph_load_power
    state_characteristic: standard_deviation
    max_age:
      hours: 24
  - platform: statistics
    name: "Home Electricity Power Standard Deviation Over A Week"
    unique_id: c5a650f5-7d46-4918-a484-485831c7a922
    entity_id: sensor.growatt_sph_load_power
    state_characteristic: standard_deviation
    max_age:
      days: 7
  - platform: statistics
    name: "Home Electricity Power Standard Deviation Over A Month"
    unique_id: 15d61ca0-bf3f-4959-a526-bb4c5b9f851b
    entity_id: sensor.growatt_sph_load_power
    state_characteristic: standard_deviation
    max_age:
      days: 30
  # Boiler Flow
  - platform: statistics
    name: "Flow Temperature Delta Last 24 Hours"
    unique_id: 96bc7959-a867-4d5b-b733-188bcc25c92c
    entity_id: sensor.boiler_delta_temperature
    state_characteristic: mean
    max_age:
      hours: 24
  - platform: statistics
    name: "Flow Temperature Highest Delta"
    unique_id: aaac8076-2bfa-4f8d-825f-2603051f7eb9
    entity_id: sensor.boiler_delta_temperature
    state_characteristic: value_max
    sampling_size: 43200
    max_age:
      days: 30
  - platform: statistics
    name: "Flow Temperature Lowest Delta"
    unique_id: e4febfc7-2d8b-443d-8aae-b9d5d86e6449
    entity_id: sensor.boiler_delta_temperature
    state_characteristic: value_min
    sampling_size: 43200
    max_age:
      days: 30
  - platform: statistics
    name: "Flow Temperature Delta Difference"
    unique_id: 99a1f9d4-b8bb-456d-840a-e3ad0c997215
    entity_id: sensor.boiler_delta_temperature
    state_characteristic: distance_absolute
    sampling_size: 43200
    max_age:
      days: 30
to4ko commented 2 weeks ago

@bdraco looks like your fix is working

image

more detailed last 48hrs - cpu temp got normalized around 60degC and CPU load normalized (little build up and drop at 12^36 to be invetigated deeper) image

bdraco commented 2 weeks ago

There's still a lot to be optimized in this integration, but to be honest, that's very hard to do without access to a system that's experiencing the issue that I could do live debugging on and rapidly iterate as otherwise I'm optimizing for theoretical use cases instead of practical ones

ayavilevich commented 1 week ago

@to4ko @dannytsang @myromeo Are you using the default sqlite database for recorder or do you have something custom like MariaDB or mysql?

myromeo commented 1 week ago

@ayavilevich i use the default SQLite.

dannytsang commented 1 week ago

@ayavilevich MySQL

ayavilevich commented 1 week ago

I've not had any luck replicating the problem.

Can you post your config for statistics?

Attaching my "platform: statistics" entries. Is "platform: history_stats" relevant as well? What about automatic statistics that are generated for the metrics? At "http://server/developer-tools/statistics" Please note that it takes it about 12 hours to ramp up to 100% CPU utilization. As if there is a "CPU leak". So it may be difficult to see this right away.

sensor:
  - platform: statistics
    name: "Pumps Current Mean 10m"
    entity_id: sensor.pumps_current
    state_characteristic: mean
    max_age:
      minutes: 10
  - platform: statistics
    name: "Pumps Current Mean 1d"
    entity_id: sensor.pumps_current
    state_characteristic: mean
    max_age:
      hours: 24
  - platform: statistics
    name: "Clouds mean 8h"
    entity_id: sensor.openweathermap_cloud_coverage
    state_characteristic: mean
    max_age:
      hours: 8
  - platform: statistics
    name: "Wind mean"
    entity_id: sensor.tasmota_wind_analog_range1
    state_characteristic: mean
    max_age:
      minutes: 5
  - platform: statistics
    name: "Wind mean 15m"
    entity_id: sensor.tasmota_wind_analog_range1
    state_characteristic: mean
    max_age:
      minutes: 15
  - platform: statistics
    name: "Wind mean 10m"
    entity_id: sensor.tasmota_wind_analog_range1
    state_characteristic: mean
    max_age:
      minutes: 10
  - platform: statistics
    name: "Wind mean 30m"
    entity_id: sensor.tasmota_wind_analog_range1
    state_characteristic: mean
    max_age:
      minutes: 30
  - platform: statistics
    name: "Wind mean 30s"
    entity_id: sensor.tasmota_wind_analog_range1
    state_characteristic: mean
    max_age:
      seconds: 30
  - platform: statistics
    name: "Wind median"
    entity_id: sensor.tasmota_wind_analog_range1
    state_characteristic: median
    max_age:
      minutes: 5
  - platform: statistics
    name: "Wind min"
    entity_id: sensor.tasmota_wind_analog_range1
    state_characteristic: value_min
    max_age:
      minutes: 5
  - platform: statistics
    name: "Wind max"
    entity_id: sensor.tasmota_wind_analog_range1
    state_characteristic: value_max
    max_age:
      minutes: 5
  - platform: statistics
    name: "Clouds mean 8h"
    entity_id: sensor.openweathermap_cloud_coverage
    state_characteristic: mean
    max_age:
      hours: 8
myromeo commented 1 week ago

Here are my statistics sensors too, fwiw.

`# Calculate 90 day average CGM

Electricity demand stats - 24 hour period

House Temperature (Living Room) stats - 24 hour period

to4ko commented 1 week ago

@to4ko @dannytsang @myromeo Are you using the default sqlite database for recorder or do you have something custom like MariaDB or mysql?

default SQLite

dannytsang commented 1 week ago

116743 should eliminate the chance of a race unless there is a bug in the time calculation.

Can be tested with

cd /config
curl -o- -sSL https://gist.githubusercontent.com/bdraco/43f8043cb04b9838383fd71353e99b18/raw/core_integration_pr | bash /dev/stdin -d statistics -p 116743

This has definitely improved my setup. Thanks @bdraco image

churchofnoise commented 1 week ago

https://github.com/home-assistant/core/pull/116743 should eliminate the chance of a race unless there is a bug in the time calculation.

Can be tested with

cd /config
curl -o- -sSL https://gist.githubusercontent.com/bdraco/43f8043cb04b9838383fd71353e99b18/raw/core_integration_pr | bash /dev/stdin -d statistics -p 116743

Seems to work here too, thank you!

myromeo commented 1 week ago

Also looking good here, nice work.

bdraco commented 1 week ago

Be sure to delete the custom version before updating to 2024.5.2

to4ko commented 1 week ago

Be sure to delete the custom version before updating to 2024.5.2

Should statistics folder be deleted totally or just statistics.py? Please advise for dumb-and-dumber:)

dannytsang commented 1 week ago

Folder.