Closed: myromeo closed this issue 1 week ago
Please post a callgrind file from the profiler
https://my.home-assistant.io/redirect/config_flow_start/?domain=profiler
https://my.home-assistant.io/redirect/developer_call_service/?service=profiler.start
I observe the same symptoms, callgrind is here:
callgrind.out.1714720380542446.txt
If I read it correctly, it seems to be caused by statistics platform sensors in my case.
Hey there @thomdietrich, mind taking a look at this issue as it has been labeled with an integration (statistics) you are listed as a code owner for? Thanks!
(message by CodeOwnersMention)
statistics documentation · statistics source (message by IssueLinks)
If I read it correctly, it seems to be caused by statistics platform sensors in my case.
Spot on
72 million calls to total_seconds
https://github.com/home-assistant/core/pull/110438 will probably fix it, but I never did get time to get back to iterate on that PR
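For context on why `total_seconds` dominates here: the cost is pure call volume on the hot path. A minimal sketch (illustrative only, not the integration's actual code) of the kind of change such a fix aims for — convert each state's timestamp to an epoch float once when it is queued, then compare plain floats instead of building a `timedelta` and calling `total_seconds()` per state per update:

```python
from datetime import datetime, timezone

# Three queued states with the same (illustrative) timestamp.
states = [datetime(2024, 5, 1, tzinfo=timezone.utc)] * 3
now = datetime(2024, 5, 2, tzinfo=timezone.utc)

# Hot-path style: one timedelta object + total_seconds() call per state,
# repeated on every update -- this is what shows up 72 million times.
ages_slow = [(now - ts).total_seconds() for ts in states]

# Cheaper: store epoch floats once per state, then subtract plain floats.
ts_floats = [ts.timestamp() for ts in states]  # done once, at queue time
now_f = now.timestamp()
ages_fast = [now_f - f for f in ts_floats]

assert ages_slow == ages_fast  # same result, far fewer object allocations
```

The result is identical; only the per-update work changes.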
Any clue why this appeared with 2024.5? The statistics code doesn't seem to have changed recently.
It's been a problem for a while, and it seems to come up every time the timings change a bit; see https://github.com/home-assistant/core/issues/110433
I'm not exactly sure what triggers the issue as I've never been able to fully replicate it to the degree you have in your profile.
_initialize_from_database could be refactored to only look at the last state to set the attributes, instead of doing the calculations inside of _add_state_to_queue for every previous state and throwing the result away. That would probably help as well.
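A hypothetical sketch of that refactor idea (the function and names here are illustrative, not the integration's actual API): replay all historical states into the buffer first, then compute the characteristic once at the end, instead of recomputing after every append and discarding the result:

```python
from collections import deque
import statistics

def initialize(states, maxlen=None):
    """Replay historical state values into the buffer, then calculate once.

    The naive approach recalculates the characteristic after each append
    (N appends -> N calculations, N-1 of them thrown away); here we do
    N appends and exactly one calculation.
    """
    buf = deque((float(s) for s in states), maxlen=maxlen)
    return buf, statistics.fmean(buf)  # single calculation at the end

buf, mean = initialize(["1.0", "2.0", "3.0"])
assert mean == 2.0
```

For N historical states this turns O(N) full recalculations into one.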
In my case, it seems to be caused by a 30 minute average_linear statistic for a frequently updated sensor. I should probably look for a better way to achieve the same.
PS: After tweaking the calculation formula for the originating sensor, I'm down to 12M calls to total_seconds and my system no longer runs the fans at full power. Still, this is the most CPU-intensive operation my Home Assistant does.
I am having an issue with statistics since upgrading to 2024.5.0 too.
Here is the CPU graph:
It starts out normal after a restart, but after a few hours the CPU gradually becomes overloaded.
I ran the profiler. High CPU places are:
statistics.py:154(_sum)
~:0(&lt;method 'as_integer_ratio' of 'float' objects&gt;)
statistics.py:287(_exact_ratio)
I use statistics a lot but didn't have an issue before. Going to rollback now.
I think the performance problems could be solved by breaking the code out into a new object, and then breaking that out into a package; it could also be Cythonized to do all the calculations in native code.
Another alternative would be to rewrite the expensive operations in numpy
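As a rough illustration of why the profile above is dominated by `as_integer_ratio` / `_exact_ratio`: the stdlib `statistics.mean` converts every float into an exact ratio for precise summation, while float-native alternatives (stdlib `statistics.fmean`, or numpy for large buffers) stay in plain float arithmetic. A small sketch with made-up sample values:

```python
import statistics

values = [230.0, 231.5, 229.8, 230.2, 231.0]  # illustrative sensor readings

# statistics.mean() routes every value through _exact_ratio /
# float.as_integer_ratio -- exactly the calls dominating the profile.
exact = statistics.mean(values)

# statistics.fmean() (Python 3.8+) stays in float arithmetic and is far
# cheaper; numpy gives a similar win in native code for large buffers.
fast = statistics.fmean(values)

assert abs(exact - fast) < 1e-9  # same answer for typical sensor data
```

For sensor-grade data the precision difference is irrelevant, so a float-native path is essentially free accuracy-wise.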
callgrind.out.1714751198628950.txt
As requested, the callgrind file. How do I make sense of this file? I have no idea what I am looking at!
callgrind.out.1714751198628950.txt
As requested, the callgrind file. How do I make sense of this file? I have no idea what I am looking at!
See https://www.home-assistant.io/integrations/profiler
For example, you can use SnakeViz to render it and see what functions consume most of the CPU.
Correction, SnakeViz is for the .cprof file. There are other tools for the callgrind format.
Hey guys, hey @bdraco, this sounds horrible. Any idea what caused that issue? My first guess would be that this might be related to the introduction of state_reported events, however the statistics component doesn't use that one yet. I'm thinking there might have been another change that led to this.
@ayavilevich
callgrind.out.1714751198628950.txt
As requested, the callgrind file. How do I make sense of this file? I have no idea what I am looking at!
Looks like this is a case of many sqlite queries.
Can you turn on info (maybe even debug) logging for sqlalchemy as well?
how do I do that?
service: logger.set_level
data:
  sqlalchemy: info
Hey guys, hey @bdraco, this sounds horrible. Any idea what caused that issue? My first guess would be that this might be related to the introduction of state_reported events, however the statistics component doesn't use that one yet. I'm thinking there might have been another change that led to this.
I'm not really sure what's going on at this point, as all the profiles seem to have a lot of database access going on, but AFAICT _initialize_from_database is only ever called once per entity.
The only thing I can think is that the database access is unrelated and due to the otherwise busy system, since _initialize_from_database doesn't appear in the profile.
Maybe there is some type of race in _next_to_purge_timestamp and it's scheduling the next update too soon, so it's updating over and over again.
service: logger.set_level
data:
  sqlalchemy: info
  homeassistant.components.statistics: debug
It would be good to get debug logs for statistics as well, so we can tell if it's the scheduled update firing too often.
@bdraco I've set that, what next?
I am not sure if that is relevant, but since you mentioned database activity, I would like to add that I am using MariaDB as the database for the recorder. Is that common to others who are having the issue?
@bdraco I've set that, what next?
Please post the log (remove anything sensitive)
Maybe I'm reading this wrong and the time zones aren't aligned, but it looks like it's trying to schedule updates in the past, which then execute right away.
2024-05-03 23:37:20.850 DEBUG (MainThread) [homeassistant.components.statistics.sensor] sensor.home_electricity_power_standard_deviation_over_an_hour: scheduling update at 2024-05-03 23:36:44.809911+00:00
It looks like it's scheduling an update to make sure old states are purged, but it doesn't actually purge the state, so the next update gets scheduled in the past because it expected the state to already be purged, and then it tries again and again.
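A minimal sketch of that suspected failure mode and the obvious guard, with hypothetical names (this is not the actual integration code): if the computed purge time is already in the past, clamp it to now so the update can't re-fire immediately in a tight loop:

```python
from datetime import datetime, timedelta, timezone

def next_update(oldest_state_ts: datetime, max_age: timedelta,
                now: datetime) -> datetime:
    """Schedule the next purge, never earlier than now.

    Without the max() clamp, a stale oldest state (one that was expected
    to be purged but wasn't) yields a time in the past, so the update
    fires immediately, recomputes the same stale time, and loops.
    """
    naive = oldest_state_ts + max_age  # when the oldest state expires
    return max(naive, now)             # clamp: never schedule in the past

now = datetime(2024, 5, 3, 23, 37, 20, tzinfo=timezone.utc)
stale = datetime(2024, 5, 3, 22, 30, tzinfo=timezone.utc)  # already expired
when = next_update(stale, timedelta(hours=1), now)
assert when == now  # clamped forward instead of firing in the past
```

This matches the log above, where the scheduled time (23:36:44) is earlier than the log timestamp (23:37:20).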
I've not had any luck replicating the problem.
Can you post your config for statistics?
https://github.com/home-assistant/core/pull/116743 should eliminate the chance of a race unless there is a bug in the time calculation.
Can be tested with
cd /config
curl -o- -sSL https://gist.githubusercontent.com/bdraco/43f8043cb04b9838383fd71353e99b18/raw/core_integration_pr | bash /dev/stdin -d statistics -p 116743
116743 should eliminate the chance of a race unless there is a bug in the time calculation.
Can be tested with
cd /config
curl -o- -sSL https://gist.githubusercontent.com/bdraco/43f8043cb04b9838383fd71353e99b18/raw/core_integration_pr | bash /dev/stdin -d statistics -p 116743
Applied the fix above... it's kind of better, but still nowhere near the pre-2024.5.x versions.
Fix applied this morning (red line).
This is how it was before May; the massive CPU heating started this Wednesday, May 1.
I've not had any luck replicating the problem.
Can you post your config for statistics?
Here are all mine. I use packages so they are not all in a single file (if that makes any difference):
sensor:
  - platform: statistics
    name: "Home Electricity Power Average Over An Hour"
    unique_id: 97f8c377-acd2-4d92-a2f2-6bbf5ca0b110
    entity_id: sensor.growatt_sph_load_power
    state_characteristic: mean
    max_age:
      hours: 1
  - platform: statistics
    name: "Home Electricity Power Average Over A Day"
    unique_id: 74b261eb-6feb-473d-87b4-8eca961be99d
    entity_id: sensor.growatt_sph_load_power
    state_characteristic: mean
    max_age:
      hours: 24
  - platform: statistics
    name: "Home Electricity Power Average Over A Week"
    unique_id: 47f78be2-bea5-499c-926b-50b48f92f202
    entity_id: sensor.growatt_sph_load_power
    state_characteristic: mean
    max_age:
      days: 7
  - platform: statistics
    name: "Home Electricity Power Average Over A Month"
    unique_id: c0b7e702-5468-4cb1-8221-8067cf44c3e1
    entity_id: sensor.growatt_sph_load_power
    state_characteristic: mean
    max_age:
      days: 30
  - platform: statistics
    name: "Home Electricity Power Median Over An Hour"
    unique_id: b9256d7e-504c-43e2-ac7d-f77ed5b69aaa
    entity_id: sensor.growatt_sph_load_power
    state_characteristic: median
    max_age:
      hours: 1
  - platform: statistics
    name: "Home Electricity Power Median Over A Day"
    unique_id: 82ff2fd7-2694-4171-b0ec-3cafc52205cf
    entity_id: sensor.growatt_sph_load_power
    state_characteristic: median
    max_age:
      hours: 24
  - platform: statistics
    name: "Home Electricity Power Median Over A Week"
    unique_id: fe221135-0a10-414b-8954-137ea83aa73f
    entity_id: sensor.growatt_sph_load_power
    state_characteristic: median
    max_age:
      days: 7
  - platform: statistics
    name: "Home Electricity Power Median Over A Month"
    unique_id: 5891e92f-7bbc-4c51-941e-aa005fba93f9
    entity_id: sensor.growatt_sph_load_power
    state_characteristic: median
    max_age:
      days: 30
  - platform: statistics
    name: "Home Electricity Power Standard Deviation Over An Hour"
    unique_id: 3b115d2e-f765-4e47-b53d-03ff61ca6ada
    entity_id: sensor.growatt_sph_load_power
    state_characteristic: standard_deviation
    max_age:
      hours: 1
  - platform: statistics
    name: "Home Electricity Power Standard Deviation Over A Day"
    unique_id: 3a0d219c-2dc3-4ef5-8964-23e739768499
    entity_id: sensor.growatt_sph_load_power
    state_characteristic: standard_deviation
    max_age:
      hours: 24
  - platform: statistics
    name: "Home Electricity Power Standard Deviation Over A Week"
    unique_id: c5a650f5-7d46-4918-a484-485831c7a922
    entity_id: sensor.growatt_sph_load_power
    state_characteristic: standard_deviation
    max_age:
      days: 7
  - platform: statistics
    name: "Home Electricity Power Standard Deviation Over A Month"
    unique_id: 15d61ca0-bf3f-4959-a526-bb4c5b9f851b
    entity_id: sensor.growatt_sph_load_power
    state_characteristic: standard_deviation
    max_age:
      days: 30
  # Boiler Flow
  - platform: statistics
    name: "Flow Temperature Delta Last 24 Hours"
    unique_id: 96bc7959-a867-4d5b-b733-188bcc25c92c
    entity_id: sensor.boiler_delta_temperature
    state_characteristic: mean
    max_age:
      hours: 24
  - platform: statistics
    name: "Flow Temperature Highest Delta"
    unique_id: aaac8076-2bfa-4f8d-825f-2603051f7eb9
    entity_id: sensor.boiler_delta_temperature
    state_characteristic: value_max
    sampling_size: 43200
    max_age:
      days: 30
  - platform: statistics
    name: "Flow Temperature Lowest Delta"
    unique_id: e4febfc7-2d8b-443d-8aae-b9d5d86e6449
    entity_id: sensor.boiler_delta_temperature
    state_characteristic: value_min
    sampling_size: 43200
    max_age:
      days: 30
  - platform: statistics
    name: "Flow Temperature Delta Difference"
    unique_id: 99a1f9d4-b8bb-456d-840a-e3ad0c997215
    entity_id: sensor.boiler_delta_temperature
    state_characteristic: distance_absolute
    sampling_size: 43200
    max_age:
      days: 30
@bdraco looks like your fix is working
More detailed, last 48 hrs: CPU temp normalized around 60°C and CPU load normalized (a little build-up and drop at 12:36 to be investigated deeper).
There's still a lot to be optimized in this integration, but to be honest that's very hard to do without access to a system that's experiencing the issue, where I could do live debugging and rapidly iterate; otherwise I'm optimizing for theoretical use cases instead of practical ones.
@to4ko @dannytsang @myromeo Are you using the default sqlite database for recorder or do you have something custom like MariaDB or mysql?
@ayavilevich I use the default SQLite.
@ayavilevich MySQL
I've not had any luck replicating the problem.
Can you post your config for statistics?
Attaching my "platform: statistics" entries. Is "platform: history_stats" relevant as well? What about the automatic statistics that are generated for the metrics at "http://server/developer-tools/statistics"? Please note that it takes about 12 hours to ramp up to 100% CPU utilization, as if there were a "CPU leak", so it may be difficult to see this right away.
sensor:
  - platform: statistics
    name: "Pumps Current Mean 10m"
    entity_id: sensor.pumps_current
    state_characteristic: mean
    max_age:
      minutes: 10
  - platform: statistics
    name: "Pumps Current Mean 1d"
    entity_id: sensor.pumps_current
    state_characteristic: mean
    max_age:
      hours: 24
  - platform: statistics
    name: "Clouds mean 8h"
    entity_id: sensor.openweathermap_cloud_coverage
    state_characteristic: mean
    max_age:
      hours: 8
  - platform: statistics
    name: "Wind mean"
    entity_id: sensor.tasmota_wind_analog_range1
    state_characteristic: mean
    max_age:
      minutes: 5
  - platform: statistics
    name: "Wind mean 15m"
    entity_id: sensor.tasmota_wind_analog_range1
    state_characteristic: mean
    max_age:
      minutes: 15
  - platform: statistics
    name: "Wind mean 10m"
    entity_id: sensor.tasmota_wind_analog_range1
    state_characteristic: mean
    max_age:
      minutes: 10
  - platform: statistics
    name: "Wind mean 30m"
    entity_id: sensor.tasmota_wind_analog_range1
    state_characteristic: mean
    max_age:
      minutes: 30
  - platform: statistics
    name: "Wind mean 30s"
    entity_id: sensor.tasmota_wind_analog_range1
    state_characteristic: mean
    max_age:
      seconds: 30
  - platform: statistics
    name: "Wind median"
    entity_id: sensor.tasmota_wind_analog_range1
    state_characteristic: median
    max_age:
      minutes: 5
  - platform: statistics
    name: "Wind min"
    entity_id: sensor.tasmota_wind_analog_range1
    state_characteristic: value_min
    max_age:
      minutes: 5
  - platform: statistics
    name: "Wind max"
    entity_id: sensor.tasmota_wind_analog_range1
    state_characteristic: value_max
    max_age:
      minutes: 5
  - platform: statistics
    name: "Clouds mean 8h"
    entity_id: sensor.openweathermap_cloud_coverage
    state_characteristic: mean
    max_age:
      hours: 8
Here are my statistics sensors too, fwiw.
# Calculate 90 day average CGM
@to4ko @dannytsang @myromeo Are you using the default sqlite database for recorder or do you have something custom like MariaDB or mysql?
default SQLite
116743 should eliminate the chance of a race unless there is a bug in the time calculation.
Can be tested with
cd /config
curl -o- -sSL https://gist.githubusercontent.com/bdraco/43f8043cb04b9838383fd71353e99b18/raw/core_integration_pr | bash /dev/stdin -d statistics -p 116743
This has definitely improved my setup. Thanks @bdraco
https://github.com/home-assistant/core/pull/116743 should eliminate the chance of a race unless there is a bug in the time calculation.
Can be tested with
cd /config
curl -o- -sSL https://gist.githubusercontent.com/bdraco/43f8043cb04b9838383fd71353e99b18/raw/core_integration_pr | bash /dev/stdin -d statistics -p 116743
Seems to work here too, thank you!
Also looking good here, nice work.
Be sure to delete the custom version before updating to 2024.5.2
Be sure to delete the custom version before updating to 2024.5.2
Should the statistics folder be deleted entirely, or just statistics.py? Please advise for the dumb-and-dumber :)
Folder.
The problem
Updated to 2024.5.0 and noticed an increase in CPU usage (around 5% higher), resulting in CPU temps increasing by 30%+. Rolled back to 2024.4.4 and usage and temps immediately dropped.
What version of Home Assistant Core has the issue?
core-2024.5.0
What was the last working version of Home Assistant Core?
core-2024.4.4
What type of installation are you running?
Home Assistant OS
Integration causing the issue
No response
Link to integration documentation on our website
No response
Diagnostics information
No response
Example YAML snippet
No response
Anything in the logs that might be useful for us?
Additional information
X64, intel i5-4590T on Lenovo m93p tiny.