home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0
73.6k stars 30.76k forks source link

Statistics takes longer than 1.111 seconds at startup #110433

Open Mariusthvdb opened 9 months ago

Mariusthvdb commented 9 months ago

The problem

running debugpy reveals several remarkable timings during setup. writing this up per request.

What version of Home Assistant Core has the issue?

2024.2.x

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant OS

Integration causing the issue

statistics

Link to integration documentation on our website

https://www.home-assistant.io/integrations/statistics/

Diagnostics information

No response

Example YAML snippet

No response

Anything in the logs that might be useful for us?

2024-02-12 18:06:06.833 WARNING (MainThread) [asyncio] Executing <Task finished name='Task-13034' coro=<StatisticsSensor._initialize_from_database() done, defined at /usr/src/homeassistant/homeassistant/components/statistics/sensor.py:534> result=None created at /usr/src/homeassistant/homeassistant/core.py:633> took 1.111 seconds

Additional information

No response

home-assistant[bot] commented 9 months ago

Hey there @thomdietrich, mind taking a look at this issue as it has been labeled with an integration (statistics) you are listed as a code owner for? Thanks!

Code owner commands Code owners of `statistics` can trigger bot actions by commenting: - `@home-assistant close` Closes the issue. - `@home-assistant rename Awesome new title` Renames the issue. - `@home-assistant reopen` Reopen the issue. - `@home-assistant unassign statistics` Removes the current integration label and assignees on the issue, add the integration domain after the command. - `@home-assistant add-label needs-more-information` Add a label (needs-more-information, problem in dependency, problem in custom component) to the issue. - `@home-assistant remove-label needs-more-information` Remove a label (needs-more-information, problem in dependency, problem in custom component) on the issue.

(message by CodeOwnersMention)


statistics documentation statistics source (message by IssueLinks)

bdraco commented 9 months ago

The problem is almost certainly in https://github.com/home-assistant/core/blob/92842c28d212479d9a977b989cbe8b5ca6aee50e/homeassistant/components/statistics/sensor.py#L548

Its calling _add_state_to_queue which for a power sensor that updates every 5s in a 24h period will be 17280 calls.

It likely needs an optimized version for restoring data

bdraco commented 9 months ago

The whole integration is doing a lot of datetime math could likely be all floats instead since we don't need local time

Mariusthvdb commented 9 months ago

forgot to post the yaml configs, sorry, do you need these too, or is it a generic issue

btw, the power sensor updates way more frequently than every 5 seconds... it is per state change of the smart meter, and that can be more than once per second so it seems. guess you can safely multiply by 5 there

so, this is using a template sensor (UI) based on 2 integrations (reading smart meter (mqtt) and solar(mqtt)),

        state: >
         {{expand('sensor.zp_actuele_opbrengst','sensor.netto_verbruik')
           |selectattr('state','is_number')
           |map(attribute='state')
           |map('float')|sum}}

then fed into a filter:

  - unique_id: bruto_stroom_verbruik_filter
    platform: filter
    entity_id: sensor.bruto_stroom_verbruik
    filters:
      - filter: outlier
        window_size: 5
        radius: 1.5
      - filter: time_simple_moving_average
        window_size: '00:00:30'
        precision: 3
      - filter: range
        lower_bound: 0

and finally the statistics:

  - unique_id: standby_power_laatste_24_uur
    platform: statistics
    entity_id: sensor.bruto_stroom_verbruik_filter
    state_characteristic: percentile
    percentile: 30
    max_age:
      hours: 24
    precision: 3
Mariusthvdb commented 9 months ago

also, maybe consider, add a polling/update frequency option. seems unnecessary to update this frequently, once a minute should suffice as absolute minimum, maybe even lower.

think trigger template, or scan_interval:

Aware this has been discussed before, maybe revisit that?

bdraco commented 8 months ago

This one is still a problem even after https://github.com/home-assistant/core/pull/112127 in 2024.4

ThomDietrich commented 8 months ago

Hey guys, this is a difficult issue. Generally the statistics integration is NOT optimized for huge sampling settings. I did always try to highlight that to users, however I can see that the warning did not make it into docs. Of course we can agree that this is not a valid justification or solution.

Recently I didn't have the time to add further improvements but I had planned to implement optimizations per characteristic (e.g. for the max value we need to store only one value, the largest one so far). For the percentile an optimization is not trivial.

( If I may go a bit on a tangent. Some time ago I did discuss the nature of the statistics component with one of the core developers. The component provides timeseries data processing capabilities the main HA database does not provide. Because of that the statistics component needs to maintain its own data storage and implements many typical arithmetic functions. It is my humble opinion that the HA database (a traditional relational database) should be replaced by one of the amazing open source timeseries databases out there. This would solve many core issues (and feature requests) and performance challenges, and the statistics component would be nearly unnecessary. My recommendation was InfluxDB but there might be an even better fit out there. Working on the statistics component always felt like building a house on muddy soil. )

Anyhow. @bdraco back when you worked on https://github.com/home-assistant/core/pull/110438 did you compare performance/memory of an extreme use case?

bdraco commented 8 months ago

Anyhow. @bdraco back when you worked on #110438 did you compare performance/memory of an extreme use case?

I was only looking at improving the performance of the 1 day of date with a sample rate of 5s ~17280 samples