home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0
69.94k stars 29.03k forks source link

Super high CPU usage with 2024.4.1 #115072

Closed ic-dev21 closed 2 months ago

ic-dev21 commented 3 months ago

The problem

Updating my docker container to 2024.4.1 causes CPU to climb to 100% over the span of a few hours. It never goes down. The only fix is to revert back to 2024.3.3.

Something is wrong with this release.

What version of Home Assistant Core has the issue?

2024.4.1

What was the last working version of Home Assistant Core?

2024.3.3

What type of installation are you running?

Home Assistant Container

Integration causing the issue

recorder

Link to integration documentation on our website

No response

Diagnostics information

No response

Example YAML snippet

No response

Anything in the logs that might be useful for us?

Logger: homeassistant.components.recorder.core
Source: components/recorder/core.py:384
integration: Recorder (documentation, issues)
First occurred: 10:08:17 PM (1 occurrences)
Last logged: 10:08:17 PM

The recorder backlog queue reached the maximum size of 563657 events; usually, the system is CPU bound, I/O bound, or the database is corrupt due to a disk problem; The recorder will stop recording events to avoid running out of memory

Additional information

No response

bdraco commented 3 months ago

Install the profiler integration and use the profiler.start service to generate a callgrind

https://www.home-assistant.io/integrations/profiler/#service-profilerstart

Post the callgrind here and we can take a look

bdraco commented 3 months ago
# Example configuration.yaml entry
logger:
  default: info
  logs:
    homeassistant.core: debug

You might want to enable debug logging for core as well as it will show you where the massive amount of events you see in your log entry are coming from.

ic-dev21 commented 3 months ago

Install the profiler integration and use the profiler.start service to generate a callgrind

https://www.home-assistant.io/integrations/profiler/#service-profilerstart

Post the callgrind here and we can take a look

I will have to try this later today as I'm not home to test. I did notice when updating to 2024.4.1 that a mushroom card (HACS) would constantly timeout or reload. I removed this integration and went back to the 2024.4.1 image. If I start racking up CPU cycles again I'll let you know tonight.

ic-dev21 commented 3 months ago
# Example configuration.yaml entry
logger:
  default: info
  logs:
    homeassistant.core: debug

You might want to enable debug logging for core as well as it will show you where the massive amount of events you see in your log entry are coming from.

My install locked up again at 100% CPU after an hour.

This release is broken. Reverting fixes it.

ic-dev21 commented 3 months ago

Adding debug logging for core and profiler.

I run three instances (2 dev for a custom component and 1 prod). Only my prod experiences this. My main dev is on 2024.4.1 with a fully vanilla instance except for the custom component I maintain and doesn't experience this.

My prod has my whole house with 100% native HA components + my custom component.

@bdraco when do I need to run profiler? When CPU is high or anytime?

bdraco commented 3 months ago

Generally you want to do it with both cases so you have a baseline to compare against

ic-dev21 commented 3 months ago

Generally you want to do it with both cases so you have a baseline to compare against

Profiler is running now. Baseline below Archive.zip

ic-dev21 commented 3 months ago

Baseline log home-assistant.log

bdraco commented 3 months ago

Both of those look OK which is a good baseline to compare with when the problem does happen

ic-dev21 commented 3 months ago

Both of those look OK which is a good baseline to compare with when the problem does happen

As luck would have it it's running super stable at the moment. I've tried upgrading multiple times over the last few days with the same result every time. Full system lockup within an hour or two with python3 pegging 1 CPU at 100%. Will keep the logs going for a while still.

ic-dev21 commented 3 months ago

In between this test and the last one, I did update zigbee2mqtt. Perhaps this may help with troublehsooting.

ic-dev21 commented 3 months ago

Woke up this morning to a completely locked up and unresponsive system. Had to hard reset the host as it was unresponsive. No profiler but here's the huge log. home-assistant.log.zip

Reverting back again. This release is definitely broken for me. Woke up to a cold house as no automation for the heat worked.

ic-dev21 commented 3 months ago

Woke up this morning to a completely locked up and unresponsive system. Had to hard reset it. No profiler but here's the huge log. home-assistant.log.zip

Reverting back again. This release is definitely broken for me. Woke up to a cold house as no automation for the heat worked.

@bdraco have you guys changed how automations speak to the event bus? Looking at this it seems one of my automations that has worked fine for years is going haywire and triggering every millisecond.

Whatever it is you guys did needs reversing. This is a native automation, using a native HA component.

That automation is due to your native gree component losing connectivity to the AC so I wait for it to become available again and resend the command it missed.

Automation is below:

alias: Raise temperature error check
description: ""
trigger:
  - platform: state
    entity_id:
      - automation.raise_temperature
    attribute: last_triggered
    id: Automation Start
    for:
      hours: 0
      minutes: 0
      seconds: 0
  - platform: state
    entity_id:
      - automation.vero_6_8
    attribute: last_triggered
    id: 6-8
  - platform: state
    entity_id:
      - climate.climatisation
    from: unavailable
    to: heat
    id: Unavailable
  - platform: state
    entity_id:
      - automation.raise_temperature_end
    attribute: last_triggered
    id: Automation End
    for:
      hours: 0
      minutes: 0
      seconds: 0
condition: []
action:
  - choose:
      - conditions:
          - condition: trigger
            id:
              - Automation Start
        sequence:
          - if:
              - condition: or
                conditions:
                  - condition: state
                    entity_id: climate.climatisation
                    state: unavailable
                  - condition: numeric_state
                    entity_id: climate.climatisation
                    below: 23
                    attribute: temperature
                  - condition: state
                    entity_id: input_boolean.raise_temperature_on
                    state: "off"
            then:
              - wait_for_trigger: []
                timeout:
                  hours: 0
                  minutes: 5
                  seconds: 0
                  milliseconds: 0
                continue_on_timeout: false
              - service: automation.trigger
                metadata: {}
                data:
                  skip_condition: false
                target:
                  entity_id: automation.raise_temperature
        alias: Raise Temperature Start
      - conditions:
          - condition: trigger
            id:
              - 6-8
        sequence: []
      - conditions:
          - condition: trigger
            id:
              - Automation End
        sequence:
          - if:
              - condition: or
                conditions:
                  - condition: state
                    entity_id: climate.climatisation
                    state: unavailable
                  - condition: numeric_state
                    entity_id: climate.climatisation
                    attribute: temperature
                    above: 22
                  - condition: state
                    entity_id: input_boolean.raise_temperature_on
                    state: "on"
            then:
              - wait_for_trigger: []
                timeout:
                  hours: 0
                  minutes: 5
                  seconds: 0
                  milliseconds: 0
                continue_on_timeout: false
              - service: automation.trigger
                metadata: {}
                data:
                  skip_condition: true
                target:
                  entity_id: automation.raise_temperature_end
        alias: Raise Temperature End
mode: single

Nothing fancy.

ic-dev21 commented 3 months ago

@bdraco Would your timer fix do anything for me? I can give it a try in 2024.4.2 I'll be home for a while and can maybe run a profiler if the host goes nuts again.

This one : https://github.com/home-assistant/core/pull/115110

bdraco commented 3 months ago

I have yet to fully step through it, though. My day job is hectic right now, so I only have a little time for HA. I'll take another look before the weekend if I get free cycles. From a cursory review, I don't think so since your automation doesn't look like you have any until or while conditions.

However, your automation is calling several other automations, so please post those as well.

ic-dev21 commented 3 months ago

I have yet to fully step through it, though. My day job is hectic right now, so I only have a little time for HA. I'll take another look before the weekend if I get free cycles. From a cursory review, I don't think so since your automation doesn't look like you have any until or while conditions.

However, your automation is calling several other automations, so please post those as well.

Here you go. First one:

alias: Raise Temperature
description: ""
trigger:
  - platform: numeric_state
    entity_id: climate.thermostat_cuisine
    for:
      hours: 0
      minutes: 15
      seconds: 0
    below: 20.6
    attribute: current_temperature
    id: LowTemp Thermostat
  - platform: state
    entity_id:
      - binary_sensor.puissance_thermostat_en_haut
    to: "on"
    id: Puissance bool
  - platform: numeric_state
    entity_id: sensor.thermostat_cuisine_power
    above: 1500
    id: Puissance
  - platform: numeric_state
    entity_id:
      - sensor.sensorsaloncuisine_temperature
    for:
      hours: 0
      minutes: 15
      seconds: 0
    below: 21.8
    id: Sensor
  - platform: state
    entity_id:
      - sensor.defi_hilo
    id: Défi
    to: pre_heat
  - platform: state
    entity_id:
      - input_boolean.heat_boost_start
    to: "on"
    id: Manuel
condition: []
action:
  - choose:
      - conditions:
          - condition: trigger
            id:
              - Manuel
              - Sensor
          - condition: or
            conditions:
              - condition: time
                before: "22:00:00"
                after: "05:30:00"
              - condition: state
                entity_id: sensor.defi_hilo
                state: pre_heat
          - condition: not
            conditions:
              - condition: state
                entity_id: sensor.season
                state: summer
              - condition: state
                entity_id: input_boolean.raise_temperature_on
                state: "on"
              - condition: state
                entity_id: climate.climatisation
                state: cool
          - condition: or
            conditions:
              - condition: state
                entity_id: person.vero
                state: home
              - condition: state
                entity_id: person.ian_couture
                state: home
              - condition: state
                entity_id: input_boolean.prechauffe_pm_debutee
                state: "on"
              - condition: state
                entity_id: input_boolean.gardienne_chouchoune
                state: "on"
              - condition: state
                entity_id: input_boolean.prechauffe_pm_debutee_permissive
                state: "on"
          - condition: or
            conditions:
              - condition: state
                entity_id: input_boolean.chauffage_chambres
                state: "on"
              - condition: state
                entity_id: input_boolean.chauffage
                state: "on"
          - condition: numeric_state
            entity_id: climate.thermostat_cuisine
            attribute: temperature
            above: 19
        sequence:
          - service: climate.set_temperature
            data:
              temperature: >-
                {{max(state_attr('climate.climatisation',
                'current_temperature'), 22) | int + 2}}
            target:
              device_id: 64bfa1752510b1baf9d7130e80a621ed
          - service: climate.set_preset_mode
            data:
              preset_mode: boost
            target:
              device_id: 64bfa1752510b1baf9d7130e80a621ed
          - service: input_boolean.turn_on
            data: {}
            target:
              entity_id: input_boolean.raise_temperature_on
          - delay:
              hours: 0
              minutes: 5
              seconds: 0
              milliseconds: 0
          - service: input_boolean.turn_off
            data: {}
            target:
              entity_id: input_boolean.heat_boost_start
          - service: climate.set_preset_mode
            data:
              preset_mode: none
            target:
              device_id: 64bfa1752510b1baf9d7130e80a621ed
      - conditions:
          - condition: trigger
            id:
              - LowTemp Thermostat
              - Puissance
              - Défi
              - Puissance bool
          - condition: or
            conditions:
              - condition: time
                before: "22:00:00"
                after: "05:30:00"
              - condition: state
                entity_id: sensor.defi_hilo
                state: pre_heat
          - condition: not
            conditions:
              - condition: state
                entity_id: sensor.season
                state: summer
              - condition: state
                entity_id: input_boolean.raise_temperature_on
                state: "on"
              - condition: state
                entity_id: climate.climatisation
                state: cool
          - condition: or
            conditions:
              - condition: state
                entity_id: person.vero
                state: home
              - condition: state
                entity_id: person.ian_couture
                state: home
              - condition: state
                entity_id: input_boolean.prechauffe_pm_debutee
                state: "on"
              - condition: state
                entity_id: input_boolean.gardienne_chouchoune
                state: "on"
              - condition: state
                entity_id: input_boolean.prechauffe_pm_debutee_permissive
                state: "on"
          - condition: or
            conditions:
              - condition: state
                entity_id: input_boolean.chauffage_chambres
                state: "on"
              - condition: state
                entity_id: input_boolean.chauffage
                state: "on"
          - condition: numeric_state
            entity_id: climate.thermostat_cuisine
            attribute: temperature
            above: 19
        sequence:
          - service: climate.set_temperature
            data:
              temperature: >-
                {{max(state_attr('climate.climatisation',
                'current_temperature'), 22) | int + 2}}
            target:
              device_id: 64bfa1752510b1baf9d7130e80a621ed
          - service: climate.set_preset_mode
            data:
              preset_mode: boost
            target:
              device_id: 64bfa1752510b1baf9d7130e80a621ed
          - service: input_boolean.turn_on
            data: {}
            target:
              entity_id: input_boolean.raise_temperature_on
          - delay:
              hours: 0
              minutes: 5
              seconds: 0
              milliseconds: 0
          - service: input_boolean.turn_off
            data: {}
            target:
              entity_id: input_boolean.heat_boost_start
          - service: climate.set_preset_mode
            data:
              preset_mode: none
            target:
              device_id: 64bfa1752510b1baf9d7130e80a621ed
mode: single

Second one:

alias: Raise temperature end
description: Retour à la normale
trigger:
  - platform: state
    entity_id:
      - binary_sensor.temperature_atteinte
    to: "on"
    from: "off"
    for:
      hours: 0
      minutes: 5
      seconds: 0
  - platform: numeric_state
    entity_id: sensor.thermostat_cuisine_temperature
    above: 21.5
    for:
      hours: 0
      minutes: 5
      seconds: 0
  - platform: time
    at: "21:59:00"
  - platform: state
    entity_id:
      - binary_sensor.puissance_thermostat_en_haut
    to: "off"
    for:
      hours: 0
      minutes: 5
      seconds: 0
  - platform: state
    entity_id:
      - sensor.defi_hilo
    to: reduction
  - platform: state
    entity_id:
      - input_boolean.heat_boost_stop
    to: "on"
  - platform: numeric_state
    entity_id:
      - sensor.sensorsaloncuisine_temperature
    for:
      hours: 0
      minutes: 5
      seconds: 0
    above: 22.5
condition:
  - condition: state
    entity_id: input_boolean.raise_temperature_on
    state: "on"
action:
  - service: climate.set_temperature
    data:
      temperature: 22
    target:
      device_id: 64bfa1752510b1baf9d7130e80a621ed
  - service: input_boolean.turn_off
    data: {}
    target:
      entity_id:
        - input_boolean.raise_temperature_on
        - input_boolean.prechauffe_pm_debutee
        - input_boolean.heat_boost_stop
  - service: climate.set_preset_mode
    data:
      preset_mode: none
    target:
      entity_id: climate.climatisation
mode: single
ic-dev21 commented 3 months ago

I have yet to fully step through it, though. My day job is hectic right now, so I only have a little time for HA. I'll take another look before the weekend if I get free cycles. From a cursory review, I don't think so since your automation doesn't look like you have any until or while conditions.

However, your automation is calling several other automations, so please post those as well.

Updated to 2024.4.2 and will be actively monitoring it. Will try and run profiler before it locks up if it does.

mekaneck commented 3 months ago

You have a loop in your first automation; once automation.raise_temperature is triggered by your automation, it will continue triggering as fast as the system can loop.

You have two of these blocks in your first automation:

          - wait_for_trigger: []
            timeout:
              hours: 0
              minutes: 5
              seconds: 0
              milliseconds: 0
            continue_on_timeout: false

This block does nothing since you don't specify a trigger. I just tested this, and HA will interpret a null trigger as one that triggered and so the automation will continue immediately without delay.

Were there formerly triggers specified here?

ic-dev21 commented 3 months ago

It should be the unavailable trigger id in there.

Le lun. 8 avr. 2024 à 19:02, Rick Auch @.***> a écrit :

You have a loop in your first automation; once automation.raise_temperature is triggered by your automation, it will continue triggering as fast as the system can loop.

You have two of these blocks in your first automation:

      - wait_for_trigger: []
        timeout:
          hours: 0
          minutes: 5
          seconds: 0
          milliseconds: 0
        continue_on_timeout: false

This block does nothing since you don't specify a trigger. I just tested this, and HA will interpret a null trigger as one that triggered and so the automation will continue immediately without delay.

Were there formerly triggers specified here?

— Reply to this email directly, view it on GitHub https://github.com/home-assistant/core/issues/115072#issuecomment-2043775925, or unsubscribe https://github.com/notifications/unsubscribe-auth/AZZGCFP6NDYO5YMP22OGHVTY4MOX7AVCNFSM6AAAAABF25RMP2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANBTG43TKOJSGU . You are receiving this because you authored the thread.Message ID: @.***>

richard-scott commented 3 months ago
Screenshot 2024-04-09 at 17 30 38

I upgraded to 2024.4 and had to restore from a backup as my mobile device UI is unusable. Going back to the previous release all worked for a bit and then I decided to try an upgrade again. I'm currently running 2024.4.2.

richard-scott commented 3 months ago

I think I've found why my system was slow. but I have no idea what caused it. My fix was adding more RAM. With no changes other than an upgrade to 2024.4.0 I've had to bump up the RAM allocated to my VM from 8GB to 16GB. When I have it set to 8GB I now have around 5GB of swap space being used!

richard-scott commented 3 months ago

My issues have all been caused by a bug in the timer logic in 2024.4.x releases. The issue for me is due to having a timer as a trigger in an automation. When the timer changes from active to idle the automation is triggered. The trouble is, I need to add a delay of 1 second after the timer went idle before triggering otherwise HA keeps triggering the automation.

trigger:
  - platform: state
    entity_id:
      - timer.office_lights
    for:
      hours: 0
      minutes: 0
      seconds: 1
    to: idle
    from: active
richard-scott commented 3 months ago
Screenshot 2024-04-10 at 11 55 54

You can see in this pic, the drop on CPU demand at around 11:41 when I made the change.

I've still yet to figure out why it now has a 90% CPU demand around every 15min tho :-(

mekaneck commented 3 months ago

I need to add a delay of 1 second after the timer went idle before triggering otherwise HA keeps triggering the automation.

trigger:
  - platform: state
    entity_id:
      - timer.office_lights
    for:
      hours: 0
      minutes: 0
      seconds: 1
    to: idle
    from: active

Sounds like you also have a loop. Can you share the full automation?

2024.4 has included some significant improvements to speed, and therefore is uncovering some issues with users' incorrectly programmed automations & scripts where loops without interrupts exist.

richard-scott commented 3 months ago

I think I found it. In one part of my routine, I call timer.cancel on the office_lights timer. This seems to send the timer to active and back to idle again even when the timer was already idle. That was triggering the automation again.

alias: Automation - Office Occupancy
description: ""
trigger:
  - platform: state
    entity_id:
      - sensor.office_motion_sensor
    to: "1"
  - platform: state
    entity_id:
      - timer.office_lights
    for:
      hours: 0
      minutes: 0
      seconds: 0
    to: idle
    from: active
condition: []
action:
  - if:
      - condition: state
        entity_id: input_boolean.office_occupied
        state: "off"
    then:
      - service: input_boolean.turn_on
        metadata: {}
        data: {}
        target:
          entity_id:
            - input_boolean.office_occupied
    else:
      - if:
          - condition: state
            entity_id: timer.office_lights
            state: idle
        then:
          - service: input_boolean.turn_off
            metadata: {}
            data: {}
            target:
              entity_id: input_boolean.office_occupied
  - if:
      - condition: state
        entity_id: input_boolean.office_occupied
        state: "on"
    then:
      - if:
          - condition: state
            entity_id: timer.livingroom_occupied
            state: active
        then:
          - service: timer.cancel
            target:
              entity_id: timer.livingroom_occupied
            data: {}
      - if:
          - condition: time
            after: "00:00:00"
            before: "06:00:00"
            weekday:
              - sun
              - sat
              - fri
              - thu
              - wed
              - tue
              - mon
        then:
          - service: timer.start
            metadata: {}
            data:
              duration: "0:1:0"
            target:
              entity_id: timer.office_lights
        else:
          - service: timer.start
            metadata: {}
            data: {}
            target:
              entity_id: timer.office_lights
    else:
      - service: timer.finish
        metadata: {}
        data: {}
        target:
          entity_id: timer.office_lights
mode: single
ic-dev21 commented 3 months ago

@bdraco I think the leading theory here is that the way timers are handled change and would cause loops.

I corrected my automation by creating another automation to act as a trigger in my wait statement. This seems to have fixed the issue.

Would you like me to close the issue or would you like to use it to improve documentation? I can try my hand at better documenting this and making a PR for it if you want, just point me in the general direction of where.

richard-scott commented 3 months ago

Hmmm interesting, can you guess when I updated to the latest HASSOS v12.2

Screenshot 2024-04-12 at 09 33 18
issue-triage-workflows[bot] commented 2 months ago

There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates. Please make sure to update to the latest Home Assistant version and check if that solves the issue. Let us know if that works for you by adding a comment 👍 This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.