home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0

Issue after DST time change [update to 2021.10.7+ or 2021.11.0b4+ recommended] #58783

Closed by chneau 3 years ago

chneau commented 3 years ago

The problem

In the UK on 2021/10/31, at 01:59:59, the time went back to 01:00:00 (summer to winter, daylight saving). Since then (it's 01:08 now), Home Assistant has had high CPU usage, using one core at 100%.

CONTAINER ID   NAME      CPU %     MEM USAGE / LIMIT     MEM %     NET I/O   BLOCK I/O         PIDS
42985e0497d4   hass      104.53%   251.7MiB / 7.658GiB   3.21%     0B / 0B   103MB / 1.77MB    15

Edit: memory usage seems to increase quickly:

at 01:14:00

CONTAINER ID   NAME      CPU %     MEM USAGE / LIMIT     MEM %     NET I/O   BLOCK I/O         PIDS
42985e0497d4   hass      104.93%   703.1MiB / 7.658GiB   8.97%     0B / 0B   112MB / 1.98MB    16

Edit2: Switching lights works fine, but the changes do not appear in the state history of the light.

What version of Home Assistant Core has the issue?

core-2021.10.6

REPOSITORY                     TAG       IMAGE ID       CREATED         SIZE
homeassistant/home-assistant   stable    e0a45773808a   12 days ago     1.14GB

I could not find the exact image ID on Docker Hub, but here is the label section of docker inspect:

"io.hass.arch": "amd64",
"io.hass.base.arch": "amd64",
"io.hass.base.image": "homeassistant/amd64-base:3.14",
"io.hass.base.name": "python",
"io.hass.base.version": "2021.09.1",
"io.hass.type": "core",
"io.hass.version": "2021.10.6",
"org.opencontainers.image.authors": "The Home Assistant Authors",
"org.opencontainers.image.created": "2021-10-18 06:34:53+00:00",
"org.opencontainers.image.description": "Open-source home automation platform running on Python 3",
"org.opencontainers.image.documentation": "https://www.home-assistant.io/docs/",
"org.opencontainers.image.licenses": "Apache License 2.0",
"org.opencontainers.image.source": "https://github.com/home-assistant/core",
"org.opencontainers.image.title": "Home Assistant",
"org.opencontainers.image.url": "https://www.home-assistant.io/",
"org.opencontainers.image.version": "2021.10.6"

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant Container

Integration causing the issue

No response

Link to integration documentation on our website

No response

Example YAML snippet

No response

Anything in the logs that might be useful for us?

Interesting: The recorder queue reached the maximum size of 30000

2021-10-30T09:40:10.884156228Z 2021-10-30 10:40:10 WARNING (MainThread) [homeassistant.components.websocket_api.http.connection] [139778345277952] Disconnected: Did not receive auth message within 10 seconds
2021-10-30T09:40:22.323961416Z 2021-10-30 10:40:22 WARNING (MainThread) [homeassistant.components.webhook] Received message for unregistered webhook c9fa7b5955dcce6df0ec16e14a28b23623563b96373bc5a66c0413c418093008 from 192.168.1.117
2021-10-31T01:03:30.660640416Z 2021-10-31 01:03:30 ERROR (MainThread) [homeassistant.components.recorder] The recorder queue reached the maximum size of 30000; Events are no longer being recorded
2021-10-31T01:04:57.487128770Z [cont-finish.d] executing container finish scripts...
2021-10-31T01:04:57.489476430Z [cont-finish.d] done.

At 2021-10-31T01:03:30.660640416Z I restarted the container to see whether that would fix the issue; it did not.



Additional information

Maybe after 02:00:00 it will stop?
Everything is working properly: light switches, the mobile phone app is working properly, the website served by the container (server:8123) is working properly.
Restarting the container or restarting the PC does not solve the high CPU usage.

mdeweerd commented 3 years ago

@mdeweerd just to be clear, you write that "I have an unreachable HA system". Did it not start working again after a manual restart?

It's remote from my current location. I do not often go there, the idea is to control some stuff remotely, like heating the place up before I go there.

I can't connect (log in) to the HA setup: the Web UI is not working and I have not set up SSH connectivity yet.

mdeweerd commented 3 years ago

(regarding the watchdog) I got it. But at the same time other needs should be fulfilled in order to trace the issue back. Currently even HA logs are reset by each restart. Keep in mind that HA devs are not prone to develop anything without a detailed issue description. I could imagine the DST issue would never get traction if all impacted instances had silently restarted themselves.

HA devs added the possibility to correct historical data in Developer Tools > Statistics. As far as I understood from the release video for 2021.10, they saw a lot of discussions providing SQL queries to users on how to do that, and that is not what they want. I can't speak for them, but I think the goal is that users of HA do not need to be specialists. Having access to the logs already requires quite some configuration in itself (file editor, terminal or SSH access).

I think it is better to keep a system going rather than having it out of service until the user notices it and can intervene.
The members of my family will quickly request to remove all that domotic s**t and go back to plain old household equipment management. It's already difficult to introduce it.

Home Assistant log files are rotated, so in principle you can find the one before the restart.
A notification could inform the user about the restart and suggest reporting it by providing the relevant log file.
The supervisor could also log some information about the reason for the restart.
I am sure that supervision will be based on options, and not everybody has a supervisor (I would be setting up monit if I weren't using HA OS).

For now, I'll be going to a cold place that requires about 1 hour to heat up 😞 .

mdeweerd commented 3 years ago

Has the data gap between 1 AM and 2 AM been fixed?

This is most likely a side effect of the recorder being overrun; the hourly statistics for 01:00~02:00 are compiled after 02:00, when things had already gone south.

Improving the supervisor to detect this kind of problem makes sense. Please open an issue here: https://github.com/home-assistant/supervisor/issues

Well, the first time it's 2 AM, the clock has not been shifted back yet; the problem presumably occurs when the clock shifts back from 3 AM to 2 AM. Or does the fix imply that the issue already happens the first time it's 2 AM (when the statistics for 1 AM to 2 AM are collected, not for 2 AM to 3 AM becoming 2 AM)?

ChristophCaina commented 3 years ago

Has the data gap between 1 AM and 2 AM been fixed?

This is most likely a side effect of the recorder being overrun; the hourly statistics for 01:00~02:00 are compiled after 02:00, when things had already gone south. Improving the supervisor to detect this kind of problem makes sense. Please open an issue here: https://github.com/home-assistant/supervisor/issues

Well, the first time it's 2 AM, the clock has not been shifted back yet; the problem presumably occurs when the clock shifts back from 3 AM to 2 AM. Or does the fix imply that the issue already happens the first time it's 2 AM (when the statistics for 1 AM to 2 AM are collected, not for 2 AM to 3 AM becoming 2 AM)?

I think the issue can be described as follows:

The system is generating statistics (short term and long term) between 02:00 and 03:00, and when the clock jumps back to 02:00 there are already statistics available for that period... (and probably stats that lie in the future from the system's point of view)...

emontnemery commented 3 years ago

The system is generating statistics (short term and long term) between 02:00 and 03:00, and when the clock jumps back to 02:00 there are already statistics available for that period... (and probably stats that lie in the future from the system's point of view)...

That's not the case: statistics timestamps are in UTC, not local time. There might be something else going on here, possibly a frontend bug. Could some of you with a hole in the statistics as a result of this bug please share a dump of the statistics tables for, let's say, October 31st 00:00 ~ October 31st 04:00 local time?
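To illustrate the UTC point, here is a minimal Python sketch (not Home Assistant code). It assumes the Europe/Paris zone and Python 3.9+ with `zoneinfo`; `to_utc` is a made-up helper. The same local wall-clock time, 02:30 on October 31st, occurs twice, and the `fold` attribute selects which occurrence is meant. Each occurrence maps to a different UTC instant, so UTC-keyed rows for the two hours would not overlap.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

PARIS = ZoneInfo("Europe/Paris")  # assumed zone, for illustration only


def to_utc(local_naive: datetime, fold: int) -> datetime:
    """Interpret a naive Paris wall-clock time and convert it to UTC.

    fold=0 picks the first occurrence of an ambiguous time (still summer time),
    fold=1 picks the second occurrence (already winter time).
    """
    return local_naive.replace(tzinfo=PARIS, fold=fold).astimezone(timezone.utc)


ambiguous = datetime(2021, 10, 31, 2, 30)  # happens twice on that night
first = to_utc(ambiguous, fold=0)
second = to_utc(ambiguous, fold=1)

print("first occurrence :", first.isoformat())
print("second occurrence:", second.isoformat())
```

The two printed instants are one hour apart even though the local clock reading is identical.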

emontnemery commented 3 years ago

FWIW, this will allow the supervisor to check the health of the recorder: https://github.com/home-assistant/core/pull/58989

emontnemery commented 3 years ago

I could imagine the DST issue would never get traction if all impacted instances had silently restarted themselves.

A silent restart once a year would be unwanted but not blocking, IMHO.

mdeweerd commented 3 years ago

As I am setting up SSH on the system that I could not access remotely, I discover that the add-on allows installation of 'apks'. I added monit and it was installed. Then I added nmap to scan addresses, and it was installed as well ☺️.

So that should allow me to:

So I'll be doing some 'poor man's monitoring' on my systems 😄. Once done, I'll share my configuration on the forums.

borpin commented 3 years ago

Home Assistant log files are rotated, so in principle you can find the one before the restart.

Well, apart from the .1 log, they never appear in the UI to look at.

Both Core and Supervisor logs need to be available as well, and not just the last one but several. All my other systems have at least a week's worth of logs available.

borpin commented 3 years ago

As I am setting up SSH on the system that I could not access remotely, I discover that the add-on allows installation of 'apks'.

Interesting, can you expand - possibly on the forum? I'd love to have monit installed :)

emontnemery commented 3 years ago

@mdeweerd did you diagnose why HA is unreachable for you? The supervisor watchdog is contacting HA over https, and will force it to restart if it doesn't reply. Was something else, nginx for example, killed or starved by HA going crazy?

mdeweerd commented 3 years ago

@emontnemery In the meantime I am on location; I power-cycled the system and it's up. nginx was still up but evidently could not contact Home Assistant itself. The last line in the log just shows what other users reported:

2021-10-31 02:02:33 ERROR (MainThread) [homeassistant.components.recorder] The recorder queue reached the maximum size of 30000; Events are no longer being recorded

I provided more information about my tests in this comment.

Maybe the supervisor was satisfied with the reply from nginx, which did indicate an error but was still a reply.
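The distinction can be sketched in a few lines of Python. This is not the Supervisor's actual logic, which has not been checked here; `any_reply_is_ok` and `only_success_is_ok` are hypothetical names. A lenient check is satisfied by any HTTP reply, including a proxy error page, while a strict check accepts only a 2xx status.

```python
from typing import Optional


def any_reply_is_ok(status: Optional[int]) -> bool:
    """Lenient check: healthy as soon as any HTTP reply came back."""
    return status is not None


def only_success_is_ok(status: Optional[int]) -> bool:
    """Strict check: healthy only when the reply has a 2xx status."""
    return status is not None and 200 <= status < 300


# None models a timeout; 502 models a proxy that is up but cannot reach the backend.
for status in (200, 502, None):
    print(status, any_reply_is_ok(status), only_success_is_ok(status))
```

With the lenient variant a 502 from the proxy counts as "alive", so no restart would be triggered.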

@borpin The following adds monit and nmap. I also tried sqlite3, but then the "Terminal + SSH" add-on did not start correctly. So some packages will work, others not.

authorized_keys:  []
apks:
  - monit
  - nmap
password: ''
server:
  tcp_forwarding: false

mdeweerd commented 3 years ago

@emontnemery Here is the statistics data from two HA systems for times close to the time change. datemissingBeforeTimeChange.zip

The hour missing in statistics is UTC 2021-10-30 23:00 to UTC 2021-10-30 23:59, or local time (Paris) 2021-10-31 01:00 to 2021-10-31 01:59, which is the second hour before the time change at 3:00 local time.
As said, I would understand that the time change at 3 AM would create an issue with the data from UTC 2021-10-31 00:00 to UTC 2021-10-31 00:59, which is the hour preceding the time change, but I find it strange that it impacts the hour before that.

Home system

1 hour missing in statistics:

INSERT INTO statistics VALUES(44003,'2021-10-30 23:00:10.471482',56,'2021-10-30 22:00:00.000000',NULL,NULL,NULL,'2021-10-27 20:45:39.640551',4.0999999999996896704,32.448960000001861204);
INSERT INTO statistics VALUES(44004,'2021-10-31 01:00:12.683680',1,'2021-10-31 00:00:00.000000',55.999999999999999999,55.999999999999999999,55.999999999999999999,NULL,NULL,NULL);

1 hour missing in statistics_short_term:

INSERT INTO statistics_short_term VALUES(225453,'2021-10-30 23:55:10.351732','2021-10-30 23:50:00.000000',NULL,NULL,NULL,NULL,6786.2500000000000001,207.34799999999995634,54);
INSERT INTO statistics_short_term VALUES(225454,'2021-10-31 01:00:12.341185','2021-10-31 00:55:00.000000',3373.9999999999999999,3373.9999999999999999,3373.9999999999999999,NULL,NULL,NULL,21);

Remote system.

1 hour missing in statistics:

INSERT INTO statistics VALUES(46709,'2021-10-30 23:00:10.753588',78,'2021-10-30 22:00:00.000000',NULL,NULL,NULL,NULL,0.0,0.0);
INSERT INTO statistics VALUES(46710,'2021-10-31 01:00:32.360636',7,'2021-10-31 00:00:00.000000',81.999999999999999998,81.999999999999999998,81.999999999999999998,NULL,NULL,NULL);

One hour gap in statistics_short_term:

INSERT INTO statistics_short_term VALUES(277072,'2021-10-30 23:55:10.651738','2021-10-30 23:50:00.000000',NULL,NULL,NULL,NULL,437.896000000000015,20.564000000000021372,41);
INSERT INTO statistics_short_term VALUES(277073,'2021-10-31 01:00:19.469394','2021-10-31 00:55:00.000000',0.0,0.0,0.0,NULL,NULL,NULL,42);
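A gap like the ones above can be located mechanically. The following is a rough Python sketch, assuming the hour-aligned value in each `statistics` row is the period start in UTC (the column names are not shown in the dump, so this is a guess); `find_missing_hours` is a made-up helper, not part of Home Assistant.

```python
from datetime import datetime, timedelta


def find_missing_hours(starts: list[str]) -> list[datetime]:
    """Return the hourly period starts that are absent between consecutive rows."""
    parsed = sorted(datetime.strptime(s, "%Y-%m-%d %H:%M:%S.%f") for s in starts)
    missing = []
    for prev, nxt in zip(parsed, parsed[1:]):
        expected = prev + timedelta(hours=1)
        # Walk forward hour by hour until the next recorded row is reached.
        while expected < nxt:
            missing.append(expected)
            expected += timedelta(hours=1)
    return missing


# Hour-aligned values copied from the two `statistics` rows of the home system.
starts = ["2021-10-30 22:00:00.000000", "2021-10-31 00:00:00.000000"]
missing = find_missing_hours(starts)
print(missing)
```

For the two rows shown, the only missing period start is 2021-10-30 23:00 (UTC under the assumption above).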

Time zone information:

1 hour before the time change

Epoch timestamp: 1635638400 Timestamp in milliseconds: 1635638400000 Date and time (GMT): Sunday 31 October 2021 00:00:00 Date and time (your time zone): dimanche 31 octobre 2021 02:00:00 GMT+02:00

Just before the time change

Epoch timestamp: 1635641999 Timestamp in milliseconds: 1635641999000 Date and time (GMT): Sunday 31 October 2021 00:59:59 Date and time (your time zone): dimanche 31 octobre 2021 02:59:59 GMT+02:00

At the time change.

Epoch timestamp: 1635642000 Timestamp in milliseconds: 1635642000000 Date and time (GMT): Sunday 31 October 2021 01:00:00 Date and time (your time zone): dimanche 31 octobre 2021 02:00:00 GMT+01:00
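The three epoch values can be cross-checked with a short Python sketch, assuming Python 3.9+ with `zoneinfo` and the Europe/Paris zone; `describe` is a hypothetical helper name.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

PARIS = ZoneInfo("Europe/Paris")  # assumed zone, matching the listing above


def describe(epoch: int) -> tuple[str, str]:
    """Return the UTC and Paris renderings of an epoch timestamp."""
    utc = datetime.fromtimestamp(epoch, tz=timezone.utc)
    local = utc.astimezone(PARIS)
    return (
        utc.strftime("%Y-%m-%d %H:%M:%S"),
        local.strftime("%Y-%m-%d %H:%M:%S %z"),
    )


for epoch in (1635638400, 1635641999, 1635642000):
    print(epoch, *describe(epoch))
```

The local offset switches from +0200 to +0100 between the second and third value, which is the moment the local clock reads 02:00 for the second time.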

borpin commented 3 years ago

The following adds monit and nmap

@mdeweerd - how do you configure monit and access it then? Is this documented anywhere? Cheers for this :)

mdeweerd commented 3 years ago

@borpin I created a topic on the forum - it's better to continue that discussion there.