home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0
73.4k stars 30.65k forks source link

High CPU and memory usage under 0.108.0 #33866

Closed McGiverGim closed 4 years ago

McGiverGim commented 4 years ago

The problem

Since the installation of HA version 0.108.0 the CPU and memory usage in my system grows. I have restarted several times (HA and the raspberry) but this does not fix nothing.

This graph shows the difference from 0.107.7 to 0.108.0:

image

You can see the low CPU in blue and the memory stable at about 30% in red until the installation of 0.108.0. Since then the memory and the CPU grows in the time. I have restarted it several times without luck, as you see in the graph.

My first idea is to go back to 0.107.7 but I prefer to comment this here because maybe there is a bug and I can help giving information.

Environment

Problem-relevant configuration.yaml

I don't know what to attach. My complete configuration?

Traceback/Error logs

I can't see any error, only the Brother that can't find the powered off printer, but this was the same in 0.107.7.

2020-04-09 08:39:56 ERROR (MainThread) [homeassistant.components.brother] Error fetching brother data: No SNMP response received before timeout
2020-04-09 08:39:56 WARNING (MainThread) [homeassistant.config_entries] Config entry for brother not ready yet. Retrying in 80 seconds.

Additional information

I have some custom integration, but they were there in earlier versions too:

McGiverGim commented 4 years ago

I have bad news... Mine seems has started to eat memory, without the Brother integration 😭

McGiverGim commented 4 years ago

I have bad news... Mine seems has started to eat memory, without the Brother integration 😭

Maybe is a different problem now... the memory grow but the CPU remained stable, and seeing the processes inside the homeassistant container I can see two homeassistant processes: image

I have restarted the server again, the second process has disappeared and now all seems stable again... I will inform later.

McGiverGim commented 4 years ago

Three hours later I can confirm is smooth and stable, so is clear that the problem is the Brother integration or is more general and has something to do with sensors powered off.

ayufan commented 4 years ago

I re-enabled Brother. I see increasing CPU-usage. This seems overall fine, the message:

Apr 16 15:55:22 hass hass[593]: 2020-04-16 15:55:22 ERROR (MainThread) [homeassistant.components.brother] Error fetching brother data: No SNMP response received before timeout
Apr 16 15:55:22 hass hass[593]: 2020-04-16 15:55:22 WARNING (MainThread) [homeassistant.config_entries] Config entry for brother not ready yet. Retrying in 80 seconds.
Apr 16 15:57:10 hass hass[593]: 2020-04-16 15:57:10 ERROR (MainThread) [homeassistant.components.brother] Error fetching brother data: No SNMP response received before timeout
Apr 16 15:57:10 hass hass[593]: 2020-04-16 15:57:10 WARNING (MainThread) [homeassistant.config_entries] Config entry for brother not ready yet. Retrying in 80 seconds.
Apr 16 15:58:59 hass hass[593]: 2020-04-16 15:58:59 ERROR (MainThread) [homeassistant.components.brother] Error fetching brother data: No SNMP response received before timeout
Apr 16 15:58:59 hass hass[593]: 2020-04-16 15:58:59 WARNING (MainThread) [homeassistant.config_entries] Config entry for brother not ready yet. Retrying in 80 seconds.

OK. I enabled debug logs on brother integration:

Apr 16 16:00:48 hass hass[593]: 2020-04-16 16:00:48 ERROR (MainThread) [homeassistant.components.brother] Error fetching brother data: No SNMP response received before timeout
Apr 16 16:00:48 hass hass[593]: 2020-04-16 16:00:48 DEBUG (MainThread) [homeassistant.components.brother] Finished fetching brother data in 28.083 seconds
Apr 16 16:00:48 hass hass[593]: 2020-04-16 16:00:48 WARNING (MainThread) [homeassistant.config_entries] Config entry for brother not ready yet. Retrying in 80 seconds.
Apr 16 16:00:51 hass hass[593]: 2020-04-16 16:00:51 DEBUG (MainThread) [homeassistant.components.brother] Finished fetching brother data in 27.596 seconds
Apr 16 16:00:52 hass hass[593]: 2020-04-16 16:00:52 DEBUG (MainThread) [homeassistant.components.brother] Finished fetching brother data in 27.597 seconds
Apr 16 16:00:53 hass hass[593]: 2020-04-16 16:00:53 DEBUG (MainThread) [homeassistant.components.brother] Finished fetching brother data in 27.593 seconds
Apr 16 16:00:53 hass hass[593]: 2020-04-16 16:00:53 DEBUG (MainThread) [homeassistant.components.brother] Finished fetching brother data in 27.411 seconds
Apr 16 16:00:58 hass hass[593]: 2020-04-16 16:00:58 DEBUG (MainThread) [homeassistant.components.brother] Finished fetching brother data in 27.752 seconds
Apr 16 16:00:59 hass hass[593]: 2020-04-16 16:00:59 DEBUG (MainThread) [homeassistant.components.brother] Finished fetching brother data in 27.459 seconds
Apr 16 16:00:59 hass hass[593]: 2020-04-16 16:00:59 DEBUG (MainThread) [homeassistant.components.brother] Finished fetching brother data in 27.287 seconds
Apr 16 16:00:59 hass hass[593]: 2020-04-16 16:00:59 DEBUG (MainThread) [homeassistant.components.brother] Finished fetching brother data in 27.107 seconds
Apr 16 16:00:59 hass hass[593]: 2020-04-16 16:00:59 DEBUG (MainThread) [homeassistant.components.brother] Finished fetching brother data in 26.936 seconds
Apr 16 16:00:59 hass hass[593]: 2020-04-16 16:00:59 DEBUG (MainThread) [homeassistant.components.brother] Finished fetching brother data in 26.749 seconds
Apr 16 16:00:59 hass hass[593]: 2020-04-16 16:00:59 DEBUG (MainThread) [homeassistant.components.brother] Finished fetching brother data in 26.577 seconds
Apr 16 16:00:59 hass hass[593]: 2020-04-16 16:00:59 DEBUG (MainThread) [homeassistant.components.brother] Finished fetching brother data in 26.428 seconds
Apr 16 16:00:59 hass hass[593]: 2020-04-16 16:00:59 DEBUG (MainThread) [homeassistant.components.brother] Finished fetching brother data in 26.259 seconds
Apr 16 16:00:59 hass hass[593]: 2020-04-16 16:00:59 DEBUG (MainThread) [homeassistant.components.brother] Finished fetching brother data in 26.084 seconds
Apr 16 16:00:59 hass hass[593]: 2020-04-16 16:00:59 DEBUG (MainThread) [homeassistant.components.brother] Finished fetching brother data in 25.909 seconds
Apr 16 16:00:59 hass hass[593]: 2020-04-16 16:00:59 DEBUG (MainThread) [homeassistant.components.brother] Finished fetching brother data in 25.813 seconds
Apr 16 16:00:59 hass hass[593]: 2020-04-16 16:00:59 DEBUG (MainThread) [homeassistant.components.brother] Finished fetching brother data in 25.647 seconds
Apr 16 16:00:59 hass hass[593]: 2020-04-16 16:00:59 DEBUG (MainThread) [homeassistant.components.brother] Finished fetching brother data in 25.400 seconds
Apr 16 16:00:59 hass hass[593]: 2020-04-16 16:00:59 DEBUG (MainThread) [homeassistant.components.brother] Finished fetching brother data in 25.208 seconds
Apr 16 16:00:59 hass hass[593]: 2020-04-16 16:00:59 DEBUG (MainThread) [homeassistant.components.brother] Finished fetching brother data in 25.031 seconds
Apr 16 16:00:59 hass hass[593]: 2020-04-16 16:00:59 DEBUG (MainThread) [homeassistant.components.brother] Finished fetching brother data in 24.863 seconds
Apr 16 16:00:59 hass hass[593]: 2020-04-16 16:00:59 DEBUG (MainThread) [homeassistant.components.brother] Finished fetching brother data in 24.691 seconds
Apr 16 16:00:59 hass hass[593]: 2020-04-16 16:00:59 DEBUG (MainThread) [homeassistant.components.brother] Finished fetching brother data in 24.521 seconds
Apr 16 16:00:59 hass hass[593]: 2020-04-16 16:00:59 DEBUG (MainThread) [homeassistant.components.brother] Finished fetching brother data in 24.351 seconds
Apr 16 16:00:59 hass hass[593]: 2020-04-16 16:00:59 DEBUG (MainThread) [homeassistant.components.brother] Finished fetching brother data in 24.183 seconds
Apr 16 16:00:59 hass hass[593]: 2020-04-16 16:00:59 DEBUG (MainThread) [homeassistant.components.brother] Finished fetching brother data in 24.013 seconds
Apr 16 16:00:59 hass hass[593]: 2020-04-16 16:00:59 DEBUG (MainThread) [homeassistant.components.brother] Finished fetching brother data in 23.843 seconds
Apr 16 16:00:59 hass hass[593]: 2020-04-16 16:00:59 DEBUG (MainThread) [homeassistant.components.brother] Finished fetching brother data in 23.679 seconds

This does not seem right, it seems that there's a number of enqueued watchers, but it could be as well a number of sensors watching on a data.

However, this also leads me to the conclusion that brother sensors might enqueue a number of new queries against a coordinator, as usually they would be fast to respond, but it is not true. Now, this creates a case where, we have 20 listeners (each sensor, 20 is a some random number number, as I don't know how many are there). Now, each of the sensors wants to update its state, so it enqueues await. Now, this times out, resulting each sensor (as each sensor did enqueue async op) to update all listeners.

@bieniu I wonder about above :)

bieniu commented 4 years ago

After 4 hours from turning off the printer:

image

image

ayufan commented 4 years ago

@bieniu I wonder, I noticed that it might have a difference in behaviour between if during startup brother is up or not. I was mostly testing that brother was down while starting HASS, so it was shutdown before the HASS restart. Maybe, if you shutdown it while it was HASS is running, some data might be still cached.

bieniu commented 4 years ago

@ayufan Brother sensors don't pool for data. They have should_pool property set to False: https://github.com/home-assistant/core/blob/dev/homeassistant/components/brother/sensor.py#L128-L131

Integration uses DataUpdateCoordinator class and those log entries are from this class.

brother library uses timeout 2 sec and retries 10 times so fetching data about 20 seconds when device is offline is normal.

McGiverGim commented 4 years ago

In not too sure what this message implies, but they seem too much for the same second. I can confirm my system continues stable after several hours without brother integration.

ayufan commented 4 years ago

@bieniu

As far as I can see, they do request data from coordinator which might enqueue multiple data requests. Now, since coordinator registers listeners it is likely that each sensor might request a new await. I have no idea if this is problem with Brother or the DataUpdateCoordinator, but symptoms are as here: https://github.com/home-assistant/core/issues/33866#issuecomment-614741425. It seems also off that we seem to consume a multiple results at the same time, this does mean that multiple awaits are enqueued and finish at the same time.

Now, after re-testing again with Brother power off, HASS restarted I clearly see an increasing in CPU usage, something that is not happening when Brother integration is disabled: image.

Can you re-test on your side with the following conditions? For reference I run HASS on armv7l, so it might take significantly longer to see increasing CPU usage on much more beefy processors (like amd64).

bieniu commented 4 years ago

OK guys, I have to admit that something is wrong. I added _LOGGER.debug(f"update: {self}") to _async_update_data method of BrotherDataUpdateCoordinator class. When I start HA when the printer is offline after some time I have in log:

2020-04-16 22:49:43 DEBUG (MainThread) [homeassistant.components.brother] update: <homeassistant.components.brother.BrotherDataUpdateCoordinator object at 0x7f332346adc0>
2020-04-16 22:50:04 DEBUG (MainThread) [homeassistant.components.brother] update: <homeassistant.components.brother.BrotherDataUpdateCoordinator object at 0x7f32e6e6bc70>
2020-04-16 22:50:04 DEBUG (MainThread) [homeassistant.components.brother] update: <homeassistant.components.brother.BrotherDataUpdateCoordinator object at 0x7f335794db50>
2020-04-16 22:50:56 DEBUG (MainThread) [homeassistant.components.brother] update: <homeassistant.components.brother.BrotherDataUpdateCoordinator object at 0x7f32e6ab8160>

Four different coordinator objects are trying to update the data from the printer. I think that every attempt to configure brother integrations adds new coordinator object. Each coordinator object is trying to update the data from the printer and it takes memory and CPU time.

EDIT: This piece of code is to blame for that: https://github.com/home-assistant/core/blob/dev/homeassistant/components/brother/__init__.py#L33-L37 Each integration that uses DataUpdateCoordinator works the same way.

McGiverGim commented 4 years ago

Good catch @bieniu ! Let's go for the fix! 😁

bieniu commented 4 years ago

Fix: https://github.com/home-assistant/core/pull/34317

choeflake commented 4 years ago

Ok, but I have high CPU and no brother integration. So I think the problem is more general. My log is full of messages about updating components (see above). My setup uses MQTT and Zibee2MQTT plus Node Red.

chosten commented 4 years ago

The fix applies to all integration using DataUpdateCoordinator. Hopefully it will fix our problem too.

ayufan commented 4 years ago

I updated my install. Will monitor :) Thanks @bieniu :)

So far graphs are significantly more healthy.

D43m0n666 commented 4 years ago

I have the Brother integration, but I think remember I installed it after the problem, but I will try later without it. @D43m0n666 waiting for your test...

Disabling Brother integration it seems ok with latest release 108.5, FOR NOW! I can give you better news in 24H, thanks @ayufan for suggestion!

Yes, i confirm problem was Brother integration. Now it's ok, i've also updated to latest version!

Thanks to all!

bieniu commented 4 years ago

@D43m0n666 This was a Data Update Coordinator issue and affected all integrations using it.

ayufan commented 4 years ago

So far, everything seems stable.

On Fri, Apr 17, 2020 at 2:47 PM Maciej Bieniek notifications@github.com wrote:

@D43m0n666 https://github.com/D43m0n666 This was a Data Update Coordinator issue and affected all integrations using it.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/home-assistant/core/issues/33866#issuecomment-615224695, or unsubscribe https://github.com/notifications/unsubscribe-auth/AASOSQJVFNCGHPB7I26WNKDRNBFUXANCNFSM4MEP6JRQ .

ayufan commented 4 years ago

It is definitely a magnitude better, but for me it still seems that there's some performance impact of having and not having Brother enabled:

As you can see baseline did jump from ~4.5% to ~6% of CPU usage (given that this is not the only task running there it is in fact 33% increase in CPU usage) after upgrade from 0.108.4 to 0.108.8. The only other difference is I enabled Brother integration.

image

I'm running another test now. I disabled and lets see the difference in next 12/18 hours :)

andriej commented 4 years ago

1,5% percent of difference between having and not having snmp queries in background... what hardware are you on where it matters? ;-)

choeflake commented 4 years ago

Upgraded tot 0.108.6, seems to fix a lot. My CPU is now around 5%. In my opinion this is not 'low', but way better than the previous 60%. Take note that I have a 'Intel(R) Core(TM) i7-3632QM CPU @ 2.20GHz' cpu. The 5 min load shows 0.05.

ayufan commented 4 years ago

@andriej @bieniu

Given that this machine is: my router, dual wifi with 5 SSIDs, firewall, VPN server, HASS (with 100 entities), zigbee2mqtt (with 40 entities), Prometheus scraping 15 targets, MQTT, esphome frontend, and pihole it is doing pretty well with that average.

There is a clearly noticeable performance regression when running brother integration with a printer being unreachable. It is very likely that it was as that before, but it still does not make sense to use so much resources by service where there's in fact no data processing.

I will look how performant is brother module :)

Screenshot 2020-04-18 at 12 56 25

EDIT:

It takes around 140ms for me to perform a single cycle of brother.async_update for unavailable device. This time is spend solely on CPU.

EDIT 2:

The problem is related for the cost of running snmp_engine = hlapi.SnmpEngine(). This requires parsing all OIDs database it seems. This takes takes that CPU time. I see around 130ms being spend on just this call.

EDIT 3:

This happens as the whole OID database seems to be imported each time: https://github.com/etingof/pysnmp/blob/master/pysnmp/entity/engine.py#L99.

Likely, it would be better to re-use hlapi.SnmpEngine() across different printers, and initialize it once, instead of each time when requesting data. This is also probably why we saw so siginificant performance regression, as more and more of hlapi.SnmpEngine() ops were queued, and each of them took significant amount of time.

It seems that a single snmp_engine object takes around 40kB in memory, so likely this is fully acceptable given the very high cost of initialization. Considering, that having 130ms initialization time each time to fetch data creates a random latency spikes in responsiveness of Home Assistant, due to GIL of CPython :)

EDIT 4:

Initing snmp_engine once results in the check being instantaneous.

Alessandro1981 commented 4 years ago

Updating to 108.6 fix the issue on my installation: memory and CPU are as expected

McGiverGim commented 4 years ago

All is ok for me. It seems that my cpu is maybe a little bigger (maybe 1%) but I don't know if this can be addressed or where the origin is.

McGiverGim commented 4 years ago

After two days os testing, all seems stable but I see my memory a little high. The home assistant docker is taking more than 400M. It has been taken more memory slowly. The memory total is about 50% (Raspberry Pi 2Gb).

image

Usually I had less memory used, but I'm always changing things, so maybe is simply normal but I comment about this here to see if more people has noticed something. As I say it seems stable and I have no problems.

Gunth commented 4 years ago

Yes, it seems that i have the same issue, memory and swap is growing to 100% for both after 2 days on my RPi3b+

eddetollenaer commented 4 years ago

I have the same problem. I see a clear increase of memory usage every hour during the night when not much is going on rather than one automation that determines the max day temperature every 15 minutes. I already set the recorder to only record the automations and two sensors because the recorder send to eat up a lot of memory.

D94C92AD-B0B4-41A4-8D05-BFA3A535FBC9

bullshitduckdnsaccount commented 4 years ago

I had this same issue as well, starting in 0.108.0 and the changes in 0.108.6 did fix it for me.

The issue then returned in 0.109.0 and has consisted through 0.110.x. Both CPU and RAM will creep up and then eventually bring HA to a standstill until rebooted.

norbertomartins commented 4 years ago

I have the version 0.110.3 and have the same issue. I created a trigger to reboot if memory is below 110 for more than 2 minutes. Sometimes this occurs during the night.

homeassistant_low_memory

andriej commented 4 years ago

Ain't it google backup addon doing it's job?

bdraco commented 4 years ago

Please post a py-spy recording https://github.com/benfred/py-spy

py-spy record --pid 200 --output issue33866.svg --duration 120

You'll have to change the pid to your local instance's pid.

stale[bot] commented 4 years ago

There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates. Please make sure to update to the latest Home Assistant version and check if that solves the issue. Let us know if that works for you by adding a comment 👍 This issue now has been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.