home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0

HASS Core freezing and unstable #1757

Closed Luc3as closed 8 years ago

Luc3as commented 8 years ago

Feature requests should go in the forum: https://community.home-assistant.io/c/feature-requests

Home Assistant release (hass --version): 0.16.1

Python release (python3 --version): Python 3.4.2

Component/platform: Home Assistant core

Description of problem: My Home Assistant is really unstable and freezes quite often. I have a very basic setup with just 2 components, as you can see in the attached config. After a restart it works normally, but after several hours it freezes and nothing updates anymore. I cannot find any particular event in the log that could be causing the freezing. I know there was a similar problem with MQTT sensors, but I don't have any MQTT in my config.

Expected:

Problem-relevant configuration.yaml entries and steps to reproduce:

homeassistant:
  # Name of the location where Home Assistant is running
  name: Porubčanovci
  # C for Celsius, F for Fahrenheit
  temperature_unit: C
  # Pick yours from here: http://en.wikipedia.org/wiki/List_of_tz_database_time_zones
  time_zone: Europe/Bratislava

  # Location required to calculate the time the sun rises and sets
  latitude: 48.985949
  longitude: 18.2209636

http:
  api_password: password

# View all events in a logbook
#logbook:

# Checks for available updates
updater:

# Enables support for tracking state changes over time.
history:

# Discover some devices automatically
discovery:

# Enables the frontend
frontend:

# Allows you to issue voice commands from the frontend
conversation:

# Show links to resources in log and frontend
#introduction:

# Track the sun
sun:

device_tracker 1:
  platform: asuswrt
  host: 192.168.1.10
  username: admin
  password: password

  # If new discovered devices are tracked by default (default: yes)
  track_new_devices: yes
  # Seconds between each scan for new devices (default: 12)
  interval_seconds: 120
  # Seconds to wait till marking someone as not home after not being seen
  # (default: 180)
  consider_home: 360

sensor 1:
  platform: systemmonitor
  resources:
    - type: memory_use_percent
    - type: processor_use
    - type: since_last_boot

Traceback (if applicable):

-- Logs begin at Wed 2016-04-06 15:07:00 CEST. --
Apr 09 10:21:46 luc3as-ha hass[22232]: WARNING:homeassistant.core:WorkerPool:Current job from 20:40:31 08-04-2016: (<function _handle_get_api_stream.<locals>.forward_events at 0xb60da030>, <Event state_changed[L]: old_state=<state sun.sun=below_horizon; next_setting=17:30:33 09-04-2016, elevation=-11.67, next_rising=04:06:26 09-04-2016, friendly_name=Sun @ 19:29:04 08-04-2016>, new_state=<state sun.sun=below_horizon; next_setting=17:30:33 09-04-2016, elevation=-11.82, next_rising=04:06:26 09-04-2016, friendly_name=Sun @ 19:29:04 08-04-2016>, entity_id=sun.sun>)
Apr 09 10:21:46 luc3as-ha hass[22232]: WARNING:homeassistant.core:WorkerPool:Current job from 20:41:00 08-04-2016: (<function _handle_get_api_stream.<locals>.forward_events at 0xb60da030>, <Event state_changed[L]: old_state=<state sensor.cpu_use=9; unit_of_measurement=%, icon=mdi:memory, friendly_name=CPU Use @ 20:40:30 08-04-2016>, new_state=<state sensor.cpu_use=14; unit_of_measurement=%, icon=mdi:memory, friendly_name=CPU Use @ 20:41:00 08-04-2016>, entity_id=sensor.cpu_use>)
Apr 09 10:21:46 luc3as-ha hass[22232]: WARNING:homeassistant.core:WorkerPool:Current job from 20:41:00 08-04-2016: (<function _handle_get_api_stream.<locals>.forward_events at 0xb60da030>, <Event state_changed[L]: old_state=<state sensor.since_last_boot=3 days, 23:03:44.607476; icon=mdi:clock, friendly_name=Since Last Boot @ 20:40:30 08-04-2016>, new_state=<state sensor.since_last_boot=3 days, 23:04:14.563010; icon=mdi:clock, friendly_name=Since Last Boot @ 20:41:00 08-04-2016>, entity_id=sensor.since_last_boot>)
Apr 09 10:21:46 luc3as-ha hass[22232]: WARNING:homeassistant.core:WorkerPool:Current job from 20:41:30 08-04-2016: (<function _handle_get_api_stream.<locals>.forward_events at 0xb60da030>, <Event state_changed[L]: old_state=<state sensor.cpu_use=14; unit_of_measurement=%, icon=mdi:memory, friendly_name=CPU Use @ 20:41:00 08-04-2016>, new_state=<state sensor.cpu_use=10; unit_of_measurement=%, icon=mdi:memory, friendly_name=CPU Use @ 20:41:30 08-04-2016>, entity_id=sensor.cpu_use>)
Apr 09 10:21:46 luc3as-ha hass[22232]: WARNING:homeassistant.core:WorkerPool:Current job from 20:41:30 08-04-2016: (<function _handle_get_api_stream.<locals>.forward_events at 0xb60da030>, <Event state_changed[L]: old_state=<state sensor.since_last_boot=3 days, 23:04:14.563010; icon=mdi:clock, friendly_name=Since Last Boot @ 20:41:00 08-04-2016>, new_state=<state sensor.since_last_boot=3 days, 23:04:44.611603; icon=mdi:clock, friendly_name=Since Last Boot @ 20:41:30 08-04-2016>, entity_id=sensor.since_last_boot>)
Apr 09 10:21:46 luc3as-ha hass[22232]: WARNING:homeassistant.core:WorkerPool:Current job from 20:41:31 08-04-2016: (<function _handle_get_api_stream.<locals>.forward_events at 0xb60da030>, <Event state_changed[L]: old_state=<state sun.sun=below_horizon; next_setting=17:30:33 09-04-2016, elevation=-11.82, next_rising=04:06:26 09-04-2016, friendly_name=Sun @ 19:29:04 08-04-2016>, new_state=<state sun.sun=below_horizon; next_setting=17:30:33 09-04-2016, elevation=-11.97, next_rising=04:06:26 09-04-2016, friendly_name=Sun @ 19:29:04 08-04-2016>, entity_id=sun.sun>)
Apr 09 10:21:46 luc3as-ha hass[22232]: WARNING:homeassistant.core:WorkerPool:Current job from 20:42:00 08-04-2016: (<function _handle_get_api_stream.<locals>.forward_events at 0xb60da030>, <Event state_changed[L]: old_state=<state sensor.cpu_use=10; unit_of_measurement=%, icon=mdi:memory, friendly_name=CPU Use @ 20:41:30 08-04-2016>, new_state=<state sensor.cpu_use=12; unit_of_measurement=%, icon=mdi:memory, friendly_name=CPU Use @ 20:42:00 08-04-2016>, entity_id=sensor.cpu_use>)
Apr 09 10:21:46 luc3as-ha hass[22232]: WARNING:homeassistant.core:WorkerPool:Current job from 20:42:00 08-04-2016: (<function _handle_get_api_stream.<locals>.forward_events at 0xb60da030>, <Event state_changed[L]: old_state=<state sensor.since_last_boot=3 days, 23:04:44.611603; icon=mdi:clock, friendly_name=Since Last Boot @ 20:41:30 08-04-2016>, new_state=<state sensor.since_last_boot=3 days, 23:05:14.561402; icon=mdi:clock, friendly_name=Since Last Boot @ 20:42:00 08-04-2016>, entity_id=sensor.since_last_boot>)
Apr 09 10:21:46 luc3as-ha hass[22232]: WARNING:homeassistant.core:WorkerPool:Current job from 20:42:30 08-04-2016: (<function _handle_get_api_stream.<locals>.forward_events at 0xb60da030>, <Event state_changed[L]: old_state=<state sensor.since_last_boot=3 days, 23:05:14.561402; icon=mdi:clock, friendly_name=Since Last Boot @ 20:42:00 08-04-2016>, new_state=<state sensor.since_last_boot=3 days, 23:05:44.579096; icon=mdi:clock, friendly_name=Since Last Boot @ 20:42:30 08-04-2016>, entity_id=sensor.since_last_boot>)
Apr 09 10:21:46 luc3as-ha hass[22232]: WARNING:homeassistant.core:WorkerPool:Current job from 20:42:31 08-04-2016: (<function _handle_get_api_stream.<locals>.forward_events at 0xb60da030>, <Event state_changed[L]: old_state=<state sun.sun=below_horizon; next_setting=17:30:33 09-04-2016, elevation=-11.97, next_rising=04:06:26 09-04-2016, friendly_name=Sun @ 19:29:04 08-04-2016>, new_state=<state sun.sun=below_horizon; next_setting=17:30:33 09-04-2016, elevation=-12.12, next_rising=04:06:26 09-04-2016, friendly_name=Sun @ 19:29:04 08-04-2016>, entity_id=sun.sun>)

Apr 09 10:22:04 luc3as-ha hass[22232]: INFO:homeassistant.components.api:Found broken event stream to 192.168.1.140, cleaning up
Apr 09 10:22:21 luc3as-ha hass[22232]: INFO:homeassistant.components.http:"GET /api/history/period?filter_entity_id=device_tracker.androidc5c1560b6f56546b HTTP/1.1" 200 -
Apr 09 10:23:00 luc3as-ha hass[22232]: INFO:netdisco.service:Scanning
Apr 09 10:23:41 luc3as-ha hass[22232]: INFO:homeassistant.components.http:"GET /api/history/period?filter_entity_id=sensor.ram_use HTTP/1.1" 200 -
Apr 09 10:23:47 luc3as-ha hass[22232]: INFO:homeassistant.components.http:"GET /api/history/period?filter_entity_id=sensor.cpu_use HTTP/1.1" 200 -
Apr 09 10:23:54 luc3as-ha hass[22232]: INFO:homeassistant.components.http:"GET /api/history/period?filter_entity_id=sensor.since_last_boot HTTP/1.1" 200 -
Apr 09 10:25:28 luc3as-ha hass[22232]: INFO:homeassistant.components.http:"GET /api/error_log HTTP/1.1" 200 -
Apr 09 10:28:18 luc3as-ha hass[22232]: INFO:netdisco.service:Scanning

Additional info:

balloob commented 8 years ago

It's weird that the API would get stuck writing to the browser. Which browser do you use?

Luc3as commented 8 years ago

I think it is not writing to the database. After restarting, the states are simply frozen. It's not that I can't see the current states; the current states aren't being fetched from the devices, or at least aren't being written to the DB. I am using the newest Chrome, and the results are the same for Chrome on Android.

balloob commented 8 years ago

Can it be that you run out of hard drive space?

Luc3as commented 8 years ago

Nope, there is 4 GB of free space. It runs on a Raspberry Pi, and there is literally just HA running: no desktop, no other apps or services.

KmanOz commented 8 years ago

LOL I am suffering from the exact same issue. I have eliminated everything that I thought was causing the problems and put it down to maybe because I have about 25 devices (Either Orvibo or mqtt switches) but it seems it doesn't matter how big or small the setup is. I often get this -

16-05-09 15:21:42 homeassistant.core: WorkerPool:All 20 threads are busy and 121 jobs pending

I did have my debian server running from a CF image but have moved it to a HD because I thought that could be an issue. I would sometimes get the message above with 17K jobs pending. When that happens HA is unresponsive.

I thought it was just me, glad it's not.

Peter

balloob commented 8 years ago

It's fine to have 121 jobs pending. Sometimes Home Assistant gets behind on the queue but it will catch up. Especially with 25 devices and a bunch of groups this is not unexpected. 17k jobs however is a problem and that means that something somewhere did not return control of the thread.
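
The difference between a queue that is merely behind and one that never drains can be shown with a minimal Python sketch. This is not Home Assistant's actual WorkerPool code; `run_demo` and `stuck_job` are hypothetical names, and the pool is a plain `queue.Queue` with daemon threads. When every worker is stuck in a job that does not return, newly queued jobs just accumulate, and they drain only once the stuck jobs finish.

```python
import queue
import threading


def run_demo(num_workers=3, quick_jobs=10):
    """Block every worker, queue more jobs, and report the backlog."""
    jobs = queue.Queue()
    started = threading.Semaphore(0)   # counts workers that are now stuck
    release = threading.Event()        # set this to let stuck jobs return

    def worker():
        while True:
            job = jobs.get()
            try:
                job()
            finally:
                jobs.task_done()

    for _ in range(num_workers):
        threading.Thread(target=worker, daemon=True).start()

    def stuck_job():
        # Stand-in for a handler that never gives its thread back.
        started.release()
        release.wait()

    for _ in range(num_workers):
        jobs.put(stuck_job)
    for _ in range(num_workers):
        started.acquire()              # wait until every worker is blocked

    for _ in range(quick_jobs):
        jobs.put(lambda: None)         # trivial jobs that nobody can pick up
    pending_while_stuck = jobs.qsize()

    release.set()                      # the stuck jobs finally return
    jobs.join()                        # backlog drains
    return pending_while_stuck, jobs.qsize()
```

Calling `run_demo(3, 10)` reports a backlog of 10 while the workers are blocked and 0 after they are released. A backlog that grows without ever reaching the second stage is the case described above as a problem.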

bengan commented 8 years ago

I have the same issue, I think. It seems that pending jobs pile up, doubling each time, I would say. Here is how it looks if I grep for WorkerPool:All. After that I get tracebacks and hass doesn't work properly anymore.

16-05-10 08:05:34 homeassistant.core: WorkerPool:All 21 threads are busy and 64 jobs pending
16-05-10 08:05:39 homeassistant.core: WorkerPool:All 21 threads are busy and 127 jobs pending
16-05-10 08:05:50 homeassistant.core: WorkerPool:All 21 threads are busy and 253 jobs pending
16-05-10 08:06:11 homeassistant.core: WorkerPool:All 21 threads are busy and 505 jobs pending
16-05-10 08:06:53 homeassistant.core: WorkerPool:All 21 threads are busy and 1009 jobs pending
16-05-10 14:16:17 homeassistant.core: WorkerPool:All 21 threads are busy and 2017 jobs pending
16-05-10 14:19:05 homeassistant.core: WorkerPool:All 21 threads are busy and 4033 jobs pending
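
The doubling can be checked directly from the grep output. The following is a small Python sketch, not part of Home Assistant; the function names are made up, and the regular expression assumes the message wording stays exactly as in the lines quoted above.

```python
import re

# Matches the WorkerPool warning as it appears in the log lines above.
PATTERN = re.compile(
    r"WorkerPool:All (\d+) threads are busy and (\d+) jobs pending"
)

SAMPLE = """\
16-05-10 08:05:34 homeassistant.core: WorkerPool:All 21 threads are busy and 64 jobs pending
16-05-10 08:05:39 homeassistant.core: WorkerPool:All 21 threads are busy and 127 jobs pending
16-05-10 08:05:50 homeassistant.core: WorkerPool:All 21 threads are busy and 253 jobs pending
16-05-10 08:06:11 homeassistant.core: WorkerPool:All 21 threads are busy and 505 jobs pending
16-05-10 08:06:53 homeassistant.core: WorkerPool:All 21 threads are busy and 1009 jobs pending
16-05-10 14:16:17 homeassistant.core: WorkerPool:All 21 threads are busy and 2017 jobs pending
16-05-10 14:19:05 homeassistant.core: WorkerPool:All 21 threads are busy and 4033 jobs pending
"""


def pending_counts(log_text):
    """Return the 'jobs pending' number from every WorkerPool warning."""
    return [int(match.group(2)) for match in PATTERN.finditer(log_text)]


def growth_ratios(counts):
    """Ratio of each count to the one before it."""
    return [later / earlier for earlier, later in zip(counts, counts[1:])]


if __name__ == "__main__":
    counts = pending_counts(SAMPLE)
    print(counts)
    print([round(ratio, 3) for ratio in growth_ratios(counts)])
```

Every ratio in this sample sits just under 2, which matches the "doubling" impression: the backlog is only ever growing between warnings, never shrinking.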

balloob commented 8 years ago

Whenever experiencing these issues, please note which components and platforms you are using. Around that message in the log Home Assistant will also print which jobs are currently being processed. This could be a good indication of what is going on.

KmanOz commented 8 years ago

Paulus here is some more info. I cannot work out why in this log there are repetitive entries one after the other. This only happens sporadically. Sometimes when I reboot the server I get none of these events and then sometimes it goes crazy, freezes and becomes unresponsive. System can be fine for days and then out of the blue goes nuts. If it was a config error surely it would happen immediately.

Anyway http://hastebin.com/wawuseloge.lua

KmanOz commented 8 years ago

Here I stopped HA, then deleted home-assistant.db and home-assistant.log. I then rebooted and home-assistant.log filled up almost instantly, again with repetition of what seems to me the same event. On other occasions I can restart the server and get NOTHING except 1 or 2 lines, and it works perfectly for a set period of time. http://hastebin.com/uxekebuhel.pas

balloob commented 8 years ago

Groups have a lock, so each group will only handle one state change of one of its entities at a time. Do you have a lot of groups? Or maybe circular groups (2 groups referring to one another)?
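
One way to rule out circular groups is to check the group definitions mechanically. The sketch below is an assumption-laden illustration in Python, not code from Home Assistant: it takes the groups as a plain dict mapping a group's entity id to the list of entity ids it contains (a hypothetical shape, loosely modelled on a groups.yaml file), and `find_group_cycle` is a made-up helper name. It uses a depth-first search and reports the first cycle it finds.

```python
def find_group_cycle(groups):
    """Return one cycle of group ids as a list, or None if there is none.

    `groups` maps a group id to the entity ids it contains. Members that
    are not themselves keys of `groups` are treated as ordinary entities.
    """
    unvisited, in_progress, done = 0, 1, 2
    state = {group: unvisited for group in groups}

    def visit(group, path):
        state[group] = in_progress
        path.append(group)
        for member in groups[group]:
            if member not in groups:
                continue                      # plain entity, not a group
            if state[member] == in_progress:  # back edge: found a cycle
                return path[path.index(member):] + [member]
            if state[member] == unvisited:
                cycle = visit(member, path)
                if cycle:
                    return cycle
        path.pop()
        state[group] = done
        return None

    for group in groups:
        if state[group] == unvisited:
            cycle = visit(group, [])
            if cycle:
                return cycle
    return None
```

Nested groups (a main group that contains sub-groups) return `None`, because nesting alone is not a cycle. Two groups that list each other return the loop, for example `["group.a", "group.b", "group.a"]`.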

KmanOz commented 8 years ago

You my friend are a champion. Yep found the fault in my groups.yaml. Fixed it and everything is now super. Even Orvibo issue that I have had for a very long time has now disappeared because I'm guessing the core is no longer stuck in a loop in regards to groups. Thanks for the pointer Paulus :dancers:

Luc3as commented 8 years ago

Hello, I don't use and never used any groups in my config, and the problem is still the same.

KmanOz commented 8 years ago

Actually I spoke too soon. Nothing has changed. Today I happened to have edited my groups.yaml in the morning and later saw the message above from @balloob. I put the 2 together thinking that I somehow fixed my groups.yaml file by mistake. As it turns out the issues are still there and I had @Bart274 look at my groups.yaml and he seems to think it's fine.

The same thing is happening as it always has, yet today it worked fine for 4 or 5 hours. Nothing changed between it working and not working.

http://hastebin.com/musowumonu.lua

This is my groups.yaml file. As I said a couple of us looked at it and we think it's fine.

http://hastebin.com/quwogukufa.coffee

KmanOz commented 8 years ago

And just as a contrast this morning I woke up and the first thing I did was reset hass, just like I did in the post above. Nothing else no config changes, just restarted by calling the service and this is the log.

16-05-14 09:19:17 homeassistant.core: WorkerPool:All 11 threads are busy and 34 jobs pending
16-05-14 09:19:19 homeassistant.core: WorkerPool:All 20 threads are busy and 61 jobs pending

That's it, nothing else and I know it will run fine now for a while because it started fine. Something is wrong, maybe with the way it initializes the config files but I'm only guessing and I don't quite know how to troubleshoot it apart from posting logs. Hope this helps somehow.

KmanOz commented 8 years ago

And 3 hours later booooom. It's completely unresponsive. http://hastebin.com/ivefuzuvah

16-05-14 12:11:06 homeassistant.core: WorkerPool:All 20 threads are busy and 15361 jobs pending

How can this be a config issue if for 3 hours it works fine then out of the blue goes haywire? Any suggestions @balloob

Update: so 10 minutes after the event hass has caught up but missed mqtt messages and generally came out of it in a messy state. Anyway I am now going to delete groups.yaml and run it without groups for a while and see what happens.

KmanOz commented 8 years ago

I have removed groups from the config. The same repetitive messages keep coming up.

http://hastebin.com/xuyowovibo.lua

Can anyone help ?

balloob commented 8 years ago

Having a lot of jobs at startup is fine. The problem is that it escalates later. It's as if your system gets in a deadlock and just keeps blocking more threads until you run out.

Try disabling one component or platform at a time to find the culprit
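
When threads appear to be deadlocked, a generic Python technique is to dump the stack of every live thread and see where each one is waiting. The sketch below is not a Home Assistant feature and `dump_thread_stacks` is a hypothetical helper; it relies only on the standard library (`sys._current_frames`, `threading.enumerate`, `traceback.format_stack`). The standard `faulthandler` module offers a similar dump through `faulthandler.dump_traceback(all_threads=True)`.

```python
import sys
import threading
import traceback


def dump_thread_stacks():
    """Return a text report with the current stack of every live thread."""
    names = {thread.ident: thread.name for thread in threading.enumerate()}
    chunks = []
    for ident, frame in sys._current_frames().items():
        stack = "".join(traceback.format_stack(frame))
        chunks.append(
            "Thread %s (%s):\n%s" % (names.get(ident, "?"), ident, stack)
        )
    return "\n".join(chunks)


if __name__ == "__main__":
    release = threading.Event()
    blocked = threading.Thread(target=release.wait, name="demo", daemon=True)
    blocked.start()
    print(dump_thread_stacks())
    release.set()
```

A worker that shows the same function at the bottom of its stack in several dumps taken minutes apart is a reasonable first suspect for the component that is not returning control.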

happyleavesaoc commented 8 years ago

I occasionally get that sort of spam in my logs as well, for MQTT switches: http://hastebin.com/muguxukixe.lua

I don't experience the thread busy issue though.

Luc3as commented 8 years ago

I thought my problems were solved somehow, but obviously there is still some bug. Here is the latest log: http://hastebin.com/oduqimatuh.lua It still shows the looping worker pools. I tried disabling each component in the config one by one and ended up with just HA and the sun component, and the problem is still there. Currently I am running HA on a RasPi 2B with a fast SD card, and I am using Sun, the AsusWRT device tracker, a MySensors gateway, and MQTT sensors on a Mosquitto broker that also runs on the RasPi. I am really sick of this problem because I cannot see what is causing it, and sadly I am thinking more and more about trying another home automation system, which is a shame because I really love HASS. Is there a more verbose or debugging mode I can turn on?

It is weird: at some point something happens, and from then on the sensors still connect to HA normally. I can see MQTT messages, MySensors status updates and everything else in the log or journalctl, but nothing is written to the database after that moment, and this applies to all components in the system.

I cannot imagine using this as a live system for home automation, because it works flawlessly for a while, but later I could not switch on the lights, or the heating would stop working. I thought about a cron job restarting the Home Assistant service at night, but this cannot be the final solution: even if I restart HASS every night, the problem can occur at any time, for example 3 hours later, and it is broken again. Not to mention the problems that could appear after a restart, for example with the initial states of lights (I don't have anything like this configured now, so these are only my thoughts), which could lead to the bedroom ceiling lights coming on at full brightness after a restart.

balloob commented 8 years ago

In your case it seems like it is the streaming events to the frontend that is causing the problem. Good thing that that is getting a full rewrite for our WSGI stuff

KmanOz commented 8 years ago

@balloob It's very hard to eliminate components without losing major functionality of my system and rendering it useless. The core components are mqtt and orvibo. With mqtt I have either switches or sensors, and 7 or so of the sensors use a template to extract data from a JSON string. I use Owntracks, Rollershutter and some other sensors like Speedtest and CPU speed, but that's about it. It's a very basic system compared to what some people are doing out there, but it is mqtt heavy, I guess, because everything I do relies on mqtt. I can't really rely on it at the moment either, and I hope that this issue is somehow fixed. If not, I will have to decide at some stage to drop hass, but I hope not, as I've spent a lot of hours configuring it. When it works it's just awesome, but when it doesn't it's very frustrating, to say the least :(

KmanOz commented 8 years ago

Here's another example of what happens. Normally the system goes crazy but it does stabilize and stays that way for X amount of hours until it goes bananas again. When it's stable, I can call the homeassistant/restart service and it will come back stable with logs like this.

http://hastebin.com/cisojaheke.lua

Now if I do call the restart service, sometimes when it reads back all the mqtt topics that have been published with the retain flag, it errors. The attachment below is a good indication of this condition.

[image: alarm_fault]

Before the restart 'Server Room PIR' was idle. I do a restart and it reads all other topics but misses "Server Room PIR" and says Unknown. This happens all the time between restarts with different sensors sometimes multiple ones. If I do another restart it gets picked up again and returns to normal. Is this an mqtt component fault that holds the core up considering that @happyleavesaoc is getting spurious mqtt messages as well? I could imagine if a thread gets stuck reading an mqtt subscription that problems could stack up and of course my whole system is mqtt based.

happyleavesaoc commented 8 years ago

The warnings in my log were also from MQTT switches with retain: true.

KmanOz commented 8 years ago

@happyleaves Thanks for that. I removed the retain flag for any switches that I have. HA isn't publishing with retained flag for any switch now. It's a pain but oh well. This is interesting though. This next excerpt is from the HiveMQ documentation on how the retained flag works.

" Also the subscribing client can identify if a received message was a retained message or not, because the broker sends out retained messages with the retained flag still set to true. A client can then decide on how to process the message. "

Based on that, the initialization of the mqtt component would be very important on how HA handles it. I can imagine that when HA first connects to my broker it would get flooded with messages because my various RF gateways that I have for Temp, Wind Speed, Humidity, Door Sensors and RF switches all publish with the retain flag. I can make everything publish without it but I found that it was a pain because every time you restart HA it would essentially start blank. I guess the question is does HA properly handle mqtt messages with the retain flag on especially during startup as it is initializing other components as well and receiving data from one it just initialized.
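
The retained-flag behaviour quoted from the HiveMQ documentation can be sketched from the subscriber's side in Python. This is not how Home Assistant's mqtt component is implemented; it only shows that a client can tell a retained (replayed) message from a live one. The callback follows the `on_message(client, userdata, msg)` shape used by the paho-mqtt client, whose message objects carry `topic`, `payload` and `retain` attributes; `make_on_message`, `demo` and the topic name are made up, and the demo feeds in stand-in message objects so that no broker is needed.

```python
from types import SimpleNamespace


def make_on_message(states, retained_topics):
    """Build a paho-style callback that records which updates were retained."""

    def on_message(client, userdata, msg):
        if msg.retain:
            # Replayed by the broker on subscribe, not a fresh event.
            retained_topics.append(msg.topic)
        states[msg.topic] = msg.payload.decode("utf-8")

    return on_message


def demo():
    """Feed one retained and one live message through the callback."""
    states, retained_topics = {}, []
    handler = make_on_message(states, retained_topics)
    # Hypothetical topic; the first message mimics the replay at startup.
    handler(None, None, SimpleNamespace(
        topic="home/server_room/pir", payload=b"idle", retain=True))
    handler(None, None, SimpleNamespace(
        topic="home/server_room/pir", payload=b"motion", retain=False))
    return states, retained_topics
```

With many publishers using the retain flag, a subscriber receives one such replayed message per topic right after subscribing, which is the startup burst described above.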

Anyway I feel sorry @Luc3as because his situation has nothing to do with any of this. His system is as vanilla as it gets and he is having problems.

KmanOz commented 8 years ago

All fixed. My Python install was to blame. I fixed it and the problems are gone. I went the hard route, but I should have read and read again before I commented :) The sporadic nature of how it was operating was throwing me off and making me think the issues were elsewhere.

Luc3as commented 8 years ago

I am giving it a chance. I did all the steps you did and am waiting for results now. Fingers crossed.

KmanOz commented 8 years ago

@Luc3as Actually I do not have it going. Again I didn't leave it long enough to test because, to be honest, I'm getting jack of it. I have done everything. I have rebuilt the whole machine (Debian Jessie) multiple times and tried it on different boxes, but under pressure I am assuming Python will not keep up. And by under pressure I mean a lot of mqtt traffic. The jobs just pile up and the system becomes unresponsive. It's not uncommon for me to see 20 threads busy and 15K jobs waiting. As you can imagine, the system is unresponsive, although it will serve web pages just fine :).

Here's the thing though. When I first started to play with HA it was on a Windows machine running Win10 and Python 3.4.2. If I take the whole setup from my Linux box that fails and move it to Windows, IT WORKS. If it was my setup, surely it would crash on Windows as well. I spent most of my time on Windows getting comfortable with HA, and when I had it worked out I built a Debian box. I moved everything over to it and that's when the problems started. The only thing I haven't done is use Pyenv to create a Python environment, mainly because the box won't be used for anything else and I'm not worried about breaking Python. I thought my issue was that I was installing Python as root, so I changed the install and made my HA user a sudoer. Again, I know that's not what the instructions recommend, but if you look at the actual HA video on installing HA onto Ubuntu, they do it as a sudoer.

So to sum up, it's 3.4.2 Python on Deb & Win 10 and on Windows it works fine. I don't really want to run this on Windows as it's a waste of a box and an expensive license, and I also find it easier to admin the linux box via SSH.

Can anyone please shed some light. Why would it work in the Win environment but not Linux?

KmanOz commented 8 years ago

So today I wanted to eliminate other possibilities. I had previously installed 32-bit Debian Jessie on the Linux box. I realized the only difference between the Windows box and the Linux box was that I was running 32-bit on Linux vs 64-bit on Windows. So I wiped the machine and installed 64-bit Debian Jessie and 64-bit Python. It still did not work. I did the same with 64-bit Ubuntu 16.04, and that did not work either. I moved the config files to Windows: rock solid.

I just don't get it but I give up. Running on Windows it is.

Luc3as commented 8 years ago

So I tried completely reinstalling the whole system, running the latest, fully updated Jessie on a Raspberry Pi 2B+.

After a while the problem occurred again. I would say it is even worse after reinstalling.

I really don't understand it. Running on Windows is not an option for me; I want to run it on the Raspberry Pi. I am curious whether it would be possible to run it on Windows 10 for IoT on the new Raspberry Pi. I don't have one now, so I cannot try, but it would be an alternative.

I still don't believe that just a few of us have these problems, and I don't know what we are doing differently from everyone else.

KmanOz commented 8 years ago

Have you tried using the fabric script or are you doing it manually?

Luc3as commented 8 years ago

manually like this https://home-assistant.io/getting-started/installation-raspberry-pi/

KmanOz commented 8 years ago

Why don't you try the fabric script. Then you definitely have eliminated all possibilities for error considering so many people have used the script with success. Not that you're making errors but you know :D

Luc3as commented 8 years ago

Definitely going to give it a try this evening, maybe a miracle will happen :D

Luc3as commented 8 years ago

So after two long nights I successfully installed HASS through the fabric script. I had a problem with the internet connection while downloading some package, and when I reran the script it could not continue because the users had already been created and the script did not check for this, plus other errors of that kind. But I have the latest version of HASS running. I put my config on the machine, and after a day or so the very same problem happened. As a bonus, I am still getting loads of MQTT errors in the HASS log:

16-05-29 17:05:35 homeassistant.components.mqtt: Disconnected from MQTT (1). Trying to reconnect in 1s
16-05-29 17:05:36 homeassistant.helpers.condition: Value cannot be processed as a number: 
16-05-29 17:07:06 homeassistant.components.mqtt: Disconnected from MQTT (1). Trying to reconnect in 1s
16-05-29 17:07:07 homeassistant.components.mqtt: Disconnected from MQTT (1). Trying to reconnect in 2s
16-05-29 17:08:38 homeassistant.components.mqtt: Disconnected from MQTT (1). Trying to reconnect in 1s
16-05-29 17:08:40 homeassistant.components.mqtt: Disconnected from MQTT (1). Trying to reconnect in 2s
16-05-29 17:10:11 homeassistant.components.mqtt: Disconnected from MQTT (1). Trying to reconnect in 1s
16-05-29 17:10:12 homeassistant.components.mqtt: Disconnected from MQTT (1). Trying to reconnect in 2s
16-05-29 17:11:43 homeassistant.components.mqtt: Disconnected from MQTT (1). Trying to reconnect in 1s
16-05-29 17:11:45 homeassistant.components.mqtt: Disconnected from MQTT (1). Trying to reconnect in 2s
16-05-29 17:13:16 homeassistant.components.mqtt: Disconnected from MQTT (1). Trying to reconnect in 1s
16-05-29 17:13:17 homeassistant.components.mqtt: Disconnected from MQTT (1). Trying to reconnect in 2s
16-05-29 17:14:48 homeassistant.components.mqtt: Disconnected from MQTT (1). Trying to reconnect in 1s
16-05-29 17:14:50 homeassistant.components.mqtt: Disconnected from MQTT (1). Trying to reconnect in 2s
16-05-29 17:16:21 homeassistant.components.mqtt: Disconnected from MQTT (1). Trying to reconnect in 1s

It looks like some problem with writing to the database. I am getting real status changes from sensors in the UI, but when I open the history of some sensor there is just a straight line from the moment the error occurred. No automation rules are applied; the system shows the changes in the UI but does not do anything else with them.

KmanOz commented 8 years ago

I have given up using Linux. I even tried running my setup on the official Docker image to eliminate any Linux / Python setup issues caused by myself. Same shit. Same issues. Put the config files back onto my Windows box, no issues and runs like a charm. Nothing in my environment changes except for the platform that I run HA on. Mosquitto runs on my NAS so it's always been stand alone HA in my hardware. I know this is a very isolated problem with very few users affected but it seems my particular setup is stressing Linux based HA in some way and it is repeatable every time with any flavor of Linux, version of Python, or Docker images. It may not be a problem worth investigating but I'm sure this issue is going to crop up again. Anyway like I said I have given up trying to isolate the issue.

@Luc3as in your log above, aren't HA and the Mosquitto broker running on the same Raspberry Pi? How could HA disconnect from a service running on the same hardware LOL. I have only ever seen those errors when I take my Mosquitto broker offline by mistake, like when I bumped my NAS plug out of the wall :D I'm not sure what devices you have that talk to your broker, but even though you can use the same name & password on all devices, they do have to have different Client IDs or you will get crazy errors.
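
The Client ID point can be checked mechanically. An MQTT broker treats the Client ID as the identity of a session, so when a second device connects with an ID that is already in use, the broker drops the first connection, and the two devices can end up kicking each other off in a loop. The sketch below is a minimal illustration: the device names and IDs are invented, and `duplicate_client_ids` is a hypothetical helper, not part of HA or Mosquitto.

```python
from collections import Counter


def duplicate_client_ids(devices):
    """Return the client IDs that are used by more than one device.

    `devices` maps a device name to the MQTT client ID it connects with.
    """
    counts = Counter(devices.values())
    return sorted(cid for cid, n in counts.items() if n > 1)


# Hypothetical inventory; none of these names come from the thread.
devices = {
    "pir_hall": "esp-01",
    "weather_station": "esp-02",
    "relay_garage": "esp-01",  # clashes with pir_hall
    "home_assistant": "home-assistant",
}

if __name__ == "__main__":
    print(duplicate_client_ids(devices))
```

Any ID this reports should be changed on one of the devices before looking for other causes of the reconnect loop.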

KmanOz commented 8 years ago

" Groups have a lock so each group will only handle 1 state change of one of it's entities at the same time. Do you have a lot of groups ? Or maybe circular groups (2 groups referring to one another) "

@balloob I have groups WITHIN groups but a single entity only appears in 1 group at a time. I do not have groups referring to one another and I also don't have a single group belonging in 2 different groups. I have groups nested under a main group though. It seems Windows handles this differently to Linux.

default_view:
  view: yes
  entities:

Entities only appear in 1 group like group.server but that group only appears once as above
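
Whether a group setup is merely nested or actually circular can be checked with a small depth-first search. This is a sketch under assumptions: the config is represented as a plain dict from group ID to member IDs, `find_group_cycle` is a hypothetical helper, and apart from `group.default_view` and `group.server` the entity names are invented.

```python
def find_group_cycle(groups):
    """Return a list of group IDs that form a cycle, or None if there is none.

    `groups` maps a group ID to the list of entity IDs it contains.
    Members that are not keys of `groups` are treated as plain entities.
    """
    unvisited, in_progress, done = 0, 1, 2
    state = {g: unvisited for g in groups}
    path = []

    def visit(group):
        state[group] = in_progress
        path.append(group)
        for member in groups[group]:
            if member not in groups:
                continue  # a switch, sensor, etc., not a group
            if state[member] == in_progress:
                # We walked back into a group that is still on the current path.
                return path[path.index(member):] + [member]
            if state[member] == unvisited:
                found = visit(member)
                if found:
                    return found
        path.pop()
        state[group] = done
        return None

    for group in groups:
        if state[group] == unvisited:
            found = visit(group)
            if found:
                return found
    return None


# Nested but not circular, like the setup described above.
nested = {
    "group.default_view": ["group.server", "group.weather"],
    "group.server": ["switch.nas"],
    "group.weather": ["sensor.outside_temp"],
}
# Two groups referring to one another.
circular = {"group.a": ["group.b"], "group.b": ["group.a"]}

if __name__ == "__main__":
    print(find_group_cycle(nested))
    print(find_group_cycle(circular))
```

Nested groups like the ones described above pass this check; only a group that can reach itself through its members is reported.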

Luc3as commented 8 years ago

I have no groups in my config. @KmanOz, do you use a mysensors network in your setup? I tried turning off each component in turn again, and now I have disabled the mysensors network. Currently I have only the asuswrt device tracker and 2 MQTT sensors, and the system has been running for a few days without problems. But I don't know whether that is because there are no big changes in states and writes to the DB, or whether the problem is just with the mysensors component.

Luc3as commented 8 years ago

Okay, so as always my conclusions were a little too quick. After 3 days HASS is again completely frozen and not receiving any new state changes, with just 2 MQTT sensors and the asuswrt device tracker.

minida28 commented 8 years ago

Hi, I don't mean to hijack this thread, but I think I had better post here because my issues are quite similar, if not exactly the same. HA sometimes freezes for minutes but then catches up. When it "wakes up", all MQTT sensor values update very fast, but with old sensor values. After a couple of minutes or so it stabilizes and then runs normally. The other problem is that when I try to restart HA, it freezes. The only way to kill it is with sudo kill -9. I have to use the -9 option because the standard sudo kill command just won't work.
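
For context, plain `kill` sends SIGTERM, which a process can handle or never get around to handling if it is stuck, while `kill -9` sends SIGKILL, which cannot be caught. A common pattern is to ask politely first and only force the issue after a grace period. The sketch below shows that pattern with the Python standard library; `stop_process` and `spawn_sleeper` are hypothetical helpers for this example, and the child is a stand-in sleeper process, not Home Assistant.

```python
import subprocess
import sys


def stop_process(proc, grace_seconds=10.0):
    """Try a graceful stop first, then force-kill if the process does not exit.

    Returns "terminated" if the process exited within the grace period,
    otherwise "killed".
    """
    proc.terminate()  # SIGTERM on POSIX, same signal as plain `kill`
    try:
        proc.wait(timeout=grace_seconds)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX, same signal as `kill -9`
        proc.wait()
        return "killed"


def spawn_sleeper():
    """Start a stand-in child process that just sleeps for a minute."""
    return subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])


if __name__ == "__main__":
    child = spawn_sleeper()
    print(stop_process(child, grace_seconds=5.0))
```

A well-behaved process exits on the first signal; needing the forced path every time, as described above, suggests the process is too stuck to act on its shutdown request.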

My config consists of groups and dozens of MQTT entities (switches and sensors), each updating once every second.

For now I work around the above problems by turning off / commenting out these features:

- history
- logbook
- recorder

Of course this is not a solution, but at least HA is now very responsive and I can restart it quickly and successfully whenever I want (either using the built-in command in the Developer Tools or from the command line in a terminal).

I hope we can find the solutions to these problems.

I am using an old laptop (Core 2 Duo) with 2GB of RAM. The OS is Ubuntu Server 16.04 LTS. The HA version is 0.20.3, installed the virtualenv way, as described in the tutorial.

Thanks, HA is a superb project!

KmanOz commented 8 years ago

@minida28 please..... the more info the better. You are definitely not hijacking this thread. It seems we have fairly similar systems. All my switches and sensors are MQTT based: PIRs, weather / wind / temperature etc. All weather sensors update every minute for Temp/Humidity, and because I have Oregon Scientific weather sensors outside, I get updates whenever they decide to transmit them. I have Owntracks (MQTT) and 8 Orvibo switches. Although I did have issues with them at the start (v12.x or something), they have no longer been a problem since the new Orvibo script was released. The way you described your issues is exactly what I am seeing, EXACTLY. I have tried everything: different flavors of Linux, different computers, running the Docker version of the software in case I built the machine wrong. I even put an SSD into a Linux box (2.4GHz Core 2 Duo, 1 GB RAM) and it made no difference. Threads get stuck, and when they do the jobs just pile up. @balloob suggested it may be groups, so last night I completely removed groups and here is the log today. I won't bore you with the details, just what happened over time.

16-06-06 08:55:08 homeassistant.core: WorkerPool:All 19 threads are busy and 229 jobs pending
16-06-06 08:55:13 homeassistant.core: WorkerPool:All 19 threads are busy and 457 jobs pending
16-06-06 08:55:32 homeassistant.core: WorkerPool:All 19 threads are busy and 913 jobs pending
16-06-06 08:56:09 homeassistant.core: WorkerPool:All 19 threads are busy and 1825 jobs pending
16-06-06 08:57:11 homeassistant.core: WorkerPool:All 19 threads are busy and 3649 jobs pending
16-06-06 09:00:05 homeassistant.core: WorkerPool:All 19 threads are busy and 7297 jobs pending
16-06-06 09:04:34 homeassistant.core: WorkerPool:All 19 threads are busy and 14593 jobs pending
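
The numbers in these warnings can be turned into a rough backlog growth rate. The sketch below parses the lines above; the regular expression and the helper names `backlog_samples` and `growth_per_second` are assumptions made for this example, and the result is only an average over the logged window, not a statement about how HA schedules jobs internally.

```python
import re
from datetime import datetime

# The warning lines from the log above, one per line.
LOG = """\
16-06-06 08:55:08 homeassistant.core: WorkerPool:All 19 threads are busy and 229 jobs pending
16-06-06 08:55:13 homeassistant.core: WorkerPool:All 19 threads are busy and 457 jobs pending
16-06-06 08:55:32 homeassistant.core: WorkerPool:All 19 threads are busy and 913 jobs pending
16-06-06 08:56:09 homeassistant.core: WorkerPool:All 19 threads are busy and 1825 jobs pending
16-06-06 08:57:11 homeassistant.core: WorkerPool:All 19 threads are busy and 3649 jobs pending
16-06-06 09:00:05 homeassistant.core: WorkerPool:All 19 threads are busy and 7297 jobs pending
16-06-06 09:04:34 homeassistant.core: WorkerPool:All 19 threads are busy and 14593 jobs pending
"""

PATTERN = re.compile(
    r"^(\S+ \S+) homeassistant\.core: "
    r"WorkerPool:All (\d+) threads are busy and (\d+) jobs pending$"
)


def backlog_samples(log_text):
    """Return (timestamp, pending_jobs) tuples for every WorkerPool warning."""
    samples = []
    for line in log_text.splitlines():
        match = PATTERN.match(line.strip())
        if match:
            when = datetime.strptime(match.group(1), "%y-%m-%d %H:%M:%S")
            samples.append((when, int(match.group(3))))
    return samples


def growth_per_second(samples):
    """Average number of jobs added to the backlog per second, first to last sample."""
    (t0, n0), (t1, n1) = samples[0], samples[-1]
    return (n1 - n0) / (t1 - t0).total_seconds()


if __name__ == "__main__":
    samples = backlog_samples(LOG)
    print([pending for _, pending in samples])
    print(round(growth_per_second(samples), 1))
```

Each logged count is almost exactly double the previous one, so the warning appears to fire whenever the backlog doubles, and the timestamps then show how long each doubling took. Over the whole window the backlog grew by roughly 25 jobs per second on average while none of the 19 threads became free.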

That was almost 10 minutes but it can get much worse. When that happens, forget about killing HA, I just pull the plug in frustration. If I wait, and if it comes back, all sensors go nuts updating, and any automations based on those sensors fire off and it's quite amusing (read frustrating) to watch it all happen.

I have been trying to figure this out for 1 month now. Let me say the thing that has really bothered me is that it can be very stable for a day or two which makes me think I have found the issue then out of the blue, goes completely crazy. I am going to try what you did above and see how it goes.

@Luc3as Try turning off history etc. and let's see what happens. I will be back with my findings :dancer:

KmanOz commented 8 years ago

@Luc3as Sorry.. just read your question above. No I do not have mysensors network. All my switches / sensors etc are custom ones I have created myself except for 8 X Orvibo switches.

And yes I have had the system stable for 2 or 3 days too. It makes you think you have solved the problem and then BANG same problem :D

philhawthorne commented 8 years ago

Hey guys,

Just going to throw my own experience in here too. I am running the docker image on a Synology NAS. I have owntracks setup with MQTT, and I sometimes get these issues too.

I'll see the same worker threads busy in my logs. HA will be laggy. For example, I'll walk into a room and trigger a Z-Wave multisensor. HA should then tell my Hue lights to turn on. Most of the time it's instant, but when I'm experiencing the same as above, it can be a good minute (after I've left the room) before the lights turn on. The same goes for manually turning Hue and Z-Wave devices on/off from the HA UI: toggling a switch in the UI sends the command to the device with a delay.
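
The delay described here is what a fixed-size worker pool produces when every worker is blocked: new jobs are accepted but just sit in the queue until a worker frees up. The sketch below reproduces that effect with the standard library only. It is a toy model with 2 workers and 10 jobs, not HA's actual WorkerPool, and `run_demo` is a name made up for this example.

```python
import queue
import threading


def run_demo(workers=2, jobs=10):
    """Block every worker on one job and report how many jobs pile up.

    Returns (pending_while_stuck, pending_after_release).
    """
    gate = threading.Event()            # stands in for whatever the workers are stuck on
    started = threading.Semaphore(0)    # lets us know when a worker has picked up a job
    job_queue = queue.Queue()

    def worker():
        while True:
            job = job_queue.get()
            if job is None:             # sentinel: shut this worker down
                job_queue.task_done()
                return
            started.release()
            gate.wait()                 # the "stuck" part of the job
            job_queue.task_done()

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(workers)]
    for thread in threads:
        thread.start()

    for job in range(jobs):
        job_queue.put(job)

    # Wait until every worker holds exactly one job and is blocked on the gate.
    for _ in range(workers):
        started.acquire()
    pending_while_stuck = job_queue.qsize()

    gate.set()                          # unblock the workers
    job_queue.join()                    # wait for the backlog to drain
    pending_after_release = job_queue.qsize()

    for _ in threads:
        job_queue.put(None)
    for thread in threads:
        thread.join()
    return pending_while_stuck, pending_after_release


if __name__ == "__main__":
    print(run_demo())
```

While the workers are stuck, every job beyond the ones they already hold stays pending; once they are released the backlog drains all at once, which matches the burst of late updates people describe in this thread.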

KmanOz commented 8 years ago

@philhawthorne Hey bud. Welcome aboard. I notice you're in Melb. Great part of the world, isn't it? Yes, your issues are exactly what I am talking about and have similar symptoms.

Well I am glad people are coming here and talking about this. Further to my test I can confirm that it isn't the removal of

that's causing a problem for me. I removed them and re started HA and I have the same issues.

I can create the issue on demand here. Here's how I do it. I have modified the Owntracks .otrc file to make Owntracks update far more frequently when in Move Mode. That means a LOT more battery usage, but when I am mobile in the car I don't care because the phone is on constant charge. I wanted that granularity anyway because I plan to use Owntracks recorder in the background to keep track of where I go. The result, though, is that my broker receives a message every 10 seconds from Owntracks, which means HA also sees that message. When HA is working properly, you see my icon on the HA map move in real time as I am traveling. I know that HA has fallen behind when the position on the map doesn't match where I actually am. I monitor all broker traffic with mqtt-spy and can confirm that the broker is working perfectly, my sensors and switches are working perfectly, and all messages in and out of the broker are just fine. It's just HA that has frozen or is dropping MQTT messages because it is stuck somewhere else doing whatever.
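
The "position on the map doesn't match where I am" check can be made numeric. OwnTracks location messages are JSON and carry a `tst` field with the Unix time of the fix, so the lag is just the time a message is processed minus `tst`. The sketch below assumes that payload shape; the sample coordinates and timestamps, the 60-second threshold, and the helper names are all invented for illustration.

```python
import json


def message_lag_seconds(payload, now_epoch):
    """Seconds between the location fix in an OwnTracks message and `now_epoch`."""
    data = json.loads(payload)
    return now_epoch - int(data["tst"])


def is_falling_behind(payload, now_epoch, max_lag_seconds=60):
    """True if the message is older than the (hypothetical) allowed lag."""
    return message_lag_seconds(payload, now_epoch) > max_lag_seconds


# Hypothetical location message; the values are placeholders.
SAMPLE = '{"_type": "location", "lat": 0.0, "lon": 0.0, "tst": 1465200000}'

if __name__ == "__main__":
    # Processed 20 seconds after the fix: fine for a 10-second publish interval.
    print(message_lag_seconds(SAMPLE, now_epoch=1465200020))
    # Processed 20 minutes after the fix: the consumer is clearly behind.
    print(is_falling_behind(SAMPLE, now_epoch=1465201200))
```

Logging this lag on the consumer side would show whether the delay builds up gradually or appears suddenly, which the map alone does not reveal.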

Anyway I have been home now for 15 minutes and HA just updated my position in the UI to a zone I was in 20 minutes ago. In fact this time it has frozen for good and isn't coming back. I need to do a complete restart, and I usually just kill the Docker container.

As far as I am concerned this is a major fault with their threading engine or however the threads are handled in HA. I have disabled almost everything, to the point that there's no point using HA like that. I hope someone chimes in soon with a solution; otherwise, as much as I like this system, I am almost at the point of scrapping it and the work I have put into it.

I can say this. Windows handles this MUCH better than Linux. I don't see a lot of issues in Windows but they still do happen. I love this system for its openness, its ease of setup and its features, but the core of the system needs to be rock solid before feature upon feature is added on. It used to be stable, but that was many versions ago and is almost impossible to test, because not only do versions change quickly but syntax changes as well, and I cannot be bothered trying to resurrect a system that was working 3 months ago and having to rewrite automations, scripts etc. to suit previous versions. I'll go back to my microcontroller based system that didn't have a fancy UI but worked ROCK SOLID forever. I'll just add Homebridge support to it and control the system by voice commands.

Anyway super frustrated.

minida28 commented 8 years ago

Found this in the HA Community forum; the OP also describes a similar issue: https://community.home-assistant.io/t/restart-stop-home-assistant-systemd/354

KmanOz commented 8 years ago

Guys install 0.21.0. Just like that 1 month of checking configs, re-checking configs, re-checking installs of OS's, testing different machines etc have disappeared. Seems the title of this thread wasn't that far off the mark. According to the Blog BIG core improvements in 0.21.0. Everything working as expected for me. Thanks @balloob

minida28 commented 8 years ago

Yeah agreed, thanks to @balloob and @JshWright for the latest iteration. I too updated to 0.21.0 yesterday and could feel the improvements right away. I made no changes to the config file except that I have turned history, logbook and recorder back on. Restarting HA still takes about 1.5 minutes on my machine, but that is a big improvement compared to the previous version (previously HA wouldn't restart and I could only kill the instance with sudo kill -9).

Luc3as commented 8 years ago

Yup, super work. I can confirm it has been running for a few days without restarting the service and it is working like a charm. There is still some issue with updating graph data, but that is tracked in another issue. Thanks for the support.