Closed KennethLavrsen closed 5 years ago
Follow up. After 15 reboot I am in the situation that the stick is working again but all my sensors are gone and do not come back. So again I am depending on a snapshot to get the system back. This is HORRIBLE
I've seen things like this in the past when hardware was the culprit. Is the ZWave stick connected directly to the NUC or through some kind of extender/hub?
Since you've got docker setup, grab the openzwave control panel image (https://hub.docker.com/r/openzwave/openzwave-control-panel) and see how things initialize. You can't run both the control panel and hass at the same time, but the underlying library is the same.
I run in docker following step by step but I have no clue what is happening with docker so it is a learning curve before I can run dockers. I will give it a try but may ask for help on Discord.
I did the installation but I need to learn how to start and stop it. I run the Aeotec stick without any hubs directly on the NUC with a 1 m extension cable of a good quality. I have a lot of hardware test equipment so I have checked the voltage is correct at the end of the cable. I have tried two different cables. I use the extention to get a little distance between the RF noisy computer and the stick and that has helped a lot on the range. It cannot the the signal integrity of the serial signal. It is slow and if I have bit errors then I will see that in the logs all the time. This is a transitional thing that only happens at restart of HA. Note that this happens both cold and warm restarts where only HA restarts. I think HA (or OZW) does not correctly init or close the serial device. I think the stick is left in a bad mode because it always cures itself after some seconds and the errors stop in the log. But unfortunately HA is not very robust and does not recover a bad start of the OZW. Once failed and the device never come up
Could this be an issue with the NUC? Either it's powering down the usb ports when there's no activity or something like that? I only ask because someone mention the NUC in one of those issues you linked. I did a little searching and found this thread talking about the NUC leaving connected USB devices in a weird state upon shutdown. Just a thought, I could be way off 😄
Anecdotal, but I've been running the same USB stick on HA since 0.33 without issue (at least, no issues like this). I also have a usb cord as an extension, similar in length to what you mentioned. Back then I was running with the AIO installer on a raspberry pi 3, now I'm running in a docker container on a home server.
First thing I would try, outside of checking this with openzwave control panel, is running the stick without an extension cable.
Had this on my NUC too, moving the stick to a front USB port solved the problem...
Will try that too. I also upgraded my BIOS. I had one incident. Problem is stats. When it fails it fails maybe 4-6 resets in a row. And then it runs fine for a week and then boom you have the problem again. It seems to happen more often when you restart HA many times in connection with adding things to the config. Could it be related to restarting while the stick is doing a lot of read and writes like during a restart? I rarely if ever saw a problem with a single restart. I remember it as something it does when I am working on a config change and restart many times. Could that leave the stick in limbo? With respect to front vs back. The BIOS allows to enable higher charge currents on selected ports. I enabled it on all. I will try the opposite. I know the extension is not the issue because I tried without and I have measured that the voltage is OK at the other end. I have all the right equipment for that (hardware engineer)
OK I have upgraded the NUC BIOS I have put the Aeotec stick in a front connector yesterday and restart HA many times last night. The first thing that happens after the restart in connection with 0.88.1 upgrade it all fails again with the same error message as earlier It took 4 reboots incl a cold power down and up again to get Z-wave back working. I do not believe in any NUC or hardware related issues. There is something wrong with the way HA (or OZW starts and/or shuts down that stick. And it is important to note the behaviour where it take not one but multiple restarts to get it working again. And then it can work for a week.
And it never ever fails once it is running. So communication errors on the serial connection is out of the question because they would happen randomly. This is either an init problem or a shutdown problem or linked to starting/stopping while communication is going on. That would explain why it is so difficult to restart because that is where there is tons of communication with the stick for quite a long time while the network builds up.
The only thing I find strange is that the issue can be persistant even after a cold boot where I can see the light in the stick go off (USB power off) and on again. But we have to remember that the stick is battery powered so mayby it starts emitting old data the minute there is someone that listens. Maybe it is data pending to be sent from earlier activities from the stick that HA cannot handle? If you note the logs then the errors is never missing communication but garbage data
I have same issue, this after it has for some reasons swapped ACM0 and ACM1. I tried multiple reboots, restarts... not better...
Any other idea?
I could solve my problem by changing the path to .storage/core.config_entries
Just a note to this does not get closed for te wrong reason. The bug I have reported as nothing to do with wrong path to the device. I have confirmed that the device is the /dev/ttyACM0 as setup. I check that every time I see the failure mode. It is only when you either add additional sticks (like a Zigbee sniffer) or pull out the stick and plug it back that the device name changes. And if this happens randomly then you either have to specify by device ID or create a udev rule to set fixed device names. It is easy to find how to on Google.
I just say this to emphasize that this bug item is not resolved by changing device path in a file. This bug report relates to the issue with HA is failing on garbage data received from stick during boot and restart as I have described it above.
I keep on working on finding more data to aid the developers. If we say it is the stick flushing old data (which will be hard to fix if it is a bug in the Aeotec) a work around may be if HA in case of failure waits a while and makes a new attempt to start all over with the init. Since this only happens at restart/boot it would not have to retry forever. Just once may be enough.
@KennethLavrsen can you try waiting for Home Assistant to fully restart before starting the zwave network? https://www.pulse-station.com/ufaqs/zwave-logs-warning-out-of-frame-flow-sending-nak/ explains a situation where the stick is not ready to accept commands and can lock up the stick. Perhaps we need a way to prevent zwave network from starting automatically and then do a delayed start like some of the other components
The past week I have not seen the problem but that is also because I restart less frequently. But I will like to try next time the situation happens. But please explain - how do I prevent the zwave network from starting by itself at restart? I cannot see any configs for that
Looks like we don't have the ability to do a delayed start for zwave. Maybe this is something we can get added.
Not sure if this might help. Homekit has a semi-similar problem. If a delay after HA restart is the problem perhaps an automation like this may help:
Two things. First having this automation .. will that prevent the start of the stick too early? Or just restart the service?
Second, is there a service similar to home kit that can start or restart the zwave network? A previous reply indicates that there is not.
On your second question, zwave DOES have a zwave.start
and zwave.stop
service, although I have never personally used them do can't say if they work or not. I don't think those services would help you delay a zwave start, but maybe it's worth trying?
As far I know, we still need a way to delay the start of it. We have the ability to start/stop but need a way to delay.
On Fri, Mar 22, 2019 at 7:24 AM Kevin Fronczak notifications@github.com wrote:
On your second question, zwave DOES have a zwave.start and zwave.stop service, although I have never personally used them do can't say if they work or not. I don't think those services would help you delay a zwave start, but maybe it's worth trying?
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/home-assistant/home-assistant/issues/21165#issuecomment-475584431, or mute the thread https://github.com/notifications/unsubscribe-auth/ABATkBkPSqTc6ttns937MPbA_1s47RG-ks5vZL2FgaJpZM4bASEN .
I have a setup with a NUC running hassos and the Aeotec Gen5 stick. No issues so far but we will be adding about 30 devices to this setup over the next week. I'll try to see if this changes the behaviour.
I have exactly the same problem on a similar hardware configuration and installation. It seems to be an issue in the shutting down of HA that causes the issue when it restarts.
I've found that the following manipulation is solving the problem:
Can confirm same problem on HASSIO Docker image v0.92.2 on Ubuntu NUC. Same Aeotec gen5 zstick USB + zwave network. Was fine on rpi3b+, moved to nuc and now every few reboots (or restarts of HA via General menu) results in zwave startup issues.
I've got a ASRock KBL-NUC that is exhibiting a similar problem with an Aeotec Gen5 Stick (Ubuntu base with hassio). Except it's more like 1 out of 10 starts results in a working ZWave system. Powering down the whole system, or just removing the ZWave Stick seems to help a bit, but it's not a real fix. This has been a problem for several versions now, and is still a problem with 0.93.1.
On the plus side, once it's working it works well until the next time HA restarts. It just makes me dread restarting HA for any reason.
I'm running an Intel-NUC NUC5CPYH, ubuntu 19.04, docker 18.09.5, hassio 0.93.1 also with the Aeotech gen 5 stick. Having the same experience with Zwave after migrating from Rpi.
From my troubleshooting I have seen that even when the Zwave nodes are not showing up, activity shows in the zwave logs. Occasionally there are no nodes present under Z-Wave Network Management, other times there is only Node1 with an unknown status.
Some of the time I am able to stop and restart the z wave network from within Z-wave management page, other times each button responds with a failed message.
I have tried rebooting from command line, gui and portainer, and none seem to have a consistant working z-wave network.
Also, once it does boot correctly with z-wave, no issues until a reboot then 50/50.
Wondering if the migration of the network is common with others having this issue, or is your z-wave network built while running on your current hardware?
If a migration from other hardware is the common trend, then I plan to delete zwcfgxxxx and start factory resetting and re adding nodes fresh.
edit; I did another reboot and under ZWNM only node1:unknown(complete) appeared in the list, but I was able to stop the network, and then start the network with all nodes showing up correctly without having to reboot HA.
This issue was resolved for me by updating the NUC to latest BIOS version. Last thing I would have expected to be the cause!
as mentioned in discord - "Out of Frame Flow messages" are 99.99999% of times this is caused by another program accessing the same serial device as OZW (/dev/ttyUSB0 etc). I believe there was some terminal addon in Hass.io that conflicted with this previously.
You can try running lsof <path to USB device>
in the container/host etc of your gateway to see what else is opening the port.
Some previous users have reported Getty (Serial Console), NUT (UPS monitoring daemon) to be other applications/services running that. I've even seen some users restart HASS, and the old copy of HASS is still shutting down when the new copy is starting up, causing two copies OZW (for a small period of time) accessing the same serial port.
also - due to symlinks etc in the /dev/ tree - lsof /dev/ttyUSB0 etc might not always return everything. you should also try 'lsof | grep /dev/' to see help filter. (eg, if you use the /dev/serial/by-id/
I doubt that it is another program that access the serial because why would that only happen when a docker container with hass.io restarts? I would expect sporatic errors randomly if there is some UPS service playing along. I could also expect something like this when you boot the machine but this is happening both at cold boot and at restart of HA. I am more thinking that it could be HA that closes down the serial port and then start up again before the old communication is over.
Someone said that the Aeotec stick is know to continue sending data when the z-wave net is active.
The probability that the error occurs is much higher when you restart many times at short interval (the typical situation where you try to get a new thing setup in configs). In this situation the Z-wave net is restarted again and again and will be talking to the stick while HA is restarting again.
I think the HA devs should try and make a trial with a delayed start of Z-wave including delaying access to the serial to allow the old HA instance to die and the communication on the stick to die. It would also be good if HA would retry a minute later if the init fails.
Two instances of OZW could explain why I have seen multiple failures that I could not recover from without restoring the zwcfg... file from backup. I have had several situations where many many restarts and cold reboots never recovered and I had to resort to a snapshot. For the same reason I make many snapshots when I change something
just do sudo lsof | grep USB
and check. Bet you have ModemManager hogging the port from time to time. ZHA had same issues.
I will happily place money that the issue is something else is opening/opened the serial port when you get these Out Of Frame Messages. it may be a old copy/container of hassio shutting down, it may be something else... but every single time I have seen it (hundreds maybe?) its been duplicate copies of a application, or a 3rd party application.
as far as your question goes about only happening upon restarts? Some applications will try, and error out as well, so it happens upon "booting" the container and the other program errors out (because its not talking to what it thinks it is) and exit.
If your handy with Linux, something like
bpftrace -e 'tracepoint:syscalls:sys_enter_open { printf("%s %s\n", comm, str(args->filename)); }'
running when you restart your containers should show you what is opening what on your system.
(run that on your host, not in the container)
I gave up on Hassio and Z-wave and moved to Zwave2MQTT so I cannot test any more
I did try the lsof before moving my zwave to zwave2mqtt and it showed a long list of hassio process number but never any third party stuff
It would be strange if the situation was that every time a hassio docker instance restarts some 3rd party process would wake up and talk on the serial and then never again. A third party program from the host OS would either do it at host boot or randomly all the time. This only happens when hassio is restarted.
Let us wrap up the facts
Remember that the Aeotec is a battery powered device that seems to have some life even when unplugged and it may start talking on the serial the minute the serial is up. There may not even be any flow control
My theory is that the stick continues to talk on the serial during restarts. Maybe not for long but long enough to confuse the new instance of hassio. Additionally I think hassio may not close down OZW very nicely. Maybe the communication with stick is not shut down correctly. Maybe that is also what causes a corrupt OZW data file.
There is something hassio does not do right with open zwave. And it seems the Raspberry Pi is simply so slow that it does not matter. Or the USB chipset on Raspberry handles things differently when a process dies.
I would look at how hassio shuts down open zwave and how it handles failure modes during startup.
I would also look for multiple hass.io processes trying to talk to the Aeotec via OZW at the same time. And again timing seems to be important since the slower raspberry does not show the problem
@KennethLavrsen yes, it happens after restarts because hassio HA container is triggering udev, which in turn may trigger other applications like Ubuntu's Modem manager.
Hmm. That is a good idea. Unfortunately I no longer have a HA with an Aeotec stick but if someone still having the HW and the problem could try this. I assume the thing to do is to disable modem manager in Ubuntu host. I cannot test these but here is what I found on Google search
https://linux-tips.com/t/prevent-modem-manager-to-capture-usb-serial-devices/284
https://www.freedesktop.org/software/ModemManager/api/latest/ref-overview-modem-filter.html
https://stackoverflow.com/questions/24696527/modem-manager-and-ttyacm-in-use
There are plenty more. If this is the root cause then there is still a todo for HA because it should simply wait and try again if it fails like it does with so many other devices. The most primitive work around would be to wait one minute and try again.
Hi. So I have now tried this: https://linux-tips.com/t/prevent-modem-manager-to-capture-usb-serial-devices/284
(https://askubuntu.com/questions/399263/udev-rules-seem-ignored-can-not-prevent-modem-manager-from-grabbing-device) I believe this is really the way to go..
But what I ended up doing was systemctl disable ModemManager.service
Have now restarted NUC 2 times, and many restarts of Hassio. Seem to work fine. Will continue testing it
@dgooglr found the solution! I had this problems from start with zwave on NUC. Now I have restared about 20x HA, rebooted NUC 10x, power off on NUC 5x, fast restart of HA-without wait to zwave ready 15x, and zwave works! This reboot causes before the lost zwave! HA also load faster now, very fast! And no more slow start of zwave!
I decided to put my zwave back on HA. I need to restart many many more times or a period of time to confirm the fix. So far it looks good.
The fix was from the host Ubuntu to diable modem manager with systemctl disable ModemManager and then just reboot the whole machine
I will report back. If this works out the short term is to get this in the Zwave documentation and hopefully later someone can improve HA so it retries after a minute if the startup of Zwave fails. It was @Adminiuga saying the word ModemManager that made this breakthrough
We have the fix now! Finnaly!
Since I disabled Modem Manager I have not had a single event of startup problems after boot or restart.
The easy work around is to disable Modem Manager.
systemctl disable ModemManager.service
The documentation for HA has had this addressed now but the documentation suggests to purge the ModemManager from the system. This however can have negative effects on distros with Gnome desktops so best advice is to disable the service.
I still believe HA should be coded to take an approach that allows people that may need ModemManager or simply have not seen or understood the documentation. Modem Manager only runs for a few seconds so HA could simply retry maybe 30 seconds after a failed init and the user would be happily ignorant about the issue. There are so many other integrations that have mechanisms to reconnect when the link fails. That would be the best solution. But at least with the documentation update people can now use the Aeotec Z stick.
Thanks people for figurng this out. Something funny: I had deconz installed and accidently used the same dev path for the deconz stick which provoked my error messages .. Now with disabled modemManager and correct configuration of my deconz USB0 device everyting is back to normal :P
So the documentation was fixed in: https://github.com/home-assistant/home-assistant.io/pull/9613 for Debian/Ubuntu based systems, and in https://github.com/home-assistant/home-assistant.io/pull/9638 for Hass.io, Docker.
Imho this issue should be closed.
Yes. I am OK with that.
Hi guys, i've got similar issue with Raspberry 3. However there is no ModemManager installed. Is there any other link or solution? Thanks for your help. regards ondrej
Home Assistant release with the issue: 0.87.1
Last working Home Assistant release (if known): Been unstable since I started using HA a month ago on 0.86
Operating environment (Hass.io/Docker/Windows/etc.): Hass.io under docker in Ubuntu host on NUC
Component/platform: Zwave
Description of problem: When I restart HA I have a 50/50 chance/risk that Z-wave is not comming up. And it is often many boots in a row and seems to be worse when you do it often
The USB device is there on ttyACM0. I check everytime. It is there and visible in the container. I am using an Aeotec USB stick.
It is always at startup it fails. If it comes up it is stable. I have checked USB voltage. I have checked the stick on a Windows machine. There is nothing wrong with the stick. It is HA that cannot initialize it in a reliable way (or OZW).
Problem-relevant
configuration.yaml
entries and (fill out even if it seems unimportant):Traceback (if applicable): When it fails the OZW log always contains something like this
or this
Not always the same. It does not matter that the communication seems to continue without hickups. Once failed HA refuses to finish the Z-Wave setup and you end up with no devices or only the Z-wave stick as a single device. And then you can restart and reboot and suddenly it comes back. And then you may boot or restart HA 10 times without a problem. There is a timing issue or an issue with not closing the comm port. The device contains a battery so it may be left in limbo mode. I cannot be sure.
This makes HA so unreliable that I cannot trust it to guard my house. I have seen this issue raise several times on the bug tracker - like here https://github.com/home-assistant/home-assistant/issues/15828 (closed without a fix) - and a very old one https://github.com/home-assistant/home-assistant/issues/5501 (still open) with an error mode which is similar but not the same
I add 4 OZW logs. The ones with ERR are where HA would not start Z-wave and the OK ones are where everything works.
I have once been in a situation where 20 restarts did not recover and I had to reinstall a snapshot from a few days back which revived the Z-wave again. So it can fail in a pretty serious way.
OZW_Log1902180657ERR.txt OZW_Log1902131926OK.txt OZW_Log1902131922ERR.txt OZW_Log1902131913OK.txt
Additional information: