home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0
73.96k stars 31.01k forks source link

Z-wave Aeotec stick is not started up correctly. #21165

Closed KennethLavrsen closed 5 years ago

KennethLavrsen commented 5 years ago

Home Assistant release with the issue: 0.87.1

Last working Home Assistant release (if known): Been unstable since I started using HA a month ago on 0.86

Operating environment (Hass.io/Docker/Windows/etc.): Hass.io under docker in Ubuntu host on NUC

Component/platform: Zwave

Description of problem: When I restart HA I have a 50/50 chance/risk that Z-wave is not comming up. And it is often many boots in a row and seems to be worse when you do it often

The USB device is there on ttyACM0. I check everytime. It is there and visible in the container. I am using an Aeotec USB stick.

It is always at startup it fails. If it comes up it is stable. I have checked USB voltage. I have checked the stick on a Windows machine. There is nothing wrong with the stick. It is HA that cannot initialize it in a reliable way (or OZW).

Problem-relevant configuration.yaml entries and (fill out even if it seems unimportant):

Traceback (if applicable): When it fails the OZW log always contains something like this

2019-02-18 08:27:51.765 Warning, WARNING: 500ms passed without reading the rest of the frame...aborting frame read
2019-02-18 08:27:51.765 Warning, WARNING: Out of frame flow! (0x02).  Sending NAK.
2019-02-18 08:27:53.267 Warning, WARNING: 500ms passed without reading the rest of the frame...aborting frame read
2019-02-18 08:27:53.267 Warning, WARNING: Out of frame flow! (0x02).  Sending NAK.
2019-02-18 08:27:53.767 Warning, WARNING: 500ms passed without reading the rest of the frame...aborting frame read
2019-02-18 08:27:53.767 Detail, contrlr,   Received: 0x01, 0x02, 0x08, 0x1d
2019-02-18 08:27:53.767 Warning, contrlr, WARNING: Checksum incorrect - sending NAK
2019-02-18 08:28:01.166 Error, contrlr, ERROR: Dropping command, expected response not received after 1 attempt(s)

or this

2019-02-18 06:55:16.690 Info, contrlr, Sending (Command) message (Callback ID=0x00, Expected Reply=0x15) - FUNC_ID_ZW_GET_VERSION: 0x01, 0x03, 0x00, 0x15, 0xe9
2019-02-18 06:55:17.682 Detail, Unsolicited message received while waiting for ACK.
2019-02-18 06:55:18.183 Warning, WARNING: 500ms passed without reading the rest of the frame...aborting frame read
2019-02-18 06:55:18.183 Warning, contrlr, WARNING: NAK received...triggering resend
2019-02-18 06:55:18.183 Error, contrlr, ERROR: Dropping command, expected response not received after 1 attempt(s)
2019-02-18 06:55:18.183 Detail, contrlr, Removing current message
2019-02-18 06:55:18.183 Warning, WARNING: Out of frame flow! (0x5a).  Sending NAK. 

Not always the same. It does not matter that the communication seems to continue without hickups. Once failed HA refuses to finish the Z-Wave setup and you end up with no devices or only the Z-wave stick as a single device. And then you can restart and reboot and suddenly it comes back. And then you may boot or restart HA 10 times without a problem. There is a timing issue or an issue with not closing the comm port. The device contains a battery so it may be left in limbo mode. I cannot be sure.

This makes HA so unreliable that I cannot trust it to guard my house. I have seen this issue raise several times on the bug tracker - like here https://github.com/home-assistant/home-assistant/issues/15828 (closed without a fix) - and a very old one https://github.com/home-assistant/home-assistant/issues/5501 (still open) with an error mode which is similar but not the same

I add 4 OZW logs. The ones with ERR are where HA would not start Z-wave and the OK ones are where everything works.

I have once been in a situation where 20 restarts did not recover and I had to reinstall a snapshot from a few days back which revived the Z-wave again. So it can fail in a pretty serious way.

OZW_Log1902180657ERR.txt OZW_Log1902131926OK.txt OZW_Log1902131922ERR.txt OZW_Log1902131913OK.txt

Additional information:

KennethLavrsen commented 5 years ago

Follow up. After 15 reboot I am in the situation that the stick is working again but all my sensors are gone and do not come back. So again I am depending on a snapshot to get the system back. This is HORRIBLE

mezz64 commented 5 years ago

I've seen things like this in the past when hardware was the culprit. Is the ZWave stick connected directly to the NUC or through some kind of extender/hub?

Since you've got docker setup, grab the openzwave control panel image (https://hub.docker.com/r/openzwave/openzwave-control-panel) and see how things initialize. You can't run both the control panel and hass at the same time, but the underlying library is the same.

KennethLavrsen commented 5 years ago

I run in docker following step by step but I have no clue what is happening with docker so it is a learning curve before I can run dockers. I will give it a try but may ask for help on Discord.

I did the installation but I need to learn how to start and stop it. I run the Aeotec stick without any hubs directly on the NUC with a 1 m extension cable of a good quality. I have a lot of hardware test equipment so I have checked the voltage is correct at the end of the cable. I have tried two different cables. I use the extention to get a little distance between the RF noisy computer and the stick and that has helped a lot on the range. It cannot the the signal integrity of the serial signal. It is slow and if I have bit errors then I will see that in the logs all the time. This is a transitional thing that only happens at restart of HA. Note that this happens both cold and warm restarts where only HA restarts. I think HA (or OZW) does not correctly init or close the serial device. I think the stick is left in a bad mode because it always cures itself after some seconds and the errors stop in the log. But unfortunately HA is not very robust and does not recover a bad start of the OZW. Once failed and the device never come up

fronzbot commented 5 years ago

Could this be an issue with the NUC? Either it's powering down the usb ports when there's no activity or something like that? I only ask because someone mention the NUC in one of those issues you linked. I did a little searching and found this thread talking about the NUC leaving connected USB devices in a weird state upon shutdown. Just a thought, I could be way off 😄

Anecdotal, but I've been running the same USB stick on HA since 0.33 without issue (at least, no issues like this). I also have a usb cord as an extension, similar in length to what you mentioned. Back then I was running with the AIO installer on a raspberry pi 3, now I'm running in a docker container on a home server.

mezz64 commented 5 years ago

First thing I would try, outside of checking this with openzwave control panel, is running the stick without an extension cable.

SNoof85 commented 5 years ago

Had this on my NUC too, moving the stick to a front USB port solved the problem...

KennethLavrsen commented 5 years ago

Will try that too. I also upgraded my BIOS. I had one incident. Problem is stats. When it fails it fails maybe 4-6 resets in a row. And then it runs fine for a week and then boom you have the problem again. It seems to happen more often when you restart HA many times in connection with adding things to the config. Could it be related to restarting while the stick is doing a lot of read and writes like during a restart? I rarely if ever saw a problem with a single restart. I remember it as something it does when I am working on a config change and restart many times. Could that leave the stick in limbo? With respect to front vs back. The BIOS allows to enable higher charge currents on selected ports. I enabled it on all. I will try the opposite. I know the extension is not the issue because I tried without and I have measured that the voltage is OK at the other end. I have all the right equipment for that (hardware engineer)

KennethLavrsen commented 5 years ago

OK I have upgraded the NUC BIOS I have put the Aeotec stick in a front connector yesterday and restart HA many times last night. The first thing that happens after the restart in connection with 0.88.1 upgrade it all fails again with the same error message as earlier It took 4 reboots incl a cold power down and up again to get Z-wave back working. I do not believe in any NUC or hardware related issues. There is something wrong with the way HA (or OZW starts and/or shuts down that stick. And it is important to note the behaviour where it take not one but multiple restarts to get it working again. And then it can work for a week.

And it never ever fails once it is running. So communication errors on the serial connection is out of the question because they would happen randomly. This is either an init problem or a shutdown problem or linked to starting/stopping while communication is going on. That would explain why it is so difficult to restart because that is where there is tons of communication with the stick for quite a long time while the network builds up.

The only thing I find strange is that the issue can be persistant even after a cold boot where I can see the light in the stick go off (USB power off) and on again. But we have to remember that the stick is battery powered so mayby it starts emitting old data the minute there is someone that listens. Maybe it is data pending to be sent from earlier activities from the stick that HA cannot handle? If you note the logs then the errors is never missing communication but garbage data

daniwebCH commented 5 years ago

I have same issue, this after it has for some reasons swapped ACM0 and ACM1. I tried multiple reboots, restarts... not better...

Any other idea?

daniwebCH commented 5 years ago

I could solve my problem by changing the path to .storage/core.config_entries

KennethLavrsen commented 5 years ago

Just a note to this does not get closed for te wrong reason. The bug I have reported as nothing to do with wrong path to the device. I have confirmed that the device is the /dev/ttyACM0 as setup. I check that every time I see the failure mode. It is only when you either add additional sticks (like a Zigbee sniffer) or pull out the stick and plug it back that the device name changes. And if this happens randomly then you either have to specify by device ID or create a udev rule to set fixed device names. It is easy to find how to on Google.

I just say this to emphasize that this bug item is not resolved by changing device path in a file. This bug report relates to the issue with HA is failing on garbage data received from stick during boot and restart as I have described it above.

I keep on working on finding more data to aid the developers. If we say it is the stick flushing old data (which will be hard to fix if it is a bug in the Aeotec) a work around may be if HA in case of failure waits a while and makes a new attempt to start all over with the init. Since this only happens at restart/boot it would not have to retry forever. Just once may be enough.

uchagani commented 5 years ago

@KennethLavrsen can you try waiting for Home Assistant to fully restart before starting the zwave network? https://www.pulse-station.com/ufaqs/zwave-logs-warning-out-of-frame-flow-sending-nak/ explains a situation where the stick is not ready to accept commands and can lock up the stick. Perhaps we need a way to prevent zwave network from starting automatically and then do a delayed start like some of the other components

KennethLavrsen commented 5 years ago

The past week I have not seen the problem but that is also because I restart less frequently. But I will like to try next time the situation happens. But please explain - how do I prevent the zwave network from starting by itself at restart? I cannot see any configs for that

uchagani commented 5 years ago

Looks like we don't have the ability to do a delayed start for zwave. Maybe this is something we can get added.

Snipercaine commented 5 years ago

Not sure if this might help. Homekit has a semi-similar problem. If a delay after HA restart is the problem perhaps an automation like this may help:

KennethLavrsen commented 5 years ago

Two things. First having this automation .. will that prevent the start of the stick too early? Or just restart the service?

Second, is there a service similar to home kit that can start or restart the zwave network? A previous reply indicates that there is not.

fronzbot commented 5 years ago

On your second question, zwave DOES have a zwave.start and zwave.stop service, although I have never personally used them do can't say if they work or not. I don't think those services would help you delay a zwave start, but maybe it's worth trying?

uchagani commented 5 years ago

As far I know, we still need a way to delay the start of it. We have the ability to start/stop but need a way to delay.

On Fri, Mar 22, 2019 at 7:24 AM Kevin Fronczak notifications@github.com wrote:

On your second question, zwave DOES have a zwave.start and zwave.stop service, although I have never personally used them do can't say if they work or not. I don't think those services would help you delay a zwave start, but maybe it's worth trying?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/home-assistant/home-assistant/issues/21165#issuecomment-475584431, or mute the thread https://github.com/notifications/unsubscribe-auth/ABATkBkPSqTc6ttns937MPbA_1s47RG-ks5vZL2FgaJpZM4bASEN .

olbjan commented 5 years ago

I have a setup with a NUC running hassos and the Aeotec Gen5 stick. No issues so far but we will be adding about 30 devices to this setup over the next week. I'll try to see if this changes the behaviour.

snowflakeHA commented 5 years ago

I have exactly the same problem on a similar hardware configuration and installation. It seems to be an issue in the shutting down of HA that causes the issue when it restarts.

I've found that the following manipulation is solving the problem:

geoffreys commented 5 years ago

Can confirm same problem on HASSIO Docker image v0.92.2 on Ubuntu NUC. Same Aeotec gen5 zstick USB + zwave network. Was fine on rpi3b+, moved to nuc and now every few reboots (or restarts of HA via General menu) results in zwave startup issues.

iwoloschin commented 5 years ago

I've got a ASRock KBL-NUC that is exhibiting a similar problem with an Aeotec Gen5 Stick (Ubuntu base with hassio). Except it's more like 1 out of 10 starts results in a working ZWave system. Powering down the whole system, or just removing the ZWave Stick seems to help a bit, but it's not a real fix. This has been a problem for several versions now, and is still a problem with 0.93.1.

On the plus side, once it's working it works well until the next time HA restarts. It just makes me dread restarting HA for any reason.

Twisted7ech commented 5 years ago

I'm running an Intel-NUC NUC5CPYH, ubuntu 19.04, docker 18.09.5, hassio 0.93.1 also with the Aeotech gen 5 stick. Having the same experience with Zwave after migrating from Rpi. From my troubleshooting I have seen that even when the Zwave nodes are not showing up, activity shows in the zwave logs. Occasionally there are no nodes present under Z-Wave Network Management, other times there is only Node1 with an unknown status.
Some of the time I am able to stop and restart the z wave network from within Z-wave management page, other times each button responds with a failed message. I have tried rebooting from command line, gui and portainer, and none seem to have a consistant working z-wave network.
Also, once it does boot correctly with z-wave, no issues until a reboot then 50/50. Wondering if the migration of the network is common with others having this issue, or is your z-wave network built while running on your current hardware? If a migration from other hardware is the common trend, then I plan to delete zwcfgxxxx and start factory resetting and re adding nodes fresh.

edit; I did another reboot and under ZWNM only node1:unknown(complete) appeared in the list, but I was able to stop the network, and then start the network with all nodes showing up correctly without having to reboot HA.

geoffreys commented 5 years ago

This issue was resolved for me by updating the NUC to latest BIOS version. Last thing I would have expected to be the cause!

Fishwaldo commented 5 years ago

as mentioned in discord - "Out of Frame Flow messages" are 99.99999% of times this is caused by another program accessing the same serial device as OZW (/dev/ttyUSB0 etc). I believe there was some terminal addon in Hass.io that conflicted with this previously.

You can try running lsof <path to USB device> in the container/host etc of your gateway to see what else is opening the port.

Some previous users have reported Getty (Serial Console), NUT (UPS monitoring daemon) to be other applications/services running that. I've even seen some users restart HASS, and the old copy of HASS is still shutting down when the new copy is starting up, causing two copies OZW (for a small period of time) accessing the same serial port.

Fishwaldo commented 5 years ago

also - due to symlinks etc in the /dev/ tree - lsof /dev/ttyUSB0 etc might not always return everything. you should also try 'lsof | grep /dev/' to see help filter. (eg, if you use the /dev/serial/by-id/ paths instead)

KennethLavrsen commented 5 years ago

I doubt that it is another program that access the serial because why would that only happen when a docker container with hass.io restarts? I would expect sporatic errors randomly if there is some UPS service playing along. I could also expect something like this when you boot the machine but this is happening both at cold boot and at restart of HA. I am more thinking that it could be HA that closes down the serial port and then start up again before the old communication is over.

Someone said that the Aeotec stick is know to continue sending data when the z-wave net is active.

The probability that the error occurs is much higher when you restart many times at short interval (the typical situation where you try to get a new thing setup in configs). In this situation the Z-wave net is restarted again and again and will be talking to the stick while HA is restarting again.

I think the HA devs should try and make a trial with a delayed start of Z-wave including delaying access to the serial to allow the old HA instance to die and the communication on the stick to die. It would also be good if HA would retry a minute later if the init fails.

Two instances of OZW could explain why I have seen multiple failures that I could not recover from without restoring the zwcfg... file from backup. I have had several situations where many many restarts and cold reboots never recovered and I had to resort to a snapshot. For the same reason I make many snapshots when I change something

Adminiuga commented 5 years ago

just do sudo lsof | grep USB and check. Bet you have ModemManager hogging the port from time to time. ZHA had same issues.

Fishwaldo commented 5 years ago

I will happily place money that the issue is something else is opening/opened the serial port when you get these Out Of Frame Messages. it may be a old copy/container of hassio shutting down, it may be something else... but every single time I have seen it (hundreds maybe?) its been duplicate copies of a application, or a 3rd party application.

as far as your question goes about only happening upon restarts? Some applications will try, and error out as well, so it happens upon "booting" the container and the other program errors out (because its not talking to what it thinks it is) and exit.

If your handy with Linux, something like bpftrace -e 'tracepoint:syscalls:sys_enter_open { printf("%s %s\n", comm, str(args->filename)); }' running when you restart your containers should show you what is opening what on your system.

(run that on your host, not in the container)

KennethLavrsen commented 5 years ago

I gave up on Hassio and Z-wave and moved to Zwave2MQTT so I cannot test any more

I did try the lsof before moving my zwave to zwave2mqtt and it showed a long list of hassio process number but never any third party stuff

It would be strange if the situation was that every time a hassio docker instance restarts some 3rd party process would wake up and talk on the serial and then never again. A third party program from the host OS would either do it at host boot or randomly all the time. This only happens when hassio is restarted.

Let us wrap up the facts

Remember that the Aeotec is a battery powered device that seems to have some life even when unplugged and it may start talking on the serial the minute the serial is up. There may not even be any flow control

My theory is that the stick continues to talk on the serial during restarts. Maybe not for long but long enough to confuse the new instance of hassio. Additionally I think hassio may not close down OZW very nicely. Maybe the communication with stick is not shut down correctly. Maybe that is also what causes a corrupt OZW data file.

There is something hassio does not do right with open zwave. And it seems the Raspberry Pi is simply so slow that it does not matter. Or the USB chipset on Raspberry handles things differently when a process dies.

I would look at how hassio shuts down open zwave and how it handles failure modes during startup.

I would also look for multiple hass.io processes trying to talk to the Aeotec via OZW at the same time. And again timing seems to be important since the slower raspberry does not show the problem

Adminiuga commented 5 years ago

@KennethLavrsen yes, it happens after restarts because hassio HA container is triggering udev, which in turn may trigger other applications like Ubuntu's Modem manager.

KennethLavrsen commented 5 years ago

Hmm. That is a good idea. Unfortunately I no longer have a HA with an Aeotec stick but if someone still having the HW and the problem could try this. I assume the thing to do is to disable modem manager in Ubuntu host. I cannot test these but here is what I found on Google search

https://linux-tips.com/t/prevent-modem-manager-to-capture-usb-serial-devices/284

https://askubuntu.com/questions/399263/udev-rules-seem-ignored-can-not-prevent-modem-manager-from-grabbing-device

https://www.freedesktop.org/software/ModemManager/api/latest/ref-overview-modem-filter.html

https://stackoverflow.com/questions/24696527/modem-manager-and-ttyacm-in-use

There are plenty more. If this is the root cause then there is still a todo for HA because it should simply wait and try again if it fails like it does with so many other devices. The most primitive work around would be to wait one minute and try again.

dgooglr commented 5 years ago

Hi. So I have now tried this: https://linux-tips.com/t/prevent-modem-manager-to-capture-usb-serial-devices/284

(https://askubuntu.com/questions/399263/udev-rules-seem-ignored-can-not-prevent-modem-manager-from-grabbing-device) I believe this is really the way to go..

But what I ended up doing was systemctl disable ModemManager.service

Have now restarted NUC 2 times, and many restarts of Hassio. Seem to work fine. Will continue testing it

warcanoid commented 5 years ago

@dgooglr found the solution! I had this problems from start with zwave on NUC. Now I have restared about 20x HA, rebooted NUC 10x, power off on NUC 5x, fast restart of HA-without wait to zwave ready 15x, and zwave works! This reboot causes before the lost zwave! HA also load faster now, very fast! And no more slow start of zwave!

KennethLavrsen commented 5 years ago

I decided to put my zwave back on HA. I need to restart many many more times or a period of time to confirm the fix. So far it looks good.

The fix was from the host Ubuntu to diable modem manager with systemctl disable ModemManager and then just reboot the whole machine

I will report back. If this works out the short term is to get this in the Zwave documentation and hopefully later someone can improve HA so it retries after a minute if the startup of Zwave fails. It was @Adminiuga saying the word ModemManager that made this breakthrough

warcanoid commented 5 years ago

We have the fix now! Finnaly!

KennethLavrsen commented 5 years ago

Since I disabled Modem Manager I have not had a single event of startup problems after boot or restart.

The easy work around is to disable Modem Manager.

systemctl disable ModemManager.service

The documentation for HA has had this addressed now but the documentation suggests to purge the ModemManager from the system. This however can have negative effects on distros with Gnome desktops so best advice is to disable the service.

I still believe HA should be coded to take an approach that allows people that may need ModemManager or simply have not seen or understood the documentation. Modem Manager only runs for a few seconds so HA could simply retry maybe 30 seconds after a failed init and the user would be happily ignorant about the issue. There are so many other integrations that have mechanisms to reconnect when the link fails. That would be the best solution. But at least with the documentation update people can now use the Aeotec Z stick.

thundergreen commented 5 years ago

Thanks people for figurng this out. Something funny: I had deconz installed and accidently used the same dev path for the deconz stick which provoked my error messages .. Now with disabled modemManager and correct configuration of my deconz USB0 device everyting is back to normal :P

electrofloat commented 5 years ago

So the documentation was fixed in: https://github.com/home-assistant/home-assistant.io/pull/9613 for Debian/Ubuntu based systems, and in https://github.com/home-assistant/home-assistant.io/pull/9638 for Hass.io, Docker.

Imho this issue should be closed.

KennethLavrsen commented 5 years ago

Yes. I am OK with that.

ondranen commented 4 years ago

Hi guys, i've got similar issue with Raspberry 3. However there is no ModemManager installed. Is there any other link or solution? Thanks for your help. regards ondrej